Feature Importance in Random Forests
Comparing Gini and Accuracy metrics
We’re following up on Part I where we explored the Driven Data blood donation data set. The objective of the present article is to explore feature engineering and assess the impact of newly created features on the predictive power of the model in the context of this dataset. The random forest model provides an easy way to assess feature importance. Depending on the library at hand, different metrics are used to calculate feature importance. We compare the Gini and Permutation importance metrics, both available in the R randomForest package (scikit-learn exposes only the Gini importance).
Feature Engineering consists of creating new predictors from the original data or from external sources, in order to extract or add information that was not available to the model in the original feature set.
Feature Engineering is an art in itself, and it is where human savoir-faire remains central in Data Science. Adding new features can result in worse performance, and cross-correlation between features can hinder the interpretation of feature importance.
Feature Selection consists of reducing the number of predictors. It is particularly important when the feature space is large and induces computational performance issues.
Feature selection can
- Improve the prediction performance of the model (by removing predictors with a ‘negative’ influence, for instance)
- Provide faster and more cost-effective implementations in contexts where datasets have thousands or hundreds of thousands of variables.
There are many Feature Selection techniques and algorithms which are all based on some form of assessment of the importance of each feature.
There are 3 ways of assessing the importance of features with regard to the model’s predictive power:
- Filter: Features are filtered out independently of the model, through criteria on their own properties (correlation with the target, for instance).
- Wrapper: “In essence, wrapper methods are search algorithms that treat the predictors as the inputs and utilize model performance as the output to be optimized.” see for instance Recursive Feature Elimination
- Embedded: “Built-in feature selection typically couples the predictor search algorithm with the parameter estimation and are usually optimized with a single objective function.” see feature Importance - Caret
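As an illustration of the Filter approach, here is a minimal sketch (in Python, with made-up data) that keeps only the features whose absolute correlation with the target exceeds a threshold, independently of any model:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Hypothetical data: x0 drives the target, x1 is pure noise.
x0 = rng.normal(size=n)
x1 = rng.normal(size=n)
y = x0 + 0.1 * rng.normal(size=n)

X = np.column_stack([x0, x1])
# Filter step: keep features whose |correlation with y| exceeds a
# threshold, without ever fitting a model.
corrs = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
keep = [j for j, c in enumerate(corrs) if c > 0.3]
print(keep)  # only x0 survives the filter
```

The threshold (0.3 here) is of course arbitrary; in practice it is tuned or replaced by a statistical test.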
Feature importance is also used as a way to establish a ranking of the predictors (Feature Ranking).
Questions and Answers
Besides the obvious question of how to actually engineer new features, the main questions around feature engineering revolve around the impact of the new features on the model.
- Will adding a predictor that has no (or some) significant impact by itself actually improve the model?
- Can removing features with low importance hurt the model?
- What is the result of adding highly correlated features into the feature space?
- Knowing that there are many different ways to assess feature importance, even within a model such as Random Forest, do assessments vary significantly across metrics?
- Can feature importance be overfitted to the training set on which it was assessed?
These questions have been addressed for the most part in the literature.
Guyon and Elisseeff An introduction to variable and feature selection - pdf have shown that
- “Better class separation can be obtained by adding variables that are presumably redundant (highly correlated)”
- “A variable that is completely useless by itself can provide a significant performance improvement when taken with others”
- “Two variables that are useless by themselves can be useful together.”
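That last point is easy to reproduce. The sketch below (synthetic data, with scikit-learn standing in for our R setup) builds an XOR target: each feature alone is useless, but both together are perfectly predictive:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2000
Xint = rng.integers(0, 2, size=(n, 2))
y = Xint[:, 0] ^ Xint[:, 1]  # XOR target: needs both features
X = Xint.astype(float)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
# Either feature alone carries no signal (~50% accuracy)...
acc_x0 = cross_val_score(clf, X[:, [0]], y, cv=5).mean()
# ...but together they determine y exactly.
acc_both = cross_val_score(clf, X, y, cv=5).mean()
print(round(acc_x0, 2), round(acc_both, 2))
```

A filter approach based on per-feature correlation with the target would discard both features here, which is exactly Guyon and Elisseeff’s point.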
Many studies of feature importance with tree based models assume the independence of the predictors. See Zhu et al. Reinforcement learning trees and Scornet et al. Consistency of random forests - pdf for instance.
However, empirical results show that both the number of correlated features and the strength of their correlation decrease feature importance.
There is no doubt that Feature correlation has an impact on Feature Importance!
In a recent article Correlation and variable importance in random forests, Gregorutti et al. carry out an extensive analysis of the influence of feature correlation on feature importance. They apply their findings to the Recursive Feature Elimination (RFE) algorithm for two types of feature importance measurement in Random Forests: Gini and Permutation.
Main findings are:
- Both Gini and Permutation importance are less able to detect relevant variables when correlation increases
- The higher the number of correlated features the faster the permutation importance of the variables decreases to zero
Correlation of features tends to blur the discrimination between features.
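This dilution effect is easy to demonstrate in the extreme case: duplicate an informative feature, and its Gini importance gets shared between the two copies. A minimal sketch with synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
informative = rng.normal(size=n)
noise = rng.normal(size=n)
y = (informative > 0).astype(int)  # target depends only on `informative`

# Fit with the informative feature plus noise...
X1 = np.column_stack([informative, noise])
imp1 = RandomForestClassifier(n_estimators=100, random_state=0).fit(X1, y).feature_importances_

# ...then add an exact copy of the informative feature:
# its Gini importance is now split between the two copies.
X2 = np.column_stack([informative, informative, noise])
imp2 = RandomForestClassifier(n_estimators=100, random_state=0).fit(X2, y).feature_importances_

print(imp1.round(2), imp2.round(2))
```

The total importance of the correlated group stays roughly the same; it is the per-feature share that collapses, which is why rankings become blurred.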
Gini vs Permutation
Several measures are available for feature importance in Random Forests:
Gini Importance or Mean Decrease in Impurity (MDI) calculates each feature’s importance as the sum of the impurity decreases over all splits (across all trees) that use the feature, weighted proportionally to the number of samples each split affects.
Permutation Importance or Mean Decrease in Accuracy (MDA) is assessed for each feature by removing the association between that feature and the target. This is achieved by randomly permuting the values of the feature and measuring the resulting increase in error. The influence of the correlated features is also removed.
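To make the two definitions concrete, here is a small Python sketch on synthetic data: the Gini importance (MDI) is scikit-learn’s built-in feature_importances_, and the Permutation importance (MDA) is computed by hand, shuffling one column at a time and measuring the drop in held-out accuracy:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # feature 2 is noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Gini importance (MDI) comes for free after fitting.
mdi = model.feature_importances_

# Permutation importance (MDA): shuffle one column at a time
# and measure the resulting drop in held-out accuracy.
base_acc = model.score(X_te, y_te)
mda = []
for j in range(X.shape[1]):
    X_perm = X_te.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    mda.append(base_acc - model.score(X_perm, y_te))

print(mdi.round(3), np.round(mda, 3))
```

Recent scikit-learn versions also ship a ready-made `permutation_importance` in `sklearn.inspection`; the loop above is only spelled out to show the mechanism.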
The scikit-learn Random Forest Library implements the Gini Importance.
The R Random Forest package implements both the Gini and the Permutation importance. In the case of classification, the R Random Forest package also shows feature importance for each class. This is useful to detect features that would degrade performance for a specific class while being positive on average.
See Gilles Louppe’s PhD dissertation for a very clear exposé of these metrics, their formal analysis, as well as R and scikit-learn implementation details.
Back to Blood Donations
In the context of the blood donation dataset, the original number of features is very limited: 3 in total, Recency, Frequency and Time. See Part I for an explanation of these variables.
From the reduced number of available features, we try to engineer new features to improve the predicting power of our Random Forest model.
We try different sets of new features and measure their impact on cross validation scores using different metrics (logLoss, AUC and Accuracy).
We discuss the influence of correlated features on feature importance. Since we are only creating features from the original set, many new features will have high cross-correlation.
Our setup is the following. The code can be found here
- We use R Caret and Random Forest Packages with logLoss.
- 20% of the train data set is set aside as a hold out dataset for final model evaluation.
- We run the simulations 10 times with different seeds to average over different hold out sets and avoid artefacts particular to specific held out samples.
For each seed / simulation:
- We carry out a 10 fold validation repeated 10 times for cross validation.
- We use the Caret package for cross validation and to optimize the random forest with respect to the number of variables sampled at each split (mtry)
- We train a random forest model (RandomForest R package not Caret) with the train set and the mtry value obtained previously.
- We calculate the Accuracy, AUC and logLoss scores for the test set.
- We record the feature importance for both the Gini Importance (MDI) and the Permutation Importance (MDA).
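For readers who prefer Python to R, the protocol above could be sketched roughly as follows with scikit-learn (synthetic data stands in for the blood donation set, max_features plays the role of mtry, and fewer seeds and folds are used to keep it quick):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold, train_test_split
from sklearn.metrics import log_loss, roc_auc_score, accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)  # stand-in for the blood donation data

scores = []
for seed in range(3):  # 10 seeds in the article; 3 here for speed
    # Set 20% aside as the hold out set for this seed.
    X_tr, X_ho, y_tr, y_ho = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    # Repeated k-fold CV to tune mtry (10 folds x 10 repeats in the article).
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=seed)
    grid = GridSearchCV(
        RandomForestClassifier(n_estimators=50, random_state=seed),
        param_grid={"max_features": [1, 2, 3]},  # mtry candidates
        scoring="neg_log_loss", cv=cv).fit(X_tr, y_tr)
    # Score the refit best model on the hold out set.
    best = grid.best_estimator_
    p = best.predict_proba(X_ho)[:, 1]
    scores.append({"mtry": grid.best_params_["max_features"],
                   "acc": accuracy_score(y_ho, best.predict(X_ho)),
                   "auc": roc_auc_score(y_ho, p),
                   "logLoss": log_loss(y_ho, p)})
print(scores[0])
```

Averaging the per-seed dictionaries then gives the fractional mtry values reported in the tables below.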
Our different sets of features are:
- Baseline: The original set of features: Recency, Frequency and Time
- Set 1: We take the log, the sqrt and the square of each original feature
- Set 2: Ratios and multiples of the original set
- Set 3: Domain knowledge. We interpret the original feature set to try to segment the population according to different donor behaviors:
  - Single time donors (144 people) are people for whom Recency = Time
  - Regular donors are people who have given at least once every N months for longer than 6 months. They are assigned their average number of donations
- Combinations: We combine these different sets: 1 + 2, 1 + 3, 2 + 3 and 1 + 2 + 3
- Most Important Features: Finally, we define the set composed of only the most important features (by MDI and MDA), “most important” being the features whose importance is above the median of all features.
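In pandas, building these sets could look like the sketch below (the rows and the derived column names are illustrative, not taken from the actual dataset; the article’s real code is in R):

```python
import numpy as np
import pandas as pd

# Hypothetical rows mimicking the blood donation features.
df = pd.DataFrame({"Recency": [2, 4, 2],
                   "Frequency": [50, 13, 1],
                   "Time": [98, 28, 2]})

# Set 1: log, sqrt and square of each original feature.
for col in ["Recency", "Frequency", "Time"]:
    df[f"log{col}"] = np.log1p(df[col])   # log1p to stay safe at zero
    df[f"sqrt{col}"] = np.sqrt(df[col])
    df[f"sq{col}"] = df[col] ** 2

# Set 2: ratios and multiples of the original features (examples).
df["Freq_per_Time"] = df["Frequency"] / df["Time"]
df["Recency_x_Freq"] = df["Recency"] * df["Frequency"]

# Set 3: domain knowledge, e.g. single time donors (Recency == Time).
df["SingleDonor"] = (df["Recency"] == df["Time"]).astype(int)

print(df.columns.tolist())
```

By construction, most of these columns are strongly correlated with their parent feature, which is exactly the situation discussed above.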
Baseline: original features
For the 3 original features we have the following scores:
| | mtry | Acc | AUC | logLoss | logLossCV |
| ---------- | ---- | ------ | ------- | ------- | ------- |
| Baseline | 1 | 0.7719 | 0.6950 | 0.9830 | 0.9413 |
Feature Importance as computed via the Random Forest package on the hold out set is:
| | Permutation | Gini |
| ---------- | :----: | :------: |
| Recency | 0.0301 | 32.21 |
| Frequency | 0.0540 | 34.08 |
| Time | 0.0324 | 40.00 |
We see that feature importance differs between Gini, which has Time as the most important feature, and Permutation, which has Frequency as the most important feature.
It is worthwhile to note that Frequency and Time are correlated (0.61) which could explain why Gini picked one feature and Permutation the other.
Set 1: Log, sqrt, square
And we notice a significant improvement in the logLoss metric:
| | mtry | Accuracy | AUC | logLoss | logLossCV |
| :---------- | :------: | :------: | :-------: | :-------: | :-------: |
| Baseline | 1 | 0.7719 | 0.6950 | 0.9830 | 0.9413 |
| Set 1 | 1 | 0.7605 | 0.6905 | 0.8665 | 0.9139 |
As can be seen, feature importance is now divided among the original feature and its 3 derived variants. It is as if the information contained in the original feature (Time, for instance) were now spread out among all 4 variants of that feature (Time, sqTime, logTime and sqrtTime).
However, Gini and Permutation now share the same top 5 features, based on Time, although in a different order and with different weights.
(Table: feature importance rankings side by side, Permutation vs Gini; the weights have been scaled.)
Some features are highly correlated even when they are not derived from the same original feature.
We ran the simulation for all sets taken separately and combined. The final scores are:
| | Accuracy | AUC | logLoss | logLossCV | mtry |
| ---------- | ------ | ------ | ------- | --------- | ---- |
| Set 2 + 3 | 0.7561 | 0.6888 | 1.1221 | 0.9121 | 3.7 |
| Set 1 + 3 | 0.7754 | 0.7028 | 0.9896 | 0.9231 | 1.0 |
| Set 1 + 2 | 0.7544 | 0.6866 | 0.9981 | 0.9066 | 4.2 |
Most Important Features:
We averaged 10 different simulations taking different seeds each time, hence the decimal values of mtry. logLoss is obtained on the hold out set while logLossCV is obtained during cross validation. Accuracy and AUC are calculated on the hold out set. The Gini (resp. Permutation) set consisted of the features whose importance was above the median feature importance.
We can make the following observations on logLoss score:
- Set 1 improves the model both on the hold out set (logLoss) and the CV score (logLossCV)
- Set 2 and Set 3 do not. They both perform a bit worse than the baseline
- No combination or selection of features improves the logLoss on the hold out set
- The best logLossCV is obtained by combining all sets
- The last set (Imp Permutation) composed of the most important features assessed via Permutation beats the benchmark for the cross validation logLossCV.
Acc and AUC
No significant impact on Accuracy or AUC from any of the sets, their combinations or selections. The differences are within 1 to 2% of the original feature set.
In terms of feature importance, Gini and Permutation are very similar in the way they rank features.
- 4 out of the top 5 features are the same
- features with lowest importance are the same
- Permutation (Accuracy) gives more importance to the 2 least important features than Gini does
The impact of this difference can be observed by comparing the Permutation most important set with the Gini most important set: Gini requires a higher level of mtry (5.3 vs 1.8; mtry is averaged over the 10 different runs / seeds, hence the decimal values). Gini needs to capture higher levels of feature interaction and seems to struggle more. Permutation appears to capture importance better, although the difference is small.
However when selecting the most important features for Gini and Permutation the test set logLoss is comparable.
Overfitting: one noticeable thing is the difference between logLoss and logLossCV, i.e. the logLoss on the hold out set and the logLoss obtained during cross validation.
The cases where the reduction in logLossCV is not matched by a reduction in logLoss probably indicate overfitting of the training set. For instance, when aggregating all sets (Set 1 + 2 + 3), logLoss is roughly equal to the baseline logLoss while logLossCV is reduced by 7.5%.
Although we did not end up with a major improvement on the original score by adding newly engineered features, some interesting phenomena were observable.
1) Correlation between predictors diffuses feature importance.
This is obvious when we use Set 1 and observe that the most important feature (Time for Gini and Frequency for Permutation) is now divided between the 3 new features (logTime, sqrtTime and sqTime).
2) The effects of feature set combination on the hold out set score look very linear: a better set associated with a worse set ends up with an average score (see Set 1 + 2 and Set 1 + 3).
3) However, non-linear effects of feature combinations are visible in the cross validation score. For instance, the logLossCV of Set 1 + 2 is better than the score of either Set 1 or Set 2 alone.
4) Feature rankings and relative weights end up being very similar when used to select a subset of most important features. No significant difference can be observed in the model’s predictive power.
If you liked this post, please share it on Twitter, and leave me your feedback, questions, comments and suggestions below. Much appreciated :)