-1

In one of my projects, I was trying to determine which of my 12 features are the strongest driving factors for a target variable using RandomForestRegressor (sklearn). RandomForest conveniently gives you a list of feature importances that indicates which features best explain the target. But I was still unsure what max_features should be for my model, because the default is to use all features, which would make my model just a bagged ensemble of trees. After going through some discussions, it made sense to use n/3 as the maximum number of features if you really want a random forest of trees. I continued with n/3 because I was getting a pretty good r-squared.
Very recently I realized that my feature importances are completely different when I change max_features to n. If feature importances are really relative to each other (given that they all sum to 1), does it make sense for one of them to jump from 0.36 to 0.81 when I change the number of features from n/3 to n? So what should max_features be if I'm trying to determine the most explanatory variable, given that I'm getting a pretty good r-squared with both n/3 and n? I'm unable to figure out what I'm missing. Please suggest how to proceed. Thank you very much.
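For reference, here is a minimal sketch of the kind of comparison I mean, on synthetic data rather than my actual dataset (make_regression stands in for my 12 features; the numbers are illustrative only):

```python
# Minimal sketch: compare feature importances for max_features = n vs n/3.
# Synthetic data stands in for the real dataset; results are illustrative only.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=12, n_informative=6,
                       random_state=0)

for max_feats in (None, 12 // 3):  # None = all n features; 4 = n/3
    rf = RandomForestRegressor(n_estimators=200, max_features=max_feats,
                               random_state=0)
    rf.fit(X, y)
    print(max_feats, np.round(rf.feature_importances_, 2))
```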

ThReSholD
  • 668
  • 10
  • 15

1 Answer

1

Yes.

First scenario:

Assume there are two features, feat1 and feat2, which provide the same type of information to the model. If both are present in the data and the model picks feat1 first, the importance of feat1 will be large. When the model then analyzes feat2, it concludes that feat2 doesn't add any significant information beyond what feat1 already provides, so the importance of feat2 will be relatively small.
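Here is a minimal sketch of that effect, using synthetic data in which feat2 is a near-copy of feat1 (the names and numbers are purely illustrative):

```python
# First scenario: feat2 is a noisy copy of feat1. With all features
# available at every split (max_features=None), feat1 is slightly better
# and wins almost every split, so feat2 ends up with a small importance.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
feat1 = rng.randn(500)
feat2 = feat1 + 0.01 * rng.randn(500)  # nearly the same information as feat1
noise = rng.randn(500)                 # an unrelated feature
X = np.column_stack([feat1, feat2, noise])
y = 3 * feat1 + 0.1 * rng.randn(500)

rf = RandomForestRegressor(n_estimators=200, max_features=None, random_state=0)
rf.fit(X, y)
print(np.round(rf.feature_importances_, 2))  # feat1 dominates, feat2 is small
```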

Second scenario:

You changed max_features to n/3, and at many splits feat1 is now not among the candidate features. The information provided by feat2 is now more valuable than before, so its importance can increase significantly.
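And the same toy data with a restricted max_features (here max_features=1, the analogue of n/3 for 3 features; again just a sketch):

```python
# Second scenario: with max_features=1, each split considers a single
# randomly drawn candidate feature, so feat1 is often unavailable and
# feat2 absorbs importance that previously went to feat1.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
feat1 = rng.randn(500)
feat2 = feat1 + 0.01 * rng.randn(500)  # same setup as the sketch above
X = np.column_stack([feat1, feat2, rng.randn(500)])
y = 3 * feat1 + 0.1 * rng.randn(500)

rf = RandomForestRegressor(n_estimators=200, max_features=1, random_state=0)
rf.fit(X, y)
print(np.round(rf.feature_importances_, 2))  # feat2's share rises noticeably
```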

Note that this is for a single model; I don't know exactly how it plays out across the whole ensemble. You may be able to get more details on https://stats.stackexchange.com.

Luca Mastrostefano
  • 3,201
  • 2
  • 27
  • 34
Vivek Kumar
  • 35,217
  • 8
  • 109
  • 132
  • In your first scenario, why does the model (every decision tree) pick feat1 first every time, before feat2? In the second scenario, isn't there a case where feat2 isn't considered and feat1 explains the information that feat2 could explain? My question is more: how are these values really feature importances on a relative scale (given that they all sum to 1) if they keep changing drastically with a change in the number of features? – ThReSholD May 02 '18 at 15:59
  • 1
    @ThReSholD The averaging is done at the end. Secondly, the feature importance is calculated using all samples, not sample by sample. – Vivek Kumar May 03 '18 at 01:31