
I followed this question to calculate feature importance for a decision tree: scikit learn - feature importance calculation in decision trees

However, I can't get the correct values when I calculate feature importance for a random forest. For example, I fit a random forest with code like this (using the sklearn package in Python) and look at its trees:

from sklearn.ensemble import RandomForestClassifier
import pandas as pd

clf = RandomForestClassifier(n_estimators=2, max_features='log2')
clf.fit(X_train, y_train)
feature_imp = pd.Series(clf.feature_importances_, index=features_id).sort_values(ascending=False)
feature_imp_each_tree = [tree.feature_importances_.T for tree in clf.estimators_]

(images: the two fitted trees of the forest)

From this I get the forest-level feature importances: #1: 0.1875, #2: 0.3313, #3: 0.4813.

The feature importances reported for each tree are:

left tree: #1: 0.375, #2: 0.5625, #3: 0.0625
right tree: #1: 0, #2: 0.1, #3: 0.9

I therefore followed these steps to calculate them myself:

feature #1 on left tree: (2/4)*(0.5-0-0) = 0.25
feature #2 on left tree: (2/4)*(0.38-0-0) = 0.19
feature #3 on left tree: (4/4)*(0.44-0.38*2/4-0.5*2/4) = 0

feature #1 on right tree: 0
feature #2 on right tree: (5/5)*(0.28-0.38*3/5-0) = 0.052
feature #3 on right tree: (3/5)*(0.38-0-0) = 0.228
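For reference, this is the calculation I am trying to follow, written out as code against the fitted tree arrays (my own sketch; the helper name tree_feature_importances and the final normalization reflect my understanding of the weighted impurity decrease, not code copied from sklearn):

import numpy as np

def tree_feature_importances(decision_tree, n_features):
    # my sketch of the weighted impurity decrease, using sklearn's fitted tree_ arrays
    t = decision_tree.tree_
    n = t.weighted_n_node_samples
    importances = np.zeros(n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf node: no split, no contribution
            continue
        decrease = (n[node] * t.impurity[node]
                    - n[left] * t.impurity[left]
                    - n[right] * t.impurity[right])
        importances[t.feature[node]] += decrease
    importances /= n[0]                     # scale by the number of samples at the root
    return importances / importances.sum()  # normalize so the importances sum to 1

Calling this on clf.estimators_[0] and clf.estimators_[1] is how I would expect to reproduce the per-tree numbers above.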

I know the importances within each tree need to be normalized so that they sum to 1, so after normalization (importance / sum):

sum = 0.25+0.19+0 = 0.44
feature #1 on left tree: 0.25/0.44 = 0.5682
feature #2 on left tree: 0.19/0.44 = 0.4318
feature #3 on left tree: 0/0.44 = 0

sum = 0+0.052+0.228 = 0.28
feature #1 on right tree: 0/0.28 = 0
feature #2 on right tree: 0.052/0.28 = 0.186
feature #3 on right tree: 0.228/0.28 = 0.814
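The same normalization in a couple of lines (just numpy applied to my unnormalized values above):

import numpy as np

left = np.array([0.25, 0.19, 0.0])     # my unnormalized left-tree values
right = np.array([0.0, 0.052, 0.228])  # my unnormalized right-tree values
left_norm = left / left.sum()          # -> [0.5682, 0.4318, 0.0]
right_norm = right / right.sum()       # -> [0.0, 0.186, 0.814]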

and then average across the two trees:

feature #1: (0.5682+0)/2 = 0.2841
feature #2: (0.4318+0.186)/2 = 0.3088
feature #3: (0+0.814)/2 = 0.4071
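Written out, this last step is just an element-wise mean of the two normalized vectors, which I expected to match clf.feature_importances_:

import numpy as np

left_norm = np.array([0.5682, 0.4318, 0.0])  # my normalized left-tree values
right_norm = np.array([0.0, 0.186, 0.814])   # my normalized right-tree values
my_avg = np.mean([left_norm, right_norm], axis=0)  # -> [0.2841, 0.3089, 0.407]
# but scikit-learn reports clf.feature_importances_ -> [0.1875, 0.3313, 0.4813]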

The ordering is correct, but the values aren't. Could someone please explain how these importances are actually calculated? Thanks.
