I followed this question to calculate feature importance for a single decision tree: scikit learn - feature importance calculation in decision trees
However, I can't reproduce the feature importance values for a random forest. For example, I fit a random forest with the sklearn package in Python:
clf = RandomForestClassifier(n_estimators=2, max_features='log2')
clf.fit(X_train, y_train)
feature_imp = pd.Series(clf.feature_importances_, index = features_id).sort_values(ascending=False)
feature_imp_each_tree = [tree.feature_importances_.T for tree in clf.estimators_]
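For reference, here is a self-contained version of the snippet above. Since `X_train`, `y_train`, and `features_id` aren't shown, it substitutes a synthetic dataset from `make_classification` (an assumption for illustration only). It also checks the property the rest of this question relies on: in sklearn, the forest-level importances are the mean of each tree's (already normalized) importances.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-ins for X_train / y_train, which aren't shown above.
X_train, y_train = make_classification(
    n_samples=100, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
features_id = ["#1", "#2", "#3"]

clf = RandomForestClassifier(n_estimators=2, max_features="log2", random_state=0)
clf.fit(X_train, y_train)

feature_imp = pd.Series(
    clf.feature_importances_, index=features_id
).sort_values(ascending=False)
feature_imp_each_tree = [tree.feature_importances_ for tree in clf.estimators_]

# Each tree's importances already sum to 1; the forest value is their mean.
assert np.allclose(
    clf.feature_importances_, np.mean(feature_imp_each_tree, axis=0)
)
```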
This reports forest-level feature importances of #1: 0.1875, #2: 0.3313, #3: 0.4813.
The feature importances of each individual tree are:
left tree: #1: 0.375, #2: 0.5625, #3: 0.0625
right tree: #1: 0, #2: 0.1, #3: 0.9
Then I followed the steps from that answer to calculate the weighted impurity decrease per feature:
feature #1 on left tree: (2/4)*(0.5-0-0)=0.25
feature #2 on left tree: (2/4)*(0.38-0-0)=0.19
feature #3 on left tree: (4/4)*(0.44-0.38*2/4-0.5*2/4)=0
feature #1 on right tree: 0
feature #2 on right tree: (5/5)*(0.28-0.38*3/5-0)=0.052
feature #3 on right tree: (3/5)*(0.38-0-0)=0.228
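The arithmetic above is the standard weighted impurity-decrease formula. A minimal sketch for the left tree, using the node sizes and Gini impurities from the steps above (the child node sizes for features #1 and #2 are assumed to be pure leaves of one sample each, which is consistent with the `-0-0` terms):

```python
def split_importance(n_node, n_total, impurity, children):
    """Weighted impurity decrease of one split.

    children: list of (n_child, child_impurity) pairs.
    Formula: (N_t / N) * (impurity - sum(N_c / N_t * imp_c)).
    """
    child_term = sum(n_c / n_node * imp_c for n_c, imp_c in children)
    return (n_node / n_total) * (impurity - child_term)

# Left tree (node sizes / Gini impurities as in the hand calculation above):
f1_left = split_importance(2, 4, 0.50, [(1, 0.0), (1, 0.0)])    # ≈ 0.25
f2_left = split_importance(2, 4, 0.38, [(1, 0.0), (1, 0.0)])    # ≈ 0.19
f3_left = split_importance(4, 4, 0.44, [(2, 0.38), (2, 0.50)])  # ≈ 0.0
```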
I know the importances within each tree need to be normalized to sum to 1, so after normalizing (importance/sum):
sum = 0.25+0.19+0 = 0.44
feature #1 on left tree: 0.25/0.44 = 0.5682
feature #2 on left tree: 0.19/0.44 = 0.4318
feature #3 on left tree: 0/0.44 = 0
sum = 0+0.052+0.228 = 0.28
feature #1 on right tree: 0/0.28 = 0
feature #2 on right tree: 0.052/0.28 = 0.186
feature #3 on right tree: 0.228/0.28 = 0.814
and then averaged across the two trees:
feature #1: (0.5682+0)/2 = 0.2841
feature #2: (0.4318+0.186)/2 = 0.3088
feature #3: (0+0.814)/2 = 0.4071
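The normalize-then-average procedure above can be sketched in a few lines of numpy, starting from the hand-computed raw impurity decreases:

```python
import numpy as np

# Raw weighted impurity decreases computed by hand above.
left_raw = np.array([0.25, 0.19, 0.0])
right_raw = np.array([0.0, 0.052, 0.228])

# Normalize within each tree so its importances sum to 1 ...
left_norm = left_raw / left_raw.sum()
right_norm = right_raw / right_raw.sum()

# ... then average across the two trees.
forest_imp = (left_norm + right_norm) / 2
print(forest_imp.round(4))  # [0.2841 0.3088 0.4071]
```

These averaged values sum to 1, but they still don't match the 0.1875 / 0.3313 / 0.4813 that `clf.feature_importances_` reports, which is the discrepancy this question is about.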
The ranking is correct, but the values don't match `clf.feature_importances_`. Please help me understand how to calculate them, thanks.