
Following this tutorial on feature importance, I am trying to compute the feature importances of a random forest tree by hand.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target


X = df.loc[:, df.columns != 'target']
y = df.loc[:, 'target'].values


X_train, X_test, Y_train, Y_test = train_test_split(X, y, random_state=0)


rf = RandomForestClassifier(n_estimators=1,
                            max_depth=2,
                            max_features=2,
                            random_state=0)
rf.fit(X_train, Y_train)
rf.feature_importances_
array([0.        , 0.11197953, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.88802047, 0.        , 0.        , 0.        ])
fn = data.feature_names
cn = data.target_names
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(4, 4), dpi=800)
tree.plot_tree(rf.estimators_[0],
               feature_names=fn,
               class_names=cn,
               filled=True);
fig.savefig('rf_individualtree.png')

I then tried to calculate the feature importances of this single tree by hand (sklearn's result above: 0.11197953 and 0.88802047):

a = (192/265)*(0.262-(68/192)*0.452-(124/192)*0.103) 
b = (265/265)*(0.459-(192/265)*0.262-(73/265)*0.185)+(73/265)*(0.185-(72/73)*0.173)

print(b/(a+b))
print(a/(a+b))
0.8625754868011606
0.13742451319883947

Which part did I get wrong, so that my result differs from sklearn's? Or does sklearn just not follow the formula?


1 Answer


You have a couple of problems:

  1. Rounding error
  2. Math, specifically calculating the probability of reaching a node

Once you correct them, you'll get sklearn's result:

print(rf.estimators_[0].tree_.impurity)

array([0.45899182, 0.26172737, 0.10250188, 0.45244126, 0.18549346,
       0.17300567, 0.        ])

# Weighted impurity decrease at each splitting node; every node
# probability is taken relative to the 426 training samples at the root.
n1 = 0.45899182261015226 - (310/426)*0.26172736732570234 - (116/426)*0.1854934601664685  # root split
n2 = (116/426)*0.1854934601664685 - (115/426)*0.17300567107750475  # right-child split (one leaf, so one term)
n3 = (310/426)*0.26172736732570234 - (203/426)*0.10250188065713806 - (107/426)*0.45244126124552364  # left-child split
f1 = n1 + n2  # total decrease for the feature used at the root and right child
f2 = n3       # total decrease for the feature used at the left child
print(f1/(f1+f2), f2/(f1+f2))

(0.888020474590027, 0.11197952540997297)
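The same arithmetic can be done programmatically for any fitted tree. Here is a sketch that walks the tree's public `tree_` arrays and accumulates the normalized weighted impurity decrease per feature (this is my own loop over those attributes, not sklearn's internal implementation, but it reproduces `feature_importances_` for this single-tree forest):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, Y_train, Y_test = train_test_split(
    data.data, data.target, random_state=0)

rf = RandomForestClassifier(n_estimators=1, max_depth=2,
                            max_features=2, random_state=0)
rf.fit(X_train, Y_train)

t = rf.estimators_[0].tree_
importances = np.zeros(X_train.shape[1])
n = t.weighted_n_node_samples
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf: no split, contributes nothing
        continue
    # Weighted impurity decrease of this split; all node weights are
    # relative to the root's sample count n[0].
    decrease = (n[node] * t.impurity[node]
                - n[left] * t.impurity[left]
                - n[right] * t.impurity[right]) / n[0]
    importances[t.feature[node]] += decrease

importances /= importances.sum()  # normalize to sum to 1
print(importances)
```

The printed array matches `rf.feature_importances_` for this tree.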

(You may read more on how importance is calculated here, from the package developers, or here, in the source code.)

Note as well that what a RandomForest considers important may not be important for another model (and vice versa); "importance" here is model-specific, and may not match the intuitions of people more accustomed to linear explainability.

Sergey Bushmanov