For every sample, I want to determine the size of the leaf node it fell into.
Based on this excellent answer, I have already figured out how to extract the number of samples in each leaf node:
from sklearn.tree import _tree, DecisionTreeClassifier
import numpy as np

clf = DecisionTreeClassifier().fit(X_train, y_train)

def tree_get_leaf_size_for_elem(tree, feature_names):
    tree_ = tree.tree_

    def recurse(node):
        if tree_.feature[node] != _tree.TREE_UNDEFINED:
            # Internal node: descend into both children.
            recurse(tree_.children_left[node])
            recurse(tree_.children_right[node])
        else:
            # Leaf node: sum the per-class values stored for it.
            samples_in_leaf = np.sum(tree_.value[node][0])
            print(node, samples_in_leaf)

    recurse(0)

tree_get_leaf_size_for_elem(clf, feature_names)
Is there a way to get the indices of all samples (X_train) that ended up in a leaf node? A new column for X_train called "leaf_node_size" would be the desired output.
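To make the desired output concrete, here is a minimal sketch of one direction that might work. It assumes that `clf.apply` returns the leaf index for each row and that `tree_.n_node_samples` holds the number of training samples that reached each node. The synthetic data from `make_classification`, the column names, and the `samples_per_leaf` dictionary are stand-ins for illustration, not part of my actual setup.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption); the real X_train / y_train would take its place.
X, y_train = make_classification(n_samples=200, n_features=5, random_state=0)
X_train = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# For each row, the index of the leaf node it ends up in.
leaf_ids = clf.apply(X_train)

# Number of training samples that reached each node, looked up per row.
leaf_sizes = clf.tree_.n_node_samples[leaf_ids]

# Positional indices of the samples, grouped by leaf node.
samples_per_leaf = {
    int(leaf): np.flatnonzero(leaf_ids == leaf) for leaf in np.unique(leaf_ids)
}

# Added only after apply(), so the extra column is never fed to the tree.
X_train["leaf_node_size"] = leaf_sizes
```

The sizes only match the number of rows per leaf when the tree is applied to the same data it was fitted on; for other data, counting the occurrences of each value in `leaf_ids` would be needed instead.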