sklearn.tree.DecisionTreeClassifier: Get all samples that fell into leaf node

Question

For every sample, I want to determine the size of the leaf node it fell into.

Based on this excellent answer, I already figured out a way to extract the number of samples for each leaf node:

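from sklearn.tree import _tree, DecisionTreeClassifier

clf = DecisionTreeClassifier().fit(X_train, y_train)

def tree_get_leaf_sizes(tree):
    """Return {leaf node id: number of training samples in that leaf}."""
    tree_ = tree.tree_
    leaf_sizes = {}

    def recurse(node):
        if tree_.feature[node] != _tree.TREE_UNDEFINED:
            # internal node: descend into both children, not only the left one
            recurse(tree_.children_left[node])
            recurse(tree_.children_right[node])
        else:
            # leaf: n_node_samples is the number of training samples that reached it
            leaf_sizes[node] = int(tree_.n_node_samples[node])

    recurse(0)
    return leaf_sizes

leaf_sizes = tree_get_leaf_sizes(clf)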

Is there a way to get the indices of all samples (X_train) that ended up in a leaf node? A new column for X_train called "leaf_node_size" would be the desired output.

score 2 · Accepted Answer · edited Jan 17 '19 at 12:51

sklearn allows you to do this easily through the apply method, which returns the index of the leaf that each sample ends up in:

from collections import Counter

# get the leaf index for each training sample (clf is the fitted classifier)
leaves_index = clf.apply(X_train)

# use Counter to find the number of samples in each leaf
cnt = Counter(leaves_index)

# index each input to get the size of the leaf it fell into
elems = [cnt[x] for x in leaves_index]
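
A minimal, self-contained sketch of how the same idea could produce the "leaf_node_size" column asked for in the question, plus the row positions of the samples in each leaf. It assumes X_train is a pandas DataFrame. The iris data, max_depth=2, and the names leaf_ids, X_out and samples_per_leaf are illustrative choices, not part of the original post.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Illustrative data: iris stands in for the asker's X_train / y_train.
iris = load_iris(as_frame=True)
X_train, y_train = iris.data, iris.target

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

# Leaf node id for every training sample (one entry per row of X_train).
leaf_ids = clf.apply(X_train)

# Count the samples per leaf, then map each count back onto its samples.
unique_leaves, inverse, counts = np.unique(
    leaf_ids, return_inverse=True, return_counts=True
)

# Write the result to a copy so X_train keeps exactly the fitted features.
X_out = X_train.copy()
X_out["leaf_node_size"] = counts[inverse]

# Row positions (0-based) of the samples that ended up in each leaf.
samples_per_leaf = {
    int(leaf): np.flatnonzero(leaf_ids == leaf) for leaf in unique_leaves
}
```

Writing the column to a copy is deliberate: if "leaf_node_size" were added to X_train itself, a later clf.apply(X_train) or clf.predict(X_train) would see one feature more than the tree was fitted on.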

edited Jan 17 '19 at 12:51 by citizenfour

answered Oct 31 '18 at 09:14 by Gabriel M

Once again I am stunned by the simplicity and straightforwardness of the proposed solution :) – Boern Oct 31 '18 at 09:19
