Is there any way to get the samples under each leaf of a decision tree?

Question

I have trained a decision tree using a dataset. Now I want to see which samples fall under which leaf of the tree.

From here I want the red circled samples.

I am using scikit-learn's decision tree implementation in Python.

This: https://stackoverflow.com/questions/32506951/how-to-explore-a-decision-tree-built-using-scikit-learn and this: https://stackoverflow.com/questions/20224526/how-to-extract-the-decision-rules-from-scikit-learn-decision-tree/42227468#42227468 may be relevant. — Miriam Farber, Jul 30 '17 at 10:40

Maximilian Peters · Accepted Answer · 2019-09-05T14:49:11.953


If you only want the leaf that each sample ends up in, you can use apply(), which returns one leaf node id per sample:

clf.apply(iris.data)

array([ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 14, 5, 5, 5, 5, 5, 5, 10, 5, 5, 5, 5, 5, 10, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 16, 16, 16, 16, 16, 16, 6, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 8, 16, 16, 16, 16, 16, 16, 15, 16, 16, 11, 16, 16, 16, 8, 8, 16, 16, 16, 15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16])
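To go from that array to "which samples sit under which leaf", the leaf ids can be inverted into a mapping. The following is a minimal sketch, assuming the same iris setup as in the rest of this answer; the name `samples_per_leaf` is a hypothetical helper and not part of scikit-learn.

```python
import collections

import sklearn.datasets
import sklearn.tree

iris = sklearn.datasets.load_iris()
clf = sklearn.tree.DecisionTreeClassifier(random_state=42)
clf = clf.fit(iris.data, iris.target)

# apply() returns, for every sample, the id of the leaf it lands in
leaf_ids = clf.apply(iris.data)

# samples_per_leaf (hypothetical name): leaf id -> list of sample indices
samples_per_leaf = collections.defaultdict(list)
for sample_idx, leaf_id in enumerate(leaf_ids):
    samples_per_leaf[int(leaf_id)].append(sample_idx)

# Print each leaf id together with how many samples it holds
for leaf_id in sorted(samples_per_leaf):
    print(leaf_id, len(samples_per_leaf[leaf_id]))
```

Only leaves appear as keys here, since apply() never reports internal nodes. If internal nodes are needed as well, the decision-path approach is the one to use.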

If you want to get all samples for each node you could calculate all the decision paths with

dec_paths = clf.decision_path(iris.data)

Then loop over the decision paths, convert each one to a dense array with toarray(), and check for every node whether the sample passes through it. The results are stored in a defaultdict where the key is the node number and the value is the list of sample indices that reach that node.

for d, dec in enumerate(dec_paths):
    # dense 0/1 vector over all nodes for sample d (converted once per sample)
    path = dec.toarray()[0]
    for i in range(clf.tree_.node_count):
        if path[i] == 1:
            samples[i].append(d)

Complete code

import sklearn.datasets
import sklearn.tree
import collections

clf = sklearn.tree.DecisionTreeClassifier(random_state=42)
iris = sklearn.datasets.load_iris()
clf = clf.fit(iris.data, iris.target)

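# node number -> list of sample indices that pass through that node
samples = collections.defaultdict(list)
# sparse (n_samples, n_nodes) indicator matrix
dec_paths = clf.decision_path(iris.data)

for d, dec in enumerate(dec_paths):
    # dense 0/1 vector over all nodes for sample d (converted once per sample)
    path = dec.toarray()[0]
    for i in range(clf.tree_.node_count):
        if path[i] == 1:
            samples[i].append(d)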

Output

print(samples[13])

[70, 126, 138]
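For larger datasets the nested loop can become slow, because every row is converted to a dense array. As a possible alternative, here is a sketch that reads the sample indices for each node directly from the sparse matrix in CSC (column) format. It assumes the same iris setup; `samples_per_node` is a hypothetical name introduced for illustration.

```python
import sklearn.datasets
import sklearn.tree

iris = sklearn.datasets.load_iris()
clf = sklearn.tree.DecisionTreeClassifier(random_state=42)
clf = clf.fit(iris.data, iris.target)

# decision_path() returns a sparse (n_samples, n_nodes) indicator matrix;
# in CSC format each column's row indices are stored contiguously
node_indicator = clf.decision_path(iris.data).tocsc()

# samples_per_node (hypothetical name): node id -> sorted sample indices
samples_per_node = {}
for node_id in range(clf.tree_.node_count):
    start = node_indicator.indptr[node_id]
    end = node_indicator.indptr[node_id + 1]
    samples_per_node[node_id] = sorted(node_indicator.indices[start:end].tolist())

print(samples_per_node[13])
```

The root (node 0) contains every sample, and a node is a leaf when `clf.tree_.children_left[node_id] == -1`, so the dictionary can be filtered on that condition to keep only the leaves.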

edited Sep 05 '19 at 14:49

answered Jul 30 '17 at 11:00

Maximilian Peters


In `print(samples[13])`, what does the 13 represent? And does the output [70, 126, 138] mean the indexes of the feature vectors? – Farshid Rayhan Jul 30 '17 at 16:11
`13` is the node number – Maximilian Peters Jul 30 '17 at 16:19
Can I get the decision path of a **test** sample, not **training samples**? – Alaa M. Apr 26 '19 at 17:17
@AlaaM. you could run `clf.decision_path(my_test_samples)` and you should get the decision path for those samples. – Maximilian Peters Apr 26 '19 at 17:31
@MaximilianPeters - That's good, thanks. Do you know a way to get a `dot` element though? – Alaa M. Apr 26 '19 at 18:04
@AlaaM. have a look at this answer: https://stackoverflow.com/a/43218264/2776376, if you pass in one sample you could color all nodes which have one sample and you can visualize the decision for this particular sample. – Maximilian Peters Apr 26 '19 at 18:24
@MaximilianPeters - I'm not sure what you mean by "pass in one sample". `export_graphviz` gets a fitted classifier. So at that point the classifier is only aware of the training samples. – Alaa M. Apr 26 '19 at 18:37
@AlaaM. Perhaps best to post it as a new question? – Maximilian Peters Apr 26 '19 at 20:25
Hey! I am trying your code, but my plotted decision tree (of a random forest) has 180 samples in node 0 and 29 nodes. While your code returns 23 sample[i] and 651 samples. Am I missing something? – Noob Programmer Nov 05 '21 at 10:40
@NoobProgrammer: Can you open a new question with your code? I have not tried the code on random forests. – Maximilian Peters Nov 05 '21 at 10:49
@MaximilianPeters https://stackoverflow.com/questions/69852142/extracting-samples-indices-of-decision-trees-in-random-forest – Noob Programmer Nov 05 '21 at 10:55
