I have started using scikit-learn decision trees, and so far they are working out quite well. One thing I need to do, though, is retrieve the set of sample Y values at a leaf node, especially when running a prediction. That is, given an input feature vector X, I want to know the set of corresponding Y values at the leaf node, instead of just the regression value, which is the mean (or median) of those values. Of course one would want the sample mean to have a small variance, but I do want to extract the actual set of Y values and do some statistics / create a PDF.

I have used code like this (how to extract the decision rules from scikit-learn decision-tree?) to print the decision tree, but the output of 'value' is a single float representing the mean. I have a large dataset, so I limit the leaf size to e.g. 100 samples; I want to access those 100 values...
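For concreteness, a minimal sketch of the setup (random data stands in for the real dataset): the fitted tree's `tree_.value` array holds only one number per node for regression, the mean of the targets that reached it, which is why the raw Y values aren't directly available.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data standing in for the real (large) dataset.
rng = np.random.RandomState(0)
X, y = rng.rand(1000, 5), rng.rand(1000)

# min_samples_leaf caps how small a leaf may get, as described above.
reg = DecisionTreeRegressor(min_samples_leaf=100).fit(X, y)

# For regression, tree_.value stores a single number per node: the mean of
# the targets that reached it. The raw y values themselves are not kept.
print(reg.tree_.value.shape)  # (n_nodes, 1, 1)
```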
You need something like this: http://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html#sphx-glr-auto-examples-tree-plot-unveil-tree-structure-py – Vivek Kumar Jun 30 '17 at 09:56
You can use `apply` to get the leaf ids of each sample; [see here.](https://stackoverflow.com/questions/38299015/getting-the-distribution-of-values-at-the-leaf-node-for-a-decisiontreeregressor/38318135#38318135) (A code sketch of this approach follows the comments.) – Matt Hancock Jun 30 '17 at 11:09
Thank you for these replies. I coded this up and get the same mean as shown when exporting the tree with graph_viz, so that's good. However, although compact, it doesn't seem efficient: I fit the data to a tree, and each leaf node ends up with a subset of the samples; I then iterate through the data a second time to record which leaf each sample falls into, so I can get the corresponding targets. Shouldn't that data already be stored in the leaf node somewhere? It doesn't seem slow, so maybe the duplication isn't worth worrying about. – user1978816 Jun 30 '17 at 19:02
No, only the means and counts are stored in the leaves. I think the duplication is okay. – David Dale Nov 25 '17 at 17:02
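Putting the `apply` suggestion from the comments into code, a minimal sketch (random data again stands in for the real X and y): fit once, map every training sample to its leaf, and bucket the targets by leaf id so the full distribution is available for any prediction.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data standing in for the real dataset.
rng = np.random.RandomState(0)
X, y = rng.rand(1000, 5), rng.rand(1000)

reg = DecisionTreeRegressor(min_samples_leaf=100).fit(X, y)

# apply() maps each sample to the id of the leaf it lands in.
leaf_ids = reg.apply(X)

# Bucket the training targets by leaf id: one array of raw y values per leaf.
leaf_values = {leaf: y[leaf_ids == leaf] for leaf in np.unique(leaf_ids)}

# For a new sample, look up its leaf and pull out the full set of y values;
# their mean matches reg.predict() under the default (MSE) criterion.
x_new = rng.rand(1, 5)
y_at_leaf = leaf_values[reg.apply(x_new)[0]]
print(y_at_leaf.mean(), y_at_leaf.std(), len(y_at_leaf))
```

This does pass over the training data a second time, as noted in the comments, but the single vectorised `apply` call is cheap compared with fitting the tree.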
1 Answer
Another solution is to use an (undocumented?) feature of the fitted sklearn DecisionTreeRegressor object, `tree_.impurity`: it holds the impurity of each node, which for the default MSE criterion is the variance of the target values in that node (so its square root gives the standard deviation per leaf).
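A minimal sketch of that idea (again with stand-in data): index `tree_.impurity` with the leaf ids returned by `apply`, and take a square root to turn the stored variance into a standard deviation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X, y = rng.rand(1000, 5), rng.rand(1000)

reg = DecisionTreeRegressor(min_samples_leaf=100).fit(X, y)

# Leaf ids for a few query samples, then the per-leaf impurity (variance
# under the MSE criterion) at those leaves.
leaf_ids = reg.apply(X[:3])
leaf_variance = reg.tree_.impurity[leaf_ids]
print(np.sqrt(leaf_variance))  # standard deviation of y within each leaf
```

Note this gives only a summary statistic per leaf, not the raw Y values; for the full distribution the `apply`-based bucketing above is still needed.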

amitant
While this might be a valuable hint to solve the problem, a good answer also demonstrates the solution. Please [edit] to provide example code to show what you mean. Alternatively, consider writing this as a comment instead. – Toby Speight Aug 10 '17 at 08:00