I have started using scikit-learn decision trees, and so far they are working out quite well. One thing I need to do, though, is retrieve the set of sample Y values at a leaf node, especially when running a prediction. That is, given an input feature vector X, I want to know the set of corresponding Y values at the leaf node, instead of just the regression value, which is the mean (or median) of those values. Of course one would want the sample mean to have a small variance, but I do want to extract the actual set of Y values and do some statistics / create a PDF.

I have used code like this (how to extract the decision rules from scikit-learn decision-tree?) to print the decision tree, but the output of 'value' is a single float representing the mean. I have a large dataset, so I limit the leaf size to e.g. 100 samples; I want to access those 100 values...
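For concreteness, a minimal sketch of the setup (random data stands in for the real dataset): the fitted tree's `tree_.value` array holds only one number per node for regression, the mean of the targets that reached it, which is why the raw Y values aren't directly available.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data standing in for the real (large) dataset.
rng = np.random.RandomState(0)
X, y = rng.rand(1000, 5), rng.rand(1000)

# min_samples_leaf caps how small a leaf may get, as described above.
reg = DecisionTreeRegressor(min_samples_leaf=100).fit(X, y)

# For regression, tree_.value stores a single number per node: the mean of
# the targets that reached it. The raw y values themselves are not kept.
print(reg.tree_.value.shape)  # (n_nodes, 1, 1)
```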
You need something like this: http://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html#sphx-glr-auto-examples-tree-plot-unveil-tree-structure-py – Vivek Kumar Jun 30 '17 at 09:56
You can use `apply` to get the leaf ids of each sample; [see here.](https://stackoverflow.com/questions/38299015/getting-the-distribution-of-values-at-the-leaf-node-for-a-decisiontreeregressor/38318135#38318135) (A code sketch of this approach follows the comments.) – Matt Hancock Jun 30 '17 at 11:09
Thank you for these replies. I coded this up and get the same mean as shown when exporting the tree with graph_viz, so that's good. However, although compact, it doesn't seem efficient: I fit the data to a tree, and each leaf node ends up with a subset of the samples; I then iterate through the data a second time to record which leaf each sample falls into, so I can get the corresponding targets. Shouldn't that data already be stored in the leaf node somewhere? It doesn't seem slow, so maybe the duplication isn't worth worrying about. – user1978816 Jun 30 '17 at 19:02
No, only the means and counts are stored in the leaves. I think the duplication is okay. – David Dale Nov 25 '17 at 17:02
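Putting the `apply` suggestion from the comments into code, a minimal sketch (random data again stands in for the real X and y): fit once, map every training sample to its leaf, and bucket the targets by leaf id so the full distribution is available for any prediction.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data standing in for the real dataset.
rng = np.random.RandomState(0)
X, y = rng.rand(1000, 5), rng.rand(1000)

reg = DecisionTreeRegressor(min_samples_leaf=100).fit(X, y)

# apply() maps each sample to the id of the leaf it lands in.
leaf_ids = reg.apply(X)

# Bucket the training targets by leaf id: one array of raw y values per leaf.
leaf_values = {leaf: y[leaf_ids == leaf] for leaf in np.unique(leaf_ids)}

# For a new sample, look up its leaf and pull out the full set of y values;
# their mean matches reg.predict() under the default (MSE) criterion.
x_new = rng.rand(1, 5)
y_at_leaf = leaf_values[reg.apply(x_new)[0]]
print(y_at_leaf.mean(), y_at_leaf.std(), len(y_at_leaf))
```

This does pass over the training data a second time, as noted in the comments, but the single vectorised `apply` call is cheap compared with fitting the tree.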
1 Answer
Another solution is to use an (undocumented?) feature of the fitted sklearn DecisionTreeRegressor object, `tree_.impurity`: it holds the impurity of each node, which for the default MSE criterion is the variance of the target values in that node (so its square root gives the standard deviation per leaf).
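A minimal sketch of that idea (again with stand-in data): index `tree_.impurity` with the leaf ids returned by `apply`, and take a square root to turn the stored variance into a standard deviation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X, y = rng.rand(1000, 5), rng.rand(1000)

reg = DecisionTreeRegressor(min_samples_leaf=100).fit(X, y)

# Leaf ids for a few query samples, then the per-leaf impurity (variance
# under the MSE criterion) at those leaves.
leaf_ids = reg.apply(X[:3])
leaf_variance = reg.tree_.impurity[leaf_ids]
print(np.sqrt(leaf_variance))  # standard deviation of y within each leaf
```

Note this gives only a summary statistic per leaf, not the raw Y values; for the full distribution the `apply`-based bucketing above is still needed.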

amitant
While this might be a valuable hint to solve the problem, a good answer also demonstrates the solution. Please [edit] to provide example code to show what you mean. Alternatively, consider writing this as a comment instead. – Toby Speight Aug 10 '17 at 08:00