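from sklearn import tree
import graphviz
import shap

# Boston housing data as bundled with shap
X, y = shap.datasets.boston()

# A shallow regression tree, so the plot stays readable
clf = tree.DecisionTreeRegressor(max_depth=2).fit(X, y)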

gives us the following tree:

[Image: plot of the fitted decision tree]

The values are confusing to me. I understand that the values at the leaves are the predictions made once that leaf is reached, but what do the values at the internal nodes represent?

I found a few SO posts and some documentation covering classification trees, but nothing for regression.

EDIT: Thinking about it further, I see that they are most likely just the predictions for those bins if the tree were cut short at that node. I'm not sure why exactly they're used in SHAP, though.

desertnaut
Alexis Drakopoulos

1 Answer


Let's focus on one node, for example:

[Image: an internal node showing X12 <= 14.4, mse = 40.273, samples = 430, value = 19.934]

  • X12 <= 14.4 is the split condition applied at this node, using feature X12: samples that satisfy it go to the left child, the rest go to the right child.
  • samples = 430 is the number of training samples that reach this node. Note that the root node holds all 506 samples, which is the sum of its two children (430 + 76).
  • value = 19.934 is the mean target of the training samples in this node, i.e. the prediction the tree would make if it stopped here. mse = 40.273 is the mean squared error of that prediction over those same samples.
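The three quantities above can be recomputed by hand from the fitted tree. The following is a minimal sketch, assuming synthetic data in place of the Boston set (the data and the names `in_node` and `y_node` are made up for illustration) and the default squared-error criterion; it reads `value`, `impurity` and `n_node_samples` from `clf.tree_` and compares them with the mean, variance and count of the training targets that reach each node.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; the question uses the Boston data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

clf = DecisionTreeRegressor(max_depth=2).fit(X, y)
t = clf.tree_

# Row i, column j is True when training sample i passes through node j.
in_node = clf.decision_path(X).toarray().astype(bool)

for node in range(t.node_count):
    y_node = y[in_node[:, node]]  # targets of the samples reaching this node
    print(
        f"node {node}: "
        f"samples={t.n_node_samples[node]} (recomputed {len(y_node)}), "
        f"value={t.value[node, 0, 0]:.3f} (mean {y_node.mean():.3f}), "
        f"mse={t.impurity[node]:.3f} (variance {y_node.var():.3f})"
    )
```

For every node, internal or leaf, each stored number should match the one recomputed next to it.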

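Each split reduces the number of samples per node, and the sample-weighted average of the children's mse is never larger than the parent's mse (an individual child can still have a higher mse than its parent). The value changes from node to node because each node averages the target over a narrower subset of the data, so the prediction becomes more specific.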
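To make the effect of a split on the mse concrete, here is a small sketch in plain Python. The target values and the partition are hand-picked for illustration (they do not come from the Boston data, and a real tree would choose the partition through a feature threshold); the helper names `mean` and `mse` are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def mse(values):
    # Mean squared error of predicting the node mean for every sample.
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


# Hand-picked targets: one child is very homogeneous, the other is not.
left = [5.0] * 8
right = [0.0, 12.0]
parent = left + right

parent_mse = mse(parent)
weighted_children_mse = (
    len(left) * mse(left) + len(right) * mse(right)
) / len(parent)

print("parent mse:           ", parent_mse)
print("left child mse:       ", mse(left))
print("right child mse:      ", mse(right))
print("weighted children mse:", weighted_children_mse)
```

In this made-up case the right child has a larger mse than the parent, while the sample-weighted average over both children is still no larger than the parent's mse.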

About shap: you are using this library only to import the dataset, nothing more, so you could have loaded the data without it. There are other ways to load the Boston data, for example with scikit-learn (note that load_boston was removed in scikit-learn 1.2, so this needs an older version):

from sklearn.datasets import load_boston
X, y = load_boston(return_X_y=True)

Either way, you should check that both sources give exactly the same dataset (for example, one of them may have more samples than the other).

Alex Serra Marrugat