
Suppose I have the following DecisionTreeClassifier model:

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer

bunch = load_breast_cancer()

X, y = bunch.data, bunch.target

model = DecisionTreeClassifier(random_state=100)
model.fit(X, y)

I want to traverse each node (both leaf and decision nodes) in this tree and determine how the predicted value changes as the tree is traversed. Basically, I'd like to be able to tell, for a given sample, how its ultimate prediction (what's returned by .predict) is determined. So maybe the sample is ultimately predicted 1, but it traverses four nodes, and at each node its "constant" prediction (the language used in the scikit-learn docs) goes from 1 to 0 to 0 to 1 again.

It's not immediately apparent how I'd get that information from model.tree_.value, which is described as:

 |  value : array of double, shape [node_count, n_outputs, max_n_classes]
 |      Contains the constant prediction value of each node.

And looks like, in the case of this model:

>>> model.tree_.value.shape
(43, 1, 2)
>>> model.tree_.value
array([[[212., 357.]],

       [[ 33., 346.]],

       [[  5., 328.]],

       [[  4., 328.]],

       [[  2., 317.]],

       [[  1.,   6.]],

       [[  1.,   0.]],

       [[  0.,   6.]],

       [[  1., 311.]],

       [[  0., 292.]],

       [[  1.,  19.]],

       [[  1.,   0.]],

       [[  0.,  19.]],

Does anyone know how I could accomplish this? Would the class prediction for each of the 43 nodes above just be the argmax of each list? So 1, 1, 1, 1, 1, 1, 0, 0, ..., going from top to bottom above?
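For reference, here is a quick sanity check of the argmax idea (assuming tree_.value stores raw class counts, as it does in the version shown; newer scikit-learn releases store normalized fractions per class, but the argmax is unchanged):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(random_state=100).fit(X, y)

# per-node predicted class: argmax over the class axis of tree_.value
node_pred = model.tree_.value.squeeze(axis=1).argmax(axis=1)

# at the leaf each training sample lands in (model.apply), the argmax
# should agree with what .predict returns
leaves = model.apply(X)
assert (node_pred[leaves] == model.predict(X)).all()
```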

blacksite

1 Answer


One solution could be to directly walk the decision path in the tree. You could adapt this solution that prints the whole decision tree as if clauses. Here is a quick adaptation that explains a single instance:

import numpy as np
from sklearn.tree import _tree

def tree_path(instance, values, left, right, threshold, features, node, depth):
    spacer = '    ' * depth
    if threshold[node] != _tree.TREE_UNDEFINED:
        # decision node: test the feature, recurse into the matching child
        if instance[features[node]] <= threshold[node]:
            path = f'{spacer}{features[node]} ({round(instance[features[node]], 2)}) <= {round(threshold[node], 2)}'
            next_node = left[node]
        else:
            path = f'{spacer}{features[node]} ({round(instance[features[node]], 2)}) > {round(threshold[node], 2)}'
            next_node = right[node]
        return path + '\n' + tree_path(instance, values, left, right, threshold, features, next_node, depth + 1)
    else:
        # leaf: report the predicted class (argmax of the per-class values)
        # along with the class counts at this leaf
        target = values[node][0]
        predicted = int(np.argmax(target))
        counts = [int(c) for c in target]
        return f'{spacer}==> class {predicted} ( {counts} examples per class )'

def get_path_code(tree, feature_names, instance):
    left      = tree.tree_.children_left
    right     = tree.tree_.children_right
    threshold = tree.tree_.threshold
    features  = [feature_names[i] for i in tree.tree_.feature]
    values = tree.tree_.value
    return tree_path(instance, values, left, right, threshold, features, 0, 0)

# print the decision path of the first instance of a pandas DataFrame df
print(get_path_code(tree, df.columns, df.iloc[0]))
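If you'd rather not recurse manually, a sketch using scikit-learn's decision_path (a real method on fitted estimators) extracts the visited nodes directly; node ids along a path are increasing, so the sparse matrix's indices come out in root-to-leaf order:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(random_state=100).fit(X, y)

# sparse indicator matrix: entry (i, j) is nonzero if sample i visits node j
node_indicator = model.decision_path(X[:1])
visited = node_indicator.indices  # nodes visited by the first sample

for node in visited:
    counts = model.tree_.value[node, 0]        # per-class values at this node
    print(node, counts, int(counts.argmax()))  # node id, values, "constant" prediction
```

This prints exactly the sequence the question asks about: the node-by-node "constant" prediction from root to leaf.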
  • I've already made a function that does exactly this. I'm really just looking for advice as to whether taking the dominant class at each node is an okay strategy for reporting how predictions change as a particular path in the tree is traversed. – blacksite Nov 27 '18 at 13:34
  • Ok. Perhaps you could sum the number of targets per class associated with each leaf under a node. For instance, you could have at one node: Class A (18 targets), Class B (10 targets), which could be a clue? – guillaume ERETEO Nov 28 '18 at 14:12
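The per-node counts the comment describes can be recomputed from the training data itself; a sketch (the stacking into a counts array is my own choice, not a library API) that mirrors what tree_.value holds on versions storing raw counts:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(random_state=100).fit(X, y)

# indicator[i, j] == 1 when sample i passes through node j
indicator = model.decision_path(X).toarray()

# for each node, count how many samples of each class reach it
counts = np.stack([indicator[y == c].sum(axis=0) for c in (0, 1)], axis=1)

print(counts[0])  # root node: class totals for the whole training set
```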