Decision tree

In the picture above, the top-level node has 6499 samples, which appear to be split into 3356 True and 3143 False. But if you follow the True path, the child node shows 2644 samples. Why isn't it 3356? The sample counts at every level seem to conflict with the values in the level above.

I think I'm just misunderstanding what samples and value mean, but in case the code is at fault, here's the graphing part:

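import collections

import pydotplus
from sklearn import tree

# clf (the fitted tree) and columns are defined earlier in the script.
dot_data = tree.export_graphviz(clf,
                                feature_names=columns[1:],
                                out_file=None,
                                filled=True,
                                rounded=True)
graph = pydotplus.graph_from_dot_data(dot_data)

colors = ('green', 'red')
edges = collections.defaultdict(list)

# Map each parent node to the ids of its child nodes.
for edge in graph.get_edge_list():
    edges[edge.get_source()].append(int(edge.get_destination()))

# Color each parent's lower-numbered child green and the other one red.
for edge in edges:
    edges[edge].sort()
    for i in range(2):
        dest = graph.get_node(str(edges[edge][i]))[0]
        dest.set_fillcolor(colors[i])

graph.write_png('tree.png')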
jss367

In addition to the accepted answer, you may also find this answer helpful: [What does scikit-learn DecisionTreeClassifier.tree_.value do?](https://stackoverflow.com/questions/47719001/what-does-scikit-learn-decisiontreeclassifier-tree-value-do/47719621#47719621) – desertnaut Jun 21 '18 at 10:23

1 Answer

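I think you're misunderstanding what value represents. Value gives the number of instances of each class at that node of the tree, while samples is simply the total of all the instances in value at that node. So at the root, 3356 and 3143 are the counts of the two classes among the 6499 samples, not the sizes of the True and False branches.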
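To make the distinction concrete, here is a minimal pure-Python sketch on made-up data. The rows, the threshold, and the `describe` helper are all hypothetical and only illustrate the idea; they are not taken from the question's dataset or from scikit-learn.

```python
from collections import Counter

# Hypothetical toy data: (feature value, class label) pairs.
rows = [(1.0, 0), (2.0, 1), (3.0, 0), (4.0, 1), (5.0, 1), (6.0, 0)]
threshold = 2.5  # made-up split condition: feature <= 2.5


def describe(node_rows):
    """Return (samples, value) for a node, with value = [n_class0, n_class1]."""
    counts = Counter(label for _, label in node_rows)
    return len(node_rows), [counts[0], counts[1]]


# The split is decided by the feature condition, not by the class label.
true_branch = [r for r in rows if r[0] <= threshold]
false_branch = [r for r in rows if r[0] > threshold]

root = describe(rows)
left = describe(true_branch)
right = describe(false_branch)

print(root, left, right)
# (6, [3, 3]) (2, [1, 1]) (4, [2, 2])
```

The root's value is [3, 3], yet the branches receive 2 and 4 samples: the class counts at a node say nothing about how many samples satisfy its condition.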

The value field tells you nothing about how those samples will be split up based on the outcome of the condition. You'll notice that at each node samples equals sum(value), and likewise each parent node's samples equals the sum of its child nodes' samples (e.g. 6499 == 2644 + 3855).
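Both identities can be checked on a fitted tree. The sketch below assumes a synthetic dataset from `make_classification` as a stand-in for the question's data, which is not shown, and no sample or class weights. It recomputes the per-class counts from `decision_path` rather than reading `tree_.value`, since the contents of that attribute may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; the question's dataset is not shown).
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

t = clf.tree_
# reached[i, n] is True when training sample i passes through node n.
reached = clf.decision_path(X).toarray().astype(bool)

for node in range(t.node_count):
    # Per-class counts of the training samples that reach this node.
    class_counts = np.bincount(y[reached[:, node]], minlength=clf.n_classes_)
    # samples == sum of the per-class counts at the node
    assert class_counts.sum() == t.n_node_samples[node]

    left, right = t.children_left[node], t.children_right[node]
    if left != -1:  # -1 marks a leaf, which has no children
        # parent samples == left child samples + right child samples
        assert t.n_node_samples[node] == (
            t.n_node_samples[left] + t.n_node_samples[right]
        )

    print(node, t.n_node_samples[node], class_counts.tolist())
```

If the script runs without an assertion error, every node satisfies both relationships described above.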

Mihai Chelaru