Why does this decision tree's values at each step not sum to the number of samples?

Question

I'm reading about decision trees and bagging classifiers, and I'm trying to show the first decision tree that is used in the bagging classifier. I'm confused about the output.

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from graphviz import Source

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), 
    n_estimators=500,
    max_samples=100, 
    bootstrap=True, 
    n_jobs=-1)
bag_clf.fit(X_train, y_train)

Source(tree.export_graphviz(bag_clf.estimators_[0], out_file=None))

Here's a snippet out of the output

It's been my understanding that the value is supposed to show how many of the samples are classified as each category. In that case, shouldn't the numbers in the value field add up to the samples field? Why is that not the case here?

desertnaut · Accepted Answer · 2022-05-05T00:06:19.803

Nice catch.

It would seem that the repeated draws produced by bootstrapping are included in value, but not in the total samples; repeating your code verbatim but changing to bootstrap=False eliminates the discrepancy.
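The comparison can also be made numerically, without rendering the tree. The following is a minimal sketch, not the exact code from the question: it compares the root node's `n_node_samples` (the figure exported as samples) with `weighted_n_node_samples` (the weighted total the class counts add up to). The helper name `root_counts`, the reduced `n_estimators=5` and the added `random_state=42` are assumptions made here for speed and reproducibility. Whether the two numbers differ with bootstrap=True may depend on the scikit-learn version, so no output is shown.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)


def root_counts(bootstrap):
    """Return (samples, weighted samples) at the root of the first bagged tree."""
    bag_clf = BaggingClassifier(
        DecisionTreeClassifier(),
        n_estimators=5,     # fewer trees than in the question, only to keep this fast
        max_samples=100,
        bootstrap=bootstrap,
        random_state=42,    # added for reproducibility; not in the original code
    )
    bag_clf.fit(X_train, y_train)
    fitted_tree = bag_clf.estimators_[0].tree_
    # n_node_samples: number of training rows that reached the node
    # weighted_n_node_samples: the same count, weighted by sample weight
    return (
        int(fitted_tree.n_node_samples[0]),
        float(fitted_tree.weighted_n_node_samples[0]),
    )


for bootstrap in (True, False):
    samples, weighted = root_counts(bootstrap)
    print(f"bootstrap={bootstrap}: samples={samples}, weighted samples={weighted}")
```

If the first number is smaller than the second with bootstrap=True, the rows drawn more than once are being counted once in samples and several times in value, which matches the explanation above.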

The behavior is similar in Random Forest, both classifier and regressor - see respectively:

Why the sum "value" isn't equal to the number of "samples" in scikit-learn RandomForestClassifier?
sklearn RandomForestRegressor discrepancy in the displayed tree values

score 1 · Answer 2 · answered May 13 '19 at 00:51

Interesting find.

I did some digging around and found that bootstrapping switches on the proportion=True option when exporting the graphviz object. Since the same sample may pass through the decision tree more than once, the values are expressed in percentage terms. With bootstrap=False, each sample goes through only once, so the values can be expressed as sample counts for each class.
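For reference, `proportion` is an argument of `export_graphviz` that defaults to False, and the call in the question does not pass it. One way to check whether it plays a role is to export the same tree both ways and compare the root node's label. This is a minimal sketch on a standalone tree rather than the bagged tree from the question; `clf` and `first_node_line` are illustrative names introduced here.

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier, export_graphviz

X, y = make_moons(n_samples=100, noise=0.30, random_state=42)
clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

# Default export: samples and value are shown as counts.
dot_counts = export_graphviz(clf, out_file=None)
# proportion=True: samples is shown as a percentage and value as class fractions.
dot_shares = export_graphviz(clf, out_file=None, proportion=True)


def first_node_line(dot):
    """Return the first DOT line that carries a node label (the root node)."""
    return next(line for line in dot.splitlines() if "samples =" in line)


print(first_node_line(dot_counts))
print(first_node_line(dot_shares))
```

If the exported tree in the question shows samples as a plain integer rather than a percentage, the export was made with proportion left at its default.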
