I am using a classification tree from sklearn, and when I train the model twice on the same data and predict with the same test data, I get different results. I tried reproducing this on the smaller iris data set and there it worked as expected: the two runs gave identical predictions. Here is some code:
from sklearn import tree
from sklearn.datasets import load_iris

# Load the iris data set
iris = load_iris()

# Fit the same tree twice on the same data and predict on the same data
clf = tree.DecisionTreeClassifier()
clf.fit(iris.data, iris.target)
r1 = clf.predict_proba(iris.data)

clf.fit(iris.data, iris.target)
r2 = clf.predict_proba(iris.data)
r1 and r2 are the same for this small example, but when I run the same procedure on my own much larger data set I get differing results. Is there a reason why this would occur?
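For reference, this is roughly how I am comparing the two outputs (a small sketch assuming NumPy is available; r1 and r2 are the arrays from the code above):

import numpy as np

# Element-wise check that the two sets of predicted probabilities match;
# this prints True on iris, but the analogous check on my larger data set does not
print(np.array_equal(r1, r2))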
EDIT After looking into the documentation I see that DecisionTreeClassifier has a random_state parameter, which seeds the randomness used during fitting. By setting this value to a constant I get rid of the problem I was previously having. However, now I'm concerned that my model is not as optimal as it could be. What is the recommended method for choosing this value? Try some seeds at random? Or are all results expected to be about the same?
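For context, this is roughly what I'm doing now, plus the approach I was considering for comparing a few seeds via cross-validation (a sketch only; the specific seed values 0, 1, 42, 123 are arbitrary choices of mine, not a recommendation):

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris

iris = load_iris()

# Fixing random_state makes repeated fits on the same data reproducible
clf = DecisionTreeClassifier(random_state=0)
clf.fit(iris.data, iris.target)

# One idea: try a handful of arbitrary seeds and compare cross-validated accuracy
for seed in (0, 1, 42, 123):
    scores = cross_val_score(
        DecisionTreeClassifier(random_state=seed),
        iris.data, iris.target, cv=5,
    )
    print(seed, scores.mean())

Is something like this reasonable, or is there a better-established way to handle it?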