
I have the following code from the scikit-learn website:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
iris = load_iris()
for i in range(10):
    clf = DecisionTreeClassifier()    
    a = cross_val_score(clf, iris.data, iris.target, cv=10)
    clf2 = DecisionTreeClassifier()    
    b = cross_val_score(clf2, iris.data, iris.target, cv=10)
    if not np.array_equal(a,b):
        print('diff')
        print(a)
        print(b)
        break

It sometimes prints the difference, so I guess it's not deterministic, which is very strange.

Shevach

2 Answers


Decision trees ARE deterministic: they calculate the leaves/probabilities on the same raw dataset. If you were to use something like a random forest, then that would not be deterministic, because it randomly selects variables. The problem is in your test for equality. You are testing using

object1 == object2

Python does not natively know how to compare objects of type DecisionTreeClassifier. Are you asking if it has the same values assigned to its properties? Do you want to know if the memory size is the same? Are dt and dt2 pointers that reference the same object? There isn't a way to know from what you have written. A better test would be to train the models and use the .predict() method on the same data. Are all the results the same every time? Then you probably have a deterministic classifier.
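The prediction-based check described above could look roughly like the sketch below. The helper name same_predictions and the train/test split are illustrative assumptions, not part of scikit-learn or of the original question; both classifiers are given the same random_state so the comparison is repeatable.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def same_predictions(clf_a, clf_b, X_train, y_train, X_test):
    """Fit both classifiers on the same data and compare their predictions.

    Hypothetical helper for illustration only.
    """
    pred_a = clf_a.fit(X_train, y_train).predict(X_test)
    pred_b = clf_b.fit(X_train, y_train).predict(X_test)
    return np.array_equal(pred_a, pred_b)


iris = load_iris()
# Hold out part of the data so the comparison is made on unseen samples.
X_train, X_test, y_train, _ = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

seeded_match = same_predictions(
    DecisionTreeClassifier(random_state=0),
    DecisionTreeClassifier(random_state=0),
    X_train,
    y_train,
    X_test,
)
print(seeded_match)
```

Matching predictions on one dataset suggest, but do not prove, that two fitted models are identical.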

The "pythonic" way is to define an __eq__ method in the class file. If you look here you'll see that there isn't one in the tree class. I haven't looked further, but I doubt that they've defined that method. (Checking whether two classifier models are equivalent is not a common thing to do.)

flyingmeatball

I have found that DecisionTreeClassifier uses a random seed if the random_state parameter isn't specified, as noted here:

random_state : int, RandomState instance or None, optional (default=None) If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

Fixed code (I have added random_state=0 to the DecisionTreeClassifier constructor):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
iris = load_iris()
for i in range(10):
    clf = DecisionTreeClassifier(random_state=0)    
    a = cross_val_score(clf, iris.data, iris.target, cv=10)
    clf2 = DecisionTreeClassifier(random_state=0)    
    b = cross_val_score(clf2, iris.data, iris.target, cv=10)
    if not np.array_equal(a,b):
        print('diff')
        print(a)
        print(b)
        break

This works as expected: np.array_equal(a, b) is always True.
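To see what the seed actually changes, one option is to compare the fitted tree structure directly rather than the cross-validation scores. The scikit-learn documentation for DecisionTreeClassifier notes that features are randomly permuted at each split even with splitter="best", so equally good splits may be chosen differently from run to run. The sketch below is only an illustration: tree_signature and same_tree are hypothetical helpers, and whether any seed in the scanned range produces a different tree depends on the data and the scikit-learn version.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier


def tree_signature(seed, X, y):
    """Return the split features and thresholds of a tree fitted with `seed`.

    Hypothetical helper for illustration only.
    """
    tree = DecisionTreeClassifier(random_state=seed).fit(X, y).tree_
    return tree.feature.copy(), tree.threshold.copy()


def same_tree(sig_a, sig_b):
    """True if two signatures have identical features and thresholds."""
    return all(np.array_equal(a, b) for a, b in zip(sig_a, sig_b))


iris = load_iris()
reference = tree_signature(0, iris.data, iris.target)

# Refitting with the same seed should reproduce the same structure.
repeat_matches = same_tree(reference, tree_signature(0, iris.data, iris.target))

# Seeds whose fitted tree differs from the reference (may be empty).
differing_seeds = [
    seed
    for seed in range(1, 20)
    if not same_tree(reference, tree_signature(seed, iris.data, iris.target))
]
print(repeat_matches, differing_seeds)
```

If differing_seeds is non-empty, the fitted tree depends on the seed even with the default splitter, which would account for the differing scores in the question.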

joelostblom
Shevach
  • 2
    Jack, I'm glad you found the `random_state` parameter, but as far as I'm aware there is no randomness in the method if `max_features == None` and `splitter != 'random'`. Since this is the case by default, I'm confused as to where the randomness is coming from since the default CART should be deterministic. I think an issue should be opened on the `sklearn` github pages to resolve this. – Matt Hancock Feb 23 '17 at 11:58
  • 2
    I opened the issue here: https://github.com/scikit-learn/scikit-learn/issues/8443 – Matt Hancock Feb 24 '17 at 11:30