
I have two sklearn estimators and want to compare them:

import copy

import numpy as np
from sklearn.tree import DecisionTreeClassifier

X, y = np.random.random((100, 2)), np.random.choice(2, 100)
dt1 = DecisionTreeClassifier()
dt1.fit(X, y)
dt2 = DecisionTreeClassifier()
dt3 = copy.deepcopy(dt1)

How can I compare classifiers so that dt1 != dt2, dt1 == dt3?

Dayvid Oliveira
  • Well you first have to define what you mean by two classifiers being equal. – cel Jul 16 '16 at 15:59
  • Same type of classifiers, same parameters, fitted with same data, same outputs ... basically, entirely equal, except for being different objects. – Dayvid Oliveira Jul 16 '16 at 21:42
  • This is not a common problem - there's no pre-implemented equality that can do this for you. Classifiers also don't store the training data, so finding out if they were fitted with the same data is probably not something you can do from the classifier alone. You may want to explain a little bit why you are needing this - maybe there's another way to solve the problem you are having. – cel Jul 17 '16 at 05:03

1 Answer


You will want to compare the params assigned to the classifier instance and the .tree_.value of the trained classifiers:

import copy

import numpy as np
from sklearn.tree import DecisionTreeClassifier


def compare_trees(tree1, tree2):
    # the constructor params must match
    if tree1.get_params() != tree2.get_params():
        return False
    # tree_ only exists once fit() has been called
    fitted1, fitted2 = hasattr(tree1, "tree_"), hasattr(tree2, "tree_")
    if fitted1 != fitted2:  # exactly one of the trees is trained
        return False
    if not fitted1:  # neither has been trained
        return True
    # both trained: the value arrays must match in shape and content
    return np.array_equal(tree1.tree_.value, tree2.tree_.value)


dt1 = DecisionTreeClassifier()
X, y = np.random.random((100, 2)), np.random.choice(2, 100)
dt1.fit(X, y)

dt2 = DecisionTreeClassifier()  # untrained

dt3 = copy.deepcopy(dt1)  # copy of 1st

dt4 = DecisionTreeClassifier()  # trained on different data
X_, y_ = np.random.random((100, 2)), np.random.choice(2, 100)
dt4.fit(X_, y_)

print(compare_trees(dt1, dt1))  # True
print(compare_trees(dt1, dt2))  # False
print(compare_trees(dt1, dt3))  # True
print(compare_trees(dt1, dt4))  # False
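
Comparing only .tree_.value may not be strict enough: two fitted trees could in principle hold the same value array while splitting on different features or thresholds. A stricter check might also compare the split structure. The sketch below is one possible way to do that, assuming the fitted tree_ object exposes the arrays children_left, children_right, feature, threshold and value; the names same_fitted_tree and STRUCTURE_ATTRS are made up for illustration.

```python
import copy

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Arrays on the fitted tree_ object that describe its structure and outputs.
STRUCTURE_ATTRS = ("children_left", "children_right", "feature", "threshold", "value")


def same_fitted_tree(est1, est2):
    """Hypothetical helper: stricter equality check for two tree estimators."""
    # Must be the same estimator class with the same constructor params.
    if type(est1) is not type(est2):
        return False
    if est1.get_params() != est2.get_params():
        return False
    # tree_ only exists after fit(); both must agree on being fitted or not.
    fitted1, fitted2 = hasattr(est1, "tree_"), hasattr(est2, "tree_")
    if not (fitted1 and fitted2):
        return fitted1 == fitted2
    # Compare every structural array; array_equal also handles shape mismatches.
    return all(
        np.array_equal(getattr(est1.tree_, name), getattr(est2.tree_, name))
        for name in STRUCTURE_ATTRS
    )


rng = np.random.RandomState(0)
X, y = rng.random_sample((100, 2)), rng.choice(2, 100)
X_other, y_other = rng.random_sample((100, 2)), rng.choice(2, 100)

dt1 = DecisionTreeClassifier().fit(X, y)
dt2 = DecisionTreeClassifier()                        # untrained
dt3 = copy.deepcopy(dt1)                              # copy of dt1
dt4 = DecisionTreeClassifier().fit(X_other, y_other)  # different data
```

This still says nothing about whether the two estimators saw the same training data, only that they ended up with the same params and the same fitted tree.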
zemekeneng