
I have been playing around with sklearn a bit and following some simple examples online using the iris data.

I've now begun to play with some other data. I'm not sure whether this behaviour is correct and I'm just misunderstanding it, but every time I call fit(x, y) I get a completely different tree. When I then run predictions, the accuracy varies by around 10% between runs, e.g. 60%, then 70%, then 65%.

I ran the code below twice to output two trees so I could read them in Word. I searched for values from one document in the other, and many of them I couldn't find. I had assumed fit(x, y) would always return the same tree; if that is the case, then I assume my training data of floats is tripping me up.

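from sklearn import tree

# Fit a decision tree with default settings (random_state is left as None)
clf_dt = tree.DecisionTreeClassifier()
clf_dt.fit(x_train, y_train)
# Write the fitted tree in Graphviz DOT format
with open("output2.dot", "w") as output_file:
    tree.export_graphviz(clf_dt, out_file=output_file)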
Tchotchke
user2616166
  • In the future, I'd provide data so that you have a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), as it'll be easier for people to help. Also, tagging your questions with `python` will get more eyes on it. – Tchotchke Jul 20 '16 at 11:35

1 Answer


There is a random component to the algorithm, which you can read about in the user guide. The relevant part:

The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.

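If you want the same result each time, set the random_state parameter to an integer (by default it is None). With a fixed seed, repeated calls to fit on the same data build the same tree. Without one, scikit-learn randomly permutes the features at each split, so when several candidate splits improve the criterion equally, the one chosen can differ between runs.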
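As a minimal sketch of this, the snippet below fits two trees with the same seed and compares their Graphviz exports. It uses the iris data as a stand-in because the original training data was not posted. The helper name `fit_and_export` and the seed values are made up for illustration, and calling `export_graphviz` with `out_file=None` to get a string assumes a reasonably recent scikit-learn version.

```python
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Stand-in data (assumption: the original x_train / y_train were not posted).
iris = load_iris()
# Seed the split too; an unseeded split is another source of run-to-run variation.
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)


def fit_and_export(seed):
    """Fit a tree with the given random_state and return its DOT text (hypothetical helper)."""
    clf = tree.DecisionTreeClassifier(random_state=seed)
    clf.fit(x_train, y_train)
    # out_file=None makes export_graphviz return the DOT source as a string.
    return tree.export_graphviz(clf, out_file=None)


dot_a = fit_and_export(42)
dot_b = fit_and_export(42)

# Same data and same seed: the two exported trees should be identical.
print(dot_a == dot_b)
```

If the accuracy still moves between runs after fixing `random_state` on the classifier, it is worth checking whether the train/test split itself is seeded, since a different split changes both the tree and the score.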

Tchotchke