
This is my code:

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.3)

# a depth-1 tree: a single split at the root
clf = DecisionTreeClassifier(max_depth=1)
clf.fit(X_train, y_train)
print(clf.predict(X_test))

[Image: visualization of the trained decision tree; the false branch shows value = [0, 39, 38]]

I have added an image of the tree for the trained model. You can see that on the false branch the node has value = [0, 39, 38], which corresponds to classes 0, 1, and 2 respectively. So on the false branch, class 1 has the highest probability of being the output. According to the tree, the predictions should therefore only ever be 0 or 1, yet I can see 2 in the predictions as well. So how does sklearn choose the class on the false branch, and under what condition does it predict each output?
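
For reference, here is a quick sketch (separate from my code above; the random_state=0 is arbitrary, just to make it reproducible) of how the root split and the node counts from the image can be inspected programmatically:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

dataset = load_iris()
clf = DecisionTreeClassifier(max_depth=1, random_state=0)
clf.fit(dataset.data, dataset.target)

# node 0 is the root split; the tree_ arrays describe the whole tree
print(dataset.feature_names[clf.tree_.feature[0]])  # feature used at the root
print(clf.tree_.threshold[0])                       # split threshold
print(clf.tree_.value)                              # per-class counts at each node
                                                    # (fractions in newer scikit-learn versions)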

  • The algorithm grouped classes 1 and 2 into one leaf because you chose max_depth=1. – LazyCoder Jul 27 '19 at 12:47
  • Suppose I give a petal length of 3; it should go to the false side. Now how does the algorithm predict either 1 or 2? That is my doubt. – user9580899 Jul 27 '19 at 12:48
  • It is going to the false side, isn't it? After that, due to the Gini index, it's almost a 50-50 chance from there. – LazyCoder Jul 27 '19 at 13:00
  • @LazyCoder My doubt is: under this scenario, how does sklearn predict 1 or 2 when there is a 50% chance of both? sklearn returns a scalar output. – user9580899 Jul 27 '19 at 13:17
  • Isn't that equivalent to asking "what will an sklearn decision tree do for two classes with D=0"? I think it is, and in that case the chance of each class follows the class occurrences, and the decision is taken by the majority (which corresponds to those chances). – mr_mo Jul 27 '19 at 13:20
  • @mr_mo The predictions contain both 1 and 2, even though as per the image 1 has the higher chance. – user9580899 Jul 27 '19 at 13:23
  • D=1: if petal_length < 2.45 --> predict 0 (all examples with petal_length < 2.45 were labeled 0); otherwise --> predict 1. This corresponds to the "value" list of each leaf, which holds the number of occurrences of each class on that exact branch of the tree. Take that, divide it by the sum of the values, and you get the empirical probability of each class conditioned on the branch (see the sketch after these comments). – mr_mo Jul 27 '19 at 13:31
  • Please see https://en.wikipedia.org/wiki/C4.5_algorithm on the optimization of such trees (more commonly used today is the [XGBoost](https://xgboost.readthedocs.io/en/latest/) algorithm, which implements gradient boosting for decision trees). This is by definition how the ID3 and C4.5 optimizations work; see also the Gini metric for the mutual information between the branch and the labels (https://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity). – mr_mo Jul 27 '19 at 13:31
  • If you're asking why it also shows you 2 in the predictions, that could be due to the randomization scikit-learn uses in feature and split selection; see https://stackoverflow.com/questions/21391429/classification-tree-in-sklearn-giving-inconsistent-answers – mr_mo Jul 27 '19 at 13:36
  • @mr_mo Thanks for sharing, it really helped me. – user9580899 Jul 29 '19 at 01:52
  • @user9580899 glad I could help – mr_mo Jul 29 '19 at 15:27
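
To make mr_mo's point concrete, here is a sketch (not code from the thread; it assumes the iris data and max_depth=1 from the question): predict_proba is exactly the leaf's value counts normalized by their sum, and predict takes the argmax of those probabilities.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

leaf_ids = clf.apply(X)                      # index of the leaf each sample lands in
values = clf.tree_.value[leaf_ids][:, 0, :]  # per-class values stored at those leaves
probs = values / values.sum(axis=1, keepdims=True)

print(np.allclose(probs, clf.predict_proba(X)))        # True
print((probs.argmax(axis=1) == clf.predict(X)).all())  # True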

1 Answer


I am sure the difference is because the random_state was not set.

There are two sources of randomness here:

  • the train/test split
  • building the decision tree model

You might have predicted with one decision tree and then created the visualization from another decision tree.
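
As a sketch of that effect (again assuming the iris data from the question): without a fixed random_state, fits may resolve a tie between equally good splits differently, so the tree you visualize may not be the tree you predicted with.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

dataset = load_iris()

# petal length and petal width both separate class 0 perfectly at depth 1,
# so different seeds may resolve the tie between them differently
for seed in range(4):
    clf = DecisionTreeClassifier(max_depth=1, random_state=seed)
    clf.fit(dataset.data, dataset.target)
    root = clf.tree_.feature[0]
    print(seed, dataset.feature_names[root], clf.tree_.threshold[0])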

Try the following code with different random_state values:

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

dataset = load_iris()

# fix the randomness of the train/test split
X_train, X_test, y_train, y_test = train_test_split(dataset.data,
                                                    dataset.target,
                                                    test_size=0.3,
                                                    random_state=0)

# fix the randomness of the tree building (tie-breaking between splits)
clf = DecisionTreeClassifier(max_depth=1, random_state=1)
clf.fit(X_train, y_train)
print(clf.predict(X_test))

plot_tree(clf)

[Image: plot_tree output for the fitted depth-1 tree]

Note: you need scikit-learn version 0.21.2 for the plot_tree feature.
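
If you are not sure which version you have installed, a quick check:

import sklearn
print(sklearn.__version__)  # plot_tree needs 0.21.2 per the note above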
