
I am practicing decision trees with sklearn, using the play tennis data set: DataSet

play_ is the target column.

As per my pen-and-paper calculation of entropy and information gain, the root node should be the outlook_ column because it has the highest information gain.
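For reference, the pen-and-paper calculation can be sketched in plain Python. This is only a sketch: it assumes the data is the standard 14-row play tennis table (the actual contents of playTennis.csv are not shown in this post), and the names ROWS, FEATURES, entropy and information_gain are illustrative, not part of sklearn.

```python
from collections import Counter
from math import log2

# Assumption: the standard 14-row play tennis table.
# Columns: outlook, temp, humidity, windy, play
FEATURES = ["outlook", "temp", "humidity", "windy"]
ROWS = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, col):
    """Gain of a multiway split on column `col` (one branch per category)."""
    labels = [row[-1] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[col], []).append(row[-1])
    # Weighted average entropy of the branches after the split.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


gains = {name: information_gain(ROWS, i) for i, name in enumerate(FEATURES)}
for name, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {gain:.3f}")
```

On the full table this ranks outlook first, which matches the hand calculation. Note that it scores one branch per category, as in the textbook method.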

But somehow my current decision tree has humidity as the root node, and looks like this: Decision Tree Current Scenario

My current code in Python:

from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.tree import DecisionTreeClassifier 
from sklearn.metrics import accuracy_score 
from sklearn import tree 
from sklearn.preprocessing import LabelEncoder

import pandas as pd 
import numpy as np 

df = pd.read_csv('playTennis.csv') 

lb = LabelEncoder() 
df['outlook_'] = lb.fit_transform(df['outlook']) 
df['temp_'] = lb.fit_transform(df['temp'] ) 
df['humidity_'] = lb.fit_transform(df['humidity'] ) 
df['windy_'] = lb.fit_transform(df['windy'] )   
df['play_'] = lb.fit_transform(df['play'] ) 
X = df.iloc[:,5:9] 
Y = df.iloc[:,9]

X_train, X_test , y_train,y_test = train_test_split(X, Y, test_size = 0.3, random_state = 100) 

clf_entropy = DecisionTreeClassifier(criterion='entropy')
clf_entropy.fit(X_train.astype(int),y_train.astype(int)) 
y_pred_en = clf_entropy.predict(X_test)

print("Accuracy is :{0}".format(accuracy_score(y_test.astype(int),y_pred_en) * 100))
  • I recommend changing the title to something more expressive and more specific to your problem to attract answers. Also consider making the dataset "copypastable", e.g. see [here](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – Marcus V. Jan 14 '18 at 20:21
  • Also, please post the full code showing how you displayed the tree. I'm not able to reproduce the tree given above. In my case, humidity is still at the top but the entropy is different. Showing how you calculated the entropy by hand would also help. – Vivek Kumar Jan 15 '18 at 05:44

1 Answer


My guess would be that the train/test split happens in such a way that splitting on humidity ends up with a higher information gain than splitting on outlook. Did you do your pen-and-paper calculations on the training set or on the whole data set?
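One way to check this is to compute the gains twice, once on all rows and once on the training rows only. The sketch below assumes the standard 14-row play tennis table, and TRAIN_IDX is a hypothetical 9-row subset chosen by hand to illustrate the effect. It is not the split that train_test_split produces with random_state = 100.

```python
from collections import Counter
from math import log2

# Assumption: the standard 14-row play tennis table.
# Columns: outlook, temp, humidity, windy, play
FEATURES = ["outlook", "temp", "humidity", "windy"]
ROWS = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gains(rows):
    """Information gain of a multiway split on each feature, for the given rows."""
    labels = [row[-1] for row in rows]
    result = {}
    for col, name in enumerate(FEATURES):
        groups = {}
        for row in rows:
            groups.setdefault(row[col], []).append(row[-1])
        remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
        result[name] = entropy(labels) - remainder
    return result


# Hypothetical training subset (0-based row positions), picked for illustration only.
TRAIN_IDX = [0, 1, 4, 6, 7, 8, 9, 10, 13]
train_rows = [ROWS[i] for i in TRAIN_IDX]

full_gains = gains(ROWS)
train_gains = gains(train_rows)

print("best on full data:    ", max(full_gains, key=full_gains.get))
print("best on training rows:", max(train_gains, key=train_gains.get))
```

With this particular subset, humidity separates the classes perfectly, so it beats outlook even though outlook wins on the full table. To test the real case, TRAIN_IDX could be replaced by the row positions in X_train from the question, assuming the DataFrame still has its default integer index.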

Anne