When I change the order of the input columns for sklearn's DecisionTreeClassifier, the accuracy appears to change. This shouldn't be the case. What am I doing wrong?
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import numpy as np
iris = load_iris()
X = iris['data']
y = iris['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.90, random_state=0)
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
clf = DecisionTreeClassifier(random_state=0)
clf.fit(np.hstack((X_train[:,1:], X_train[:,:1])), y_train)
print(clf.score(X_test, y_test))
clf = DecisionTreeClassifier(random_state=0)
clf.fit(np.hstack((X_train[:,2:], X_train[:,:2])), y_train)
print(clf.score(X_test, y_test))
clf = DecisionTreeClassifier(random_state=0)
clf.fit(np.hstack((X_train[:,3:], X_train[:,:3])), y_train)
print(clf.score(X_test, y_test))
Running this code produces the following output:
0.9407407407407408
0.22962962962962963
0.34074074074074073
0.3333333333333333
This was asked three years ago, but the question got downvoted because no code was provided: Does feature order impact Decision tree algorithm in sklearn?
Edit
In the above code I forgot to apply the column reordering to the test data. However, the different results persist even when I apply the reordering to the whole dataset.
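For reference, here is how I would apply the same reordering consistently to both splits in the first snippet (a sketch; the unrotated case, k=0, reproduces the 0.9407… score above, but I have not checked the rotated scores):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import numpy as np

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris['data'], iris['target'], test_size=0.90, random_state=0)

scores = []
for k in range(4):
    # rotate the columns by k positions in the train AND the test set
    Xtr = np.hstack((X_train[:, k:], X_train[:, :k]))
    Xte = np.hstack((X_test[:, k:], X_test[:, :k]))
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(Xtr, y_train)
    scores.append(clf.score(Xte, y_test))
print(scores)
```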
First I import the data and turn it into a pandas DataFrame.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
iris = load_iris()
y = iris['target']
iris_features = iris['feature_names']
iris = pd.DataFrame(iris['data'], columns=iris['feature_names'])
I then select all of the data via the original ordered feature names. I train and evaluate the model.
X = iris[iris_features].values
print(X.shape[1], iris_features)
# 4 ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.95, random_state=0)
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print(np.mean(y_test == pred))
# 0.7062937062937062
I then select a different order of the same columns, train, and evaluate the model again. Why do I still get different results?
X = iris[iris_features[2:]+iris_features[:2]].values
print(X.shape[1], iris_features[2:]+iris_features[:2])
# 4 ['petal length (cm)', 'petal width (cm)', 'sepal length (cm)', 'sepal width (cm)']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.95, random_state=0)
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print(np.mean(y_test == pred))
# 0.8881118881118881