
When changing the order of the columns of the input to sklearn's DecisionTreeClassifier, the accuracy appears to change. This shouldn't be the case. What am I doing wrong?

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import numpy as np

iris = load_iris()

X = iris['data']
y = iris['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.90, random_state=0)


# Baseline: original column order
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

# Columns rotated by one position (first column moved to the end)
clf = DecisionTreeClassifier(random_state=0)
clf.fit(np.hstack((X_train[:, 1:], X_train[:, :1])), y_train)
print(clf.score(X_test, y_test))

# Columns rotated by two positions
clf = DecisionTreeClassifier(random_state=0)
clf.fit(np.hstack((X_train[:, 2:], X_train[:, :2])), y_train)
print(clf.score(X_test, y_test))

# Columns rotated by three positions
clf = DecisionTreeClassifier(random_state=0)
clf.fit(np.hstack((X_train[:, 3:], X_train[:, :3])), y_train)
print(clf.score(X_test, y_test))

Running this code produces the following output:

0.9407407407407408
0.22962962962962963
0.34074074074074073
0.3333333333333333

This was asked 3 years ago, but that question got downvoted because no code was provided: Does feature order impact Decision tree algorithm in sklearn?


Edit

In the above code I forgot to apply the column reordering to the test data.

However, I have found that the different results persist even when the reordering is applied to the whole dataset before splitting.

First, I load the data and turn it into a pandas DataFrame.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

iris = load_iris()
y = iris['target']
iris_features = iris['feature_names']
iris = pd.DataFrame(iris['data'], columns=iris['feature_names'])

I then select all of the data via the original ordered feature names. I train and evaluate the model.

X = iris[iris_features].values
print(X.shape[1], iris_features)
# 4 ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.95, random_state=0)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

print(np.mean(y_test == pred))
# 0.7062937062937062

I then select a different order of the same columns to train and evaluate the model. Why do I still get different results?

X = iris[iris_features[2:] + iris_features[:2]].values
print(X.shape[1], iris_features[2:] + iris_features[:2])
# 4 ['petal length (cm)', 'petal width (cm)', 'sepal length (cm)', 'sepal width (cm)']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.95, random_state=0)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

print(np.mean(y_test == pred))
# 0.8881118881118881


1 Answer


You missed applying the column reordering to the test data (X_test). When you apply the same reordering to the test data, you get the same score.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import numpy as np

iris = load_iris()

X = iris['data']
y = iris['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.90, random_state=0)


def shuffle_data(data, n):
    # Rotate the columns: move the first n columns to the end
    return np.hstack((data[:, n:], data[:, :n]))

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
# 0.9407407407407408

clf = DecisionTreeClassifier(random_state=0)
clf.fit(shuffle_data(X_train,1), y_train)
print(clf.score(shuffle_data(X_test,1), y_test))
# 0.9407407407407408

clf = DecisionTreeClassifier(random_state=0)
clf.fit(shuffle_data(X_train,2), y_train)
print(clf.score(shuffle_data(X_test,2), y_test))
# 0.9407407407407408

clf = DecisionTreeClassifier(random_state=0)
clf.fit(shuffle_data(X_train,3), y_train)
print(clf.score(shuffle_data(X_test,3), y_test))
# 0.9407407407407408

Update:

In your second example, you set test_size to 0.95, which leaves you with only 7 training points, and their classes are array([0, 0, 0, 2, 1, 2, 0]).

If you measure the training score of the decision tree in both cases, it is 1.0. This tells us that the model has found a perfect separation of the training points in both scenarios.
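
A quick check, reusing X_train, y_train and clf from the second example above:

# Only 7 training points remain after the 0.95/0.05 split
print(y_train.shape)  # (7,)
print(y_train)        # [0 0 0 2 1 2 0]

# The tree separates these 7 points perfectly, whatever the column order
print(clf.score(X_train, y_train))  # 1.0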

The simple answer is yes: the results can vary when the column order changes whenever several different combinations of rules (different splitting conditions) all lead to a perfect separation of the training points (100% training accuracy).

Using plot_tree we can visualise the tree (a short sketch is given at the end of this answer). To see why the trees differ, we need to understand the implementation of DecisionTreeClassifier. This answer quotes the important point from the documentation:

The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.

The point to concentrate on here is that practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm, where locally optimal decisions are made at each node. With a greedy algorithm, when two candidate splits give exactly the same impurity improvement, the split that happens to be evaluated first is kept, so the order of the columns can impact the result.
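
As a minimal sketch of this tie-breaking effect (the toy data below is made up for illustration): both columns separate the two classes perfectly, so their candidate splits are exactly tied, and swapping the columns can change which original feature ends up at the root.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Two features that both separate the classes perfectly, with different
# value ranges so that the learned threshold tells us which one was used
X = np.array([[0., 10.],
              [1., 20.],
              [0., 10.],
              [1., 20.]])
y = np.array([0, 1, 0, 1])

clf_a = DecisionTreeClassifier(random_state=0).fit(X, y)
clf_b = DecisionTreeClassifier(random_state=0).fit(X[:, ::-1], y)  # columns swapped

# The root threshold reveals the chosen feature: a value near 0.5 means
# the 0/1 feature won the tie, a value near 15 means the 10/20 feature won
print(clf_a.tree_.threshold[0], clf_b.tree_.threshold[0])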

At the same time, when there are more data points in your dataset (which is not the case in your example), such exact ties become rare, and it is highly unlikely that changing the order of the columns will change the results.

Even in this example, when we set test_size=0.90, we get the same score of 0.9407407407407408 for every column order.
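
Finally, the plot_tree sketch mentioned above. This is a minimal example, reusing clf and iris_features from the question's second code block; the feature names must be passed in the same order as the columns used for training, otherwise the plot labels are wrong.

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Label the nodes with the reordered feature names and the iris class names
plot_tree(clf,
          feature_names=iris_features[2:] + iris_features[:2],
          class_names=['setosa', 'versicolor', 'virginica'],
          filled=True)
plt.show()

Plotting the tree for each column order makes the different (but equally pure) splits visible.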
