I read some material:
However, what if I change the feature order in the data (for example, the feature names [a, b, c] become [b, a, c])? Does this actually affect the decision tree result?
Not really. Sklearn generally uses CART trees, where the best split is decided by picking the feature that minimizes a cost function. So the order of the columns doesn't really matter.
Reordering the columns can change the sklearn decision tree result. The issue is that for each split, max_features features are considered, and the one with the highest impact (e.g. the largest Gini reduction) is chosen.
However, if multiple features have the same impact (this is not as uncommon as it might seem, especially with a large number of binary features), one of them is chosen randomly.
In such a case, the order of the columns in your dataset may affect which feature is selected in the final decision tree. To prevent this, you need to both set a random seed (the random_state parameter) and sort your dataframe columns before fitting the tree.
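To make the tie-breaking argument concrete, here is a toy sketch in plain Python. It is not sklearn's actual code: `pick_split_feature` is a hypothetical helper that assumes columns are visited in a seeded random order of their positions and that a later column only replaces the current best if it is strictly better. Under those assumptions, two tied features swap roles when the columns are swapped, and sorting the column names first removes the dependence on the input order.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def gini_reduction(column, labels):
    """Impurity decrease from splitting on a binary (0/1) feature column."""
    left = [y for x, y in zip(column, labels) if x == 0]
    right = [y for x, y in zip(column, labels) if x == 1]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
    return gini(labels) - weighted


def pick_split_feature(data, names, labels, seed):
    """Hypothetical split chooser (an assumed simplification, not sklearn's code).

    Column positions are visited in a seeded random order; a column replaces
    the current best only if its Gini reduction is strictly larger, so among
    tied columns the one visited first wins.
    """
    positions = list(range(len(names)))
    random.Random(seed).shuffle(positions)
    best_name, best_gain = None, -1.0
    for pos in positions:
        gain = gini_reduction(data[names[pos]], labels)
        if gain > best_gain:
            best_name, best_gain = names[pos], gain
    return best_name


labels = [0, 0, 1, 1]
data = {
    "a": [0, 0, 1, 1],  # perfectly predictive
    "b": [0, 0, 1, 1],  # identical to "a", so the two features tie
    "c": [0, 1, 0, 1],  # uninformative
}

order_1 = ["a", "b", "c"]
order_2 = ["b", "a", "c"]

# Same seed, different column order: the tie is resolved by position,
# so the chosen feature name differs between the two orderings.
pick_1 = pick_split_feature(data, order_1, labels, seed=0)
pick_2 = pick_split_feature(data, order_2, labels, seed=0)
print(pick_1, pick_2)

# Sorting the column names first makes both runs see the same layout.
sorted_pick_1 = pick_split_feature(data, sorted(order_1), labels, seed=0)
sorted_pick_2 = pick_split_feature(data, sorted(order_2), labels, seed=0)
print(sorted_pick_1, sorted_pick_2)
```

With a real dataframe, the equivalent of `sorted(order_1)` would be something like reindexing the columns in sorted order before calling `fit`, together with a fixed `random_state`.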
More explanations can be found here (engelen's answer) and in this GitHub discussion.