
I read some material on decision trees.

However, I noticed that the feature order in my data can change (e.g. the list of feature names [a, b, c] becomes [b, a, c]). Does this actually affect the decision tree result?

Rachel Jennifer

2 Answers


Not really. Sklearn's decision trees use the CART algorithm, where the best split at each node is found by picking the feature and threshold that minimize a cost function (e.g. Gini impurity). So the order of the columns doesn't really matter.
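A quick sanity check of this (a sketch, assuming scikit-learn is available; the dataset and the particular column permutation are arbitrary choices): fit one tree on the original data and another on a column-permuted copy, then compare their predictions. When the splits are unambiguous, the two trees behave identically.

```python
# Sanity check: shuffling feature columns and refitting gives the same
# predictions when splits are unambiguous (continuous features, fixed seed).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Fit on the original column order [0, 1, 2, 3].
tree_a = DecisionTreeClassifier(random_state=0).fit(X, y)

# Fit on the same data with the columns reordered as [2, 0, 3, 1].
perm = [2, 0, 3, 1]
tree_b = DecisionTreeClassifier(random_state=0).fit(X[:, perm], y)

# Both trees make identical predictions on their respective layouts.
print((tree_a.predict(X) == tree_b.predict(X[:, perm])).all())
```

Note the hedge "when splits are unambiguous": tied splits are exactly the case discussed in the other answer below.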

  • Thank you, I have a follow-up question. If the data is very large (maybe 10^6 samples by 10^5 features), then the first split already requires computing the best splitter for every feature. That may be O(n^2), which is very slow. Is this correct? – Rachel Jennifer May 12 '17 at 15:39
  • Yes, a single CART decision tree will need to scan through all the data and all the features to find the best split. Usually you wouldn't code a tree yourself, since there are many optimised libraries that do that job for you pretty quickly. – Giovanni Bruner May 13 '17 at 18:07
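To see why the search has to touch every sample for every feature, here is a minimal sketch of the exhaustive threshold scan CART performs for one continuous feature (the function name `best_split_1d` is invented for illustration); the full algorithm repeats this for every feature at every node:

```python
import numpy as np

def best_split_1d(x, y):
    """Scan every candidate threshold of a single continuous feature and
    return the (threshold, weighted_gini) pair with the lowest impurity."""
    order = np.argsort(x)
    x, y = x[order], y[order]

    def gini(labels):
        # Gini impurity: 1 - sum of squared class probabilities.
        p = np.bincount(labels) / len(labels)
        return 1.0 - np.sum(p ** 2)

    best = (None, np.inf)
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # no threshold fits between equal values
        thr = float(x[i] + x[i - 1]) / 2
        # Weighted impurity of the left/right partitions at this threshold.
        w = (i * gini(y[:i]) + (len(y) - i) * gini(y[i:])) / len(y)
        if w < best[1]:
            best = (thr, w)
    return best

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split_1d(x, y))  # perfect split at threshold 6.5, impurity 0.0
```

Real implementations are smarter than this naive scan: they sort each feature once and update class counts incrementally, so a split costs roughly O(n_features · n_samples · log n_samples) rather than the quadratic worst case feared in the comment.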

Reordering columns can change the sklearn decision tree result. The issue is that at each split, max_features candidate features are considered and the feature with the highest impact (e.g. the largest Gini reduction) is chosen.

However, if multiple features have the same impact -- which is not as uncommon as it might seem, especially with a large number of binary features -- one of them is chosen at random.

In such a case, the order of the columns in your dataset may affect which feature is selected in the final decision tree. To prevent this, you need to both set a random seed and sort your dataframe columns before fitting the tree.
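A sketch of that remedy (assuming pandas and scikit-learn; the toy data and column names are invented): create a duplicated feature so every split on it ties, then sort the columns and fix `random_state` so refitting reproduces the same tree structure.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=100)

# Column "b" is an exact copy of "a": any split on one is an equally good
# split on the other, so the two features tie on Gini reduction everywhere.
df = pd.DataFrame({"b": a, "a": a, "target": rng.integers(0, 2, size=100) | a})

X = df.sort_index(axis=1).drop(columns="target")  # sorted columns: [a, b]
y = df["target"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
refit = DecisionTreeClassifier(random_state=0).fit(X, y)

# With a fixed seed and sorted columns, refitting reproduces the exact same
# sequence of split features (tree_.feature holds the feature index per node).
print((tree.tree_.feature == refit.tree_.feature).all())
```

Without the fixed seed, ties between "a" and "b" could resolve differently from run to run, and without the column sort they could resolve differently depending on how the dataframe was assembled.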

More explanation can be found here (engelen's answer) and in this GitHub discussion.

Dudelstein