
According to here, sklearn cannot handle categorical variables, and using one-hot encoding to deal with such features is suggested. However, I do not understand how one-hot encoding helps. For example, country = USA or China or England is transformed into a feature country==USA that is true or false, but the new feature 'country==USA' is still categorical after all (it can only take 0 or 1). That does not change anything: sklearn still treats the 0 or 1 as numerical values.
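To make the transformation concrete, here is a minimal sketch of what one-hot encoding does to the country feature described above (the rows are invented for illustration):

```python
# Sketch: one-hot encoding turns one categorical "country" column
# into one 0/1 indicator column per category.
from sklearn.preprocessing import OneHotEncoder

countries = [["USA"], ["China"], ["England"], ["USA"]]
enc = OneHotEncoder()
X = enc.fit_transform(countries).toarray()  # dense 0/1 matrix

print(enc.categories_)  # column order: China, England, USA
print(X)                # one row per sample, exactly one 1 per row
```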

For a real example here, I transformed data:

human,warm-blooded,hair,yes,no,no,yes,no,mammal
python,cold-blooded,scales,no,no,no,no,yes,reptile
salmon,cold-blooded,scales,no,yes,no,no,no,fish
whale,warm-blooded,hair,yes,yes,no,no,no,mammal
frog,cold-blooded,none,no,semi,no,yes,yes,amphibian
komodo dragon,cold-blooded,scales,no,no,no,yes,no,reptile
bat,warm-blooded,hair,yes,no,yes,yes,yes,mammal
pigeon,warm-blooded,feathers,no,no,yes,yes,no,bird
cat,warm-blooded,fur,yes,no,no,yes,no,mammal
leopard shark,cold-blooded,scales,yes,yes,no,no,no,fish
turtle,cold-blooded,scales,no,semi,no,yes,no,reptile
penguin,warm-blooded,feathers,no,semi,no,yes,no,bird
porcupine,warm-blooded,quills,yes,no,no,yes,yes,mammal
eel,cold-blooded,scales,no,yes,no,no,no,fish
salamander,cold-blooded,none,no,semi,no,yes,yes,amphibian
gila monster,cold-blooded,scales,no,no,no,yes,yes,

into

[[1. 0. 1. 0. 0. 0. 1. 0. 1. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0.]
 [1. 0. 1. 0. 0. 1. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 1. 1. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 1. 0. 1. 0. 1. 1. 0. 1. 0. 0. 0. 1. 0. 0. 0.]
 [1. 0. 0. 1. 0. 1. 0. 1. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 0.]
 [1. 0. 1. 0. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1.]
 [0. 1. 1. 0. 0. 0. 1. 0. 1. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0.]
 [0. 1. 1. 0. 0. 0. 1. 1. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0.]
 [1. 0. 1. 0. 0. 0. 1. 0. 1. 0. 1. 1. 0. 0. 1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 1. 1. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1.]
 [1. 0. 0. 1. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1.]
 [1. 0. 0. 1. 0. 0. 1. 1. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0.]
 [1. 0. 1. 0. 0. 0. 1. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 1. 1. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1.]
 [1. 0. 0. 1. 0. 1. 0. 1. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 0.]
 [1. 0. 1. 0. 0. 1. 0. 1. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1.]]

and build a decision tree like

[decision tree image] The split points are still ridiculous (like Give Birth=no <= 0.5). So I do not think one-hot encoding helps deal with categorical data at all.

Naomi
  • 127
  • 10
  • What they probably mean is that sklearn cannot handle features with multiple categories. sklearn converts everything to floats, so 1s and 0s are fine, but something like a three-label class (data ∈ {0,1,2}) will have an implicit order. I think you're fine here. – ChootsMagoots Mar 16 '18 at 18:28

1 Answer


First of all, keep in mind that sklearn can only build binary trees. For example, suppose there is a color feature that takes the values 0,1,2,3,4,5 for different colors. If the tree splits on color<=2.5, then 0,1,2 go to the left child and 3,4,5 go to the right child, which is not what we want, because there is no order among the colors. If the feature really is ordered, I think we can do without one-hot encoding.
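A small sketch of the problem above: with integer-coded colors, the tree can only ask threshold questions, so it is forced to treat the codes as ordered numbers (the labels here are invented for illustration):

```python
# Sketch: a decision tree on an integer-coded "color" feature.
# Every internal node tests "color <= t" for some threshold t,
# which implicitly groups e.g. {0,1,2} vs {3,4,5} -- an order
# that does not exist for colors.
from sklearn.tree import DecisionTreeClassifier

X = [[0], [1], [2], [3], [4], [5]]  # six color codes
y = [1, 0, 0, 1, 0, 1]              # toy labels

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# All split thresholds are numeric cut points between color codes.
print(tree.tree_.threshold[tree.tree_.threshold > 0])
```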

One-hot encoding makes it look as if sklearn can handle categorical data. For example, if there really is a feature "country==USA" in the data that takes 0 or 1, and it is chosen as the splitting feature, then the two children are 'country==USA is 0' and 'country==USA is 1'. Sklearn still uses a numerical splitting point such as 0.5 (the splitting point must lie between 0 and 1, or it would not be a good split): rows with country==USA<=0.5 go to the left child and the rest go to the right child, so 'country==USA is 0' ends up on the left and 'country==USA is 1' on the right. This has the same effect as splitting the categorical feature "country==USA" on its two values 0 and 1.
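The argument above can be checked directly: with a single 0/1 "country==USA" indicator column, the only sensible threshold is 0.5, which sends the 0-rows and 1-rows to different leaves, exactly a USA / not-USA categorical split (the labels are invented for illustration):

```python
# Sketch: a binary indicator column split at 0.5 behaves exactly
# like a categorical split on its two values.
from sklearn.tree import DecisionTreeClassifier

X = [[1], [0], [0], [1]]  # country==USA indicator (0 or 1)
y = [1, 0, 0, 1]          # toy labels that follow the indicator

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# The root split threshold is the midpoint between 0 and 1:
print(tree.tree_.threshold[0])  # country==USA <= 0.5
```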
