
According to here, sklearn cannot handle categorical variables, and using one-hot encoding to deal with such features is suggested. However, I do not understand how one-hot encoding helps. For example, country = USA or China or England is transformed into a feature country==USA that is true or false, but the new feature 'country==USA' is still categorical after all (it can only take 0 or 1). That does not change anything: sklearn still treats the 0 or 1 as numerical values.
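To make the transformation concrete, here is a minimal sketch of what one-hot encoding does to the country feature described above (the rows are invented for illustration):

```python
# Sketch: one-hot encoding turns one categorical "country" column
# into one 0/1 indicator column per category.
from sklearn.preprocessing import OneHotEncoder

countries = [["USA"], ["China"], ["England"], ["USA"]]
enc = OneHotEncoder()
X = enc.fit_transform(countries).toarray()  # dense 0/1 matrix

print(enc.categories_)  # column order: China, England, USA
print(X)                # one row per sample, exactly one 1 per row
```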

For a real example here, I transformed data:

human,warm-blooded,hair,yes,no,no,yes,no,mammal
python,cold-blooded,scales,no,no,no,no,yes,reptile
salmon,cold-blooded,scales,no,yes,no,no,no,fish
whale,warm-blooded,hair,yes,yes,no,no,no,mammal
frog,cold-blooded,none,no,semi,no,yes,yes,amphibian
komodo dragon,cold-blooded,scales,no,no,no,yes,no,reptile
bat,warm-blooded,hair,yes,no,yes,yes,yes,mammal
pigeon,warm-blooded,feathers,no,no,yes,yes,no,bird
cat,warm-blooded,fur,yes,no,no,yes,no,mammal
leopard shark,cold-blooded,scales,yes,yes,no,no,no,fish
turtle,cold-blooded,scales,no,semi,no,yes,no,reptile
penguin,warm-blooded,feathers,no,semi,no,yes,no,bird
porcupine,warm-blooded,quills,yes,no,no,yes,yes,mammal
eel,cold-blooded,scales,no,yes,no,no,no,fish
salamander,cold-blooded,none,no,semi,no,yes,yes,amphibian
gila monster,cold-blooded,scales,no,no,no,yes,yes,

into

[[1. 0. 1. 0. 0. 0. 1. 0. 1. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0.]
 [1. 0. 1. 0. 0. 1. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 1. 1. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 1. 0. 1. 0. 1. 1. 0. 1. 0. 0. 0. 1. 0. 0. 0.]
 [1. 0. 0. 1. 0. 1. 0. 1. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 0.]
 [1. 0. 1. 0. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1.]
 [0. 1. 1. 0. 0. 0. 1. 0. 1. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0.]
 [0. 1. 1. 0. 0. 0. 1. 1. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0.]
 [1. 0. 1. 0. 0. 0. 1. 0. 1. 0. 1. 1. 0. 0. 1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 1. 1. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1.]
 [1. 0. 0. 1. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1.]
 [1. 0. 0. 1. 0. 0. 1. 1. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0.]
 [1. 0. 1. 0. 0. 0. 1. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 1. 1. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1.]
 [1. 0. 0. 1. 0. 1. 0. 1. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 0.]
 [1. 0. 1. 0. 0. 1. 0. 1. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1.]]

and build a decision tree like

[decision tree image] The split points are still ridiculous (like Give Birth=no <= 0.5). So I do not think one-hot encoding helps deal with categorical data at all.

Naomi
  • 127
  • 10
  • What they probably mean is that sklearn cannot handle features with multiple categories. sklearn converts everything to floats, so 1s and 0s are fine, but something like a three-label class (data ∈ {0,1,2}) will have an implicit order. I think you're fine here. – ChootsMagoots Mar 16 '18 at 18:28

1 Answer


First of all, keep in mind that sklearn can only build binary trees. For example, suppose there is a color feature that takes the values 0,1,2,3,4,5 for different colors. If the tree splits on color<=2.5, then 0,1,2 go to the left child and 3,4,5 go to the right child, which is not what we want, because there is no order among the colors. If the feature really is ordered, I think we can do without one-hot encoding.
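A small sketch of the problem above: with integer-coded colors, the tree can only ask threshold questions, so it is forced to treat the codes as ordered numbers (the labels here are invented for illustration):

```python
# Sketch: a decision tree on an integer-coded "color" feature.
# Every internal node tests "color <= t" for some threshold t,
# which implicitly groups e.g. {0,1,2} vs {3,4,5} -- an order
# that does not exist for colors.
from sklearn.tree import DecisionTreeClassifier

X = [[0], [1], [2], [3], [4], [5]]  # six color codes
y = [1, 0, 0, 1, 0, 1]              # toy labels

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# All split thresholds are numeric cut points between color codes.
print(tree.tree_.threshold[tree.tree_.threshold > 0])
```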

One-hot encoding makes it look as if sklearn can handle categorical data. For example, if there really is a feature "country==USA" in the data that takes 0 or 1, and it is chosen as the splitting feature, then the two children are 'country==USA is 0' and 'country==USA is 1'. Sklearn still uses a numerical splitting point such as 0.5 (the splitting point must lie between 0 and 1, or it would not be a good split): rows with country==USA<=0.5 go to the left child and the rest go to the right child, so 'country==USA is 0' ends up on the left and 'country==USA is 1' on the right. This has the same effect as splitting the categorical feature "country==USA" on its two values 0 and 1.
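The argument above can be checked directly: with a single 0/1 "country==USA" indicator column, the only sensible threshold is 0.5, which sends the 0-rows and 1-rows to different leaves, exactly a USA / not-USA categorical split (the labels are invented for illustration):

```python
# Sketch: a binary indicator column split at 0.5 behaves exactly
# like a categorical split on its two values.
from sklearn.tree import DecisionTreeClassifier

X = [[1], [0], [0], [1]]  # country==USA indicator (0 or 1)
y = [1, 0, 0, 1]          # toy labels that follow the indicator

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# The root split threshold is the midpoint between 0 and 1:
print(tree.tree_.threshold[0])  # country==USA <= 0.5
```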
