
I want to recode categorical variables before fitting a scikit-learn model. I will use the variables to model a decision tree, not for visualizations.

I have read the sklearn docs on converting categorical variables before modeling. As I see it, I could either:

* use pandas' `get_dummies` function (although that makes the syntax difficult for the ordered columns, since the argument is a bit clumsy?), or
* use sklearn's built-in `LabelEncoder` and `OneHotEncoder` classes.

Why would I use sklearn when this can be done in pandas in a single line?

pd.get_dummies(data=df, columns=['col1', 'col2'], drop_first=True)
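For reference, here is a minimal sketch of what that one-liner produces (the frame and column names are illustrative, not from my real data):

```python
import pandas as pd

# toy frame with two string-valued categorical columns (illustrative)
df = pd.DataFrame({'col1': ['a', 'b', 'a'],
                   'col2': ['x', 'x', 'y']})

# one dummy column per category, with the first category of each
# column dropped to avoid perfect collinearity
dummies = pd.get_dummies(data=df, columns=['col1', 'col2'], drop_first=True)
print(list(dummies.columns))  # ['col1_b', 'col2_y']
```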

Here is how I would do it in sklearn.

# step i) label encoder, to go from strings to integers
le = preprocessing.LabelEncoder()
df[colcat] = le.fit_transform(df[colcat])

# step ii) one hot encoder, to go from integers to dummies
enc = preprocessing.OneHotEncoder()
encoded = enc.fit_transform(df[[colcat]])  # expects 2-D input; returns a sparse matrix

# automate step i and ii
def labelencoder_and_onehotencoder(x):
    le = preprocessing.LabelEncoder()
    x = le.fit_transform(x)
    enc = preprocessing.OneHotEncoder()
    x = enc.fit_transform(x.reshape(-1, 1))  # OneHotEncoder expects a 2-D array
    return x

cols_categorical = ['col1', 'col2']
df[cols_categorical] = df[cols_categorical].apply(labelencoder_and_onehotencoder)
jacob
  • Please look at discussions [here](https://stackoverflow.com/q/48090658/3374996), [here](https://stackoverflow.com/q/48201501/3374996), [here](https://stackoverflow.com/q/48320396/3374996) and [here](https://stackoverflow.com/q/48074462/3374996) – Vivek Kumar Jan 23 '18 at 14:07
  • 1
    You don't need to use both `LabelEncoder` and `OneHotEncoder`. Just use `OneHotEncoder`. The motivation for using `OneHotEncoder` over `get_dummies` is in building a [sklearn pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html), where you might apply some preprocessing to your data before applying one-hot-encoding, such as data imputation. – Scratch'N'Purr Jan 23 '18 at 14:07
  • @Scratch'N'Purr thank you for a good answer. – jacob Jan 23 '18 at 14:31
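To illustrate the pipeline motivation from the comment above, here is a minimal sketch, assuming a modern scikit-learn (0.20+), where `OneHotEncoder` accepts string columns directly and no `LabelEncoder` step is needed; the frame and column names are illustrative:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({'col1': ['a', 'b', 'a', 'b'],
                   'col2': ['x', 'x', 'y', 'y'],
                   'num':  [1.0, 2.0, 3.0, 4.0]})
y = [0, 1, 0, 1]

# one-hot encode only the categorical columns; pass numeric ones through
pre = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), ['col1', 'col2'])],
    remainder='passthrough')

# the encoder is fitted together with the model, so the same
# categories are reused at predict time
model = Pipeline([('pre', pre), ('tree', DecisionTreeClassifier())])
model.fit(df, y)
print(model.predict(df))
```

Because the encoder lives inside the pipeline, `model.predict` on new data automatically applies the categories learned during `fit`, which `get_dummies` cannot guarantee.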

1 Answer


What `get_dummies` does is convert categorical variables into binary (dummy) variables.

`LabelEncoder`, on the other hand, is typically used to enumerate the values of a categorical variable so they can be used directly when modeling, since sklearn estimators generally cannot handle string-valued categorical variables, as in this example:

>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"]) 
array([2, 2, 1])
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']

In the example you provided, you are using the encoders the same way as `get_dummies`, in which case the only difference between encoding and getting dummies is that a fitted encoder stores the conversion, so it can be reapplied later.

In my experience, there are times when your training and test datasets don't have the same set of categorical values; in that case an encoder fitted on one set will fail, or encode the variable inconsistently, on the other.
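A small sketch of that failure mode (the city names are just illustrative): a fitted `LabelEncoder` raises a `ValueError` when it meets a value it never saw during `fit`.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['paris', 'tokyo'])  # categories seen during training

# 'amsterdam' never appeared during fit, so transform raises a ValueError
try:
    le.transform(['amsterdam'])
    rejected = False
except ValueError:
    rejected = True
print('unseen label rejected:', rejected)
```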

usernamenotfound