I want to recode categorical variables before fitting an ML model in scikit-learn. I will use the variables for modeling a decision tree, not for visualizations.
I have read the sklearn docs on converting categorical variables before modeling. I could either
* use the pandas get_dummies function (although that would make the syntax difficult for the ordered columns, since the argument is a bit clumsy?),
* or use sklearn's built-in LabelEncoder and OneHotEncoder classes.
Why would I use sklearn when this can be done in pandas in a single line?
pd.get_dummies(data=df, columns=['col1', 'col2'], drop_first=True)
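For context, here is a minimal runnable sketch of that one-liner; the DataFrame and the values in 'col1'/'col2' are made up for illustration:

```python
import pandas as pd

# hypothetical sample data
df = pd.DataFrame({'col1': ['a', 'b', 'a'],
                   'col2': ['x', 'x', 'y']})

# one dummy column per category, minus the first of each (drop_first=True)
dummies = pd.get_dummies(data=df, columns=['col1', 'col2'], drop_first=True)
print(dummies.columns.tolist())  # ['col1_b', 'col2_y']
```

With two categories per column and drop_first=True, each original column collapses to a single dummy column.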
Here is how I would do it in sklearn.
# step i) label encoder, to go from strings to integers
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
df[colcat] = le.fit_transform(df[colcat])  # colcat is a single column name; input must be 1-D

# step ii) one hot encoder, to go from integers to dummies
enc = preprocessing.OneHotEncoder()
dummies = enc.fit_transform(df[[colcat]])  # expects 2-D input; returns a sparse matrix
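As a side note, in recent scikit-learn versions (0.20 and later) OneHotEncoder accepts string columns directly, so the LabelEncoder step can be skipped entirely; a sketch, with a made-up 'colcat' column:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# hypothetical sample data
df = pd.DataFrame({'colcat': ['red', 'blue', 'red']})

enc = OneHotEncoder()  # handles strings directly in scikit-learn >= 0.20
dummies = enc.fit_transform(df[['colcat']])  # 2-D input; sparse output, categories sorted
print(dummies.toarray())
```

Categories are ordered alphabetically, so the first dummy column here is 'blue' and the second 'red'.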
# automate steps i and ii
def labelencoder_and_onehotencoder(x):
    le = preprocessing.LabelEncoder()
    x = le.fit_transform(x)  # strings -> integers (1-D)
    enc = preprocessing.OneHotEncoder()
    # OneHotEncoder expects a 2-D array and returns a sparse dummy matrix
    return enc.fit_transform(x.reshape(-1, 1))

cols_categorical = ['col1', 'col2']
# each column yields its own dummy matrix, which cannot be assigned back into df
encoded = {col: labelencoder_and_onehotencoder(df[col]) for col in cols_categorical}
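Put together, a minimal runnable version of the sklearn route might look like this (the DataFrame and column names are hypothetical); since each column's one-hot output is a separate matrix rather than a single column, the pieces have to be stacked side by side instead of assigned back into df:

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# hypothetical sample data
df = pd.DataFrame({'col1': ['a', 'b', 'a'],
                   'col2': ['x', 'x', 'y']})

def labelencoder_and_onehotencoder(x):
    le = preprocessing.LabelEncoder()
    x = le.fit_transform(x)                     # strings -> integers (1-D)
    enc = preprocessing.OneHotEncoder()
    return enc.fit_transform(x.reshape(-1, 1))  # integers -> dummies (sparse, 2-D input)

cols_categorical = ['col1', 'col2']
# densify each per-column dummy matrix and stack them into one feature array
parts = [labelencoder_and_onehotencoder(df[c]).toarray() for c in cols_categorical]
dummies = np.hstack(parts)
print(dummies.shape)  # (3, 4): two dummy columns per categorical column
```

This is noticeably more ceremony than the pandas one-liner, which is exactly what the question is about.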