2

I am wondering what is the difference between pandas' get_dummies() encoding of categorical features as compared to the sklearn's OneHotEncoder().

I've seen answers that mention that get_dummies() cannot produce encoding for categories not seen in the training dataset (answers here). However, this is a result of having performed the get_dummies() separately on the testing and training datasets (which can give inconsistent shapes). On the other hand, if we applied the get_dummies() on the original dataset, before splitting it, I think the two methods should give identical results. Am I wrong? Would that cause problems?

My code is currently working like the one below:

def one_hot_encode(ds,feature):
    #get DF of dummy variables
    dummies = pd.get_dummies(ds[feature])
    #One dummy variable to drop (Dummy Trap)
    dummyDrop = dummies.columns[0]
    #Create a DF from the original and the dummies' DF
    #Drop the original categorical variable and the one dummy
    final =  pd.concat([ds,dummies], axis='columns').drop([feature,dummyDrop], axis='columns')
    return final

#Get data DF
dataset = pd.read_csv("census_income_dataset.csv")
columns = dataset.columns

#Perform one-hot-encoding on the DF (See function above) on categorical features
features = ["workclass","marital_status","occupation","relationship","race","sex","native_country"]
for f in features:
    dataset = one_hot_encode(dataset,f)
#Re-order to get ouput feature in last column
dataset = dataset[[c for c in dataset.columns if c!="income_level"]+["income_level"]]
dataset.head()
Chris X
  • 21
  • 1

1 Answers1

1

If you apply get_dummies() and OneHotEncoder() in the general dataset, you should obtain the same result.

If you apply get_dummies() in the general dataset, and OneHotEncoder() in the train dataset, you will probably obtain a few (very small) differences if in the test data you have a "new" category. If not, they should have the same result.

The main difference between get_dummies() and OneHotEncoder() is how they work when you are using this model in real life (or in production) and your receive a "new" class of a categorical column that you haven't faced before

Example: Imagine your category "sex" can be only: male or female, and you sold your model to a company. What will happen if now, the category "sex" receives the value: "NA" (not applicable)? (Also, you can image that "NA" is an option, but it only appear 0.001%, and casually, you don't have any of this value in your dataset)

Using get_dummies(), you will have a problem, since your model is trained for only 2 different categories of sex, and now, you have a different and new category that the model can't hand with it.

Using OneHotEncoder(), will allow you to "ignore" this new category that your model can't face, allowing you to keep the same shape between the model input, and your new sample input.

That's why people uses OneHotEncoder() in train set and not in the general dataset, they are "simulating" this type of success (having "new" class you haven't faced before in a categorical column)

Alex Serra Marrugat
  • 1,849
  • 1
  • 4
  • 14
  • It is pretty easy to maintain only features in your train set using `get_dummies` with a simple join to a dataframe, or using `not in` if working with `numpy`, etc... – artemis Jan 07 '21 at 13:10