As already stated, you should normally do one-hot encoding before splitting.
But there is another problem. One day you will surely want to apply your trained ML model to data in the wild, meaning data that you have not seen before. For that data you need to do exactly the same transformation into dummies as when you trained the model.
You then have to deal with two cases:
- Case 1: the new data contains categories that you did not have in your training data.
- Case 2: the other way round, a category no longer appears in your dataset, but your model has been trained with it.
In case 1 you should simply ignore the value, since your model most likely cannot deal with a category it was not trained on. In case 2 you should still generate the empty categories, so that the data you want to predict on has the same structure as your training set. Note that the pandas method, get_dummies, would not generate dummies for these missing categories. It therefore cannot guarantee that your prediction data has the same structure as your training data, and your model will most likely not be applicable to it.
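To make the problem concrete, here is a small sketch of how get_dummies produces different columns for training and prediction data. The frames `train` and `new` and their values are made up for illustration only.

```python
import pandas as pd

# Hypothetical training data and new data with a different set of categories
train = pd.DataFrame({'x': [1, 2, 3]})
new = pd.DataFrame({'x': [1, 1, 3, 5]})  # 2 is missing (case 2), 5 is unseen (case 1)

# get_dummies only knows the categories present in the frame it is given
train_dummies = pd.get_dummies(train['x'], prefix='x')
new_dummies = pd.get_dummies(new['x'], prefix='x')

print(list(train_dummies.columns))  # ['x_1', 'x_2', 'x_3']
print(list(new_dummies.columns))    # ['x_1', 'x_3', 'x_5']
```

The two column sets differ, so a model trained on the first layout cannot be applied directly to the second.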
You can address this by using the sklearn equivalent to get_dummies (with just a little more work), which looks like this:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# create some example data
df = pd.DataFrame({'x': [1, 2, 3], 'y': [2, 4, 8]})
# create a one hot encoder to create the dummies and fit it to the data
# (sparse_output replaces the older sparse argument from scikit-learn 1.2 on)
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
ohe.fit(df[['x']])
# now let's simulate the two cases
# case 2: the category 2 no longer appears in the data
df.loc[1, 'x'] = 1
# case 1: a new, unseen category 5 shows up
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
df = pd.concat([df, pd.DataFrame([dict(x=5, y=5)])], ignore_index=True)
# the actual feature generation is done in a separate step
tr = ohe.transform(df[['x']])
# if you need the columns in your existing data frame, you can glue them together
df2 = pd.DataFrame(tr, columns=['oh1', 'oh2', 'oh3'], index=df.index)
result = pd.concat([df, df2], axis='columns')
With the sklearn OneHotEncoder you can separate the identification of the categories from the actual one-hot encoding (the creation of the dummies). You can also save the fitted one hot encoder, so that you can apply it later when you use your model. Note the handle_unknown option: setting it to 'ignore' tells the encoder that if it encounters something unknown later, it should just ignore it (the row then gets zeros in all dummy columns) instead of raising an error.
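Saving and restoring the fitted encoder could look roughly like the sketch below. It uses pickle from the standard library, which is my own choice for the example (joblib is a common alternative), and the frames `train` and `new` are made up.

```python
import pickle

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Fit the encoder once on the (hypothetical) training data
train = pd.DataFrame({'x': [1, 2, 3]})
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(train[['x']])

# Serialize the fitted encoder (to bytes here; writing to a file works the same way)
blob = pickle.dumps(ohe)

# Later, at prediction time: restore it and transform the new data
restored = pickle.loads(blob)
new = pd.DataFrame({'x': [1, 5]})  # 5 was never seen during fit
encoded = restored.transform(new[['x']]).toarray()  # default output is sparse
print(encoded)
```

The restored encoder still produces three columns, and the unknown value 5 ends up as a row of zeros. Keep in mind that a pickled encoder should be loaded with the same scikit-learn version it was saved with.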