Let's say I'm training a model to detect churn, and the dataset has the following features (very simplified). I have males and females, who have signed up online or by post.
ID source Gender Churn
1 Online M 1
2 Post M 1
3 Online M 1
4 Online F 0
5 Post F 0
And I apply pandas get_dummies:
ID source_Online source_Post Gender_M Gender_F
1 1 0 1 0
2 0 1 1 0
3 1 0 1 0
4 1 0 0 1
5 0 1 0 1
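For reference, this is roughly how I build the dummies. It is a minimal sketch on the toy data above; the names `train` and `X_train` are placeholders I made up for this example. Note that pandas sorts the categories within each variable, so the real column order puts Gender_F before Gender_M.

```python
import pandas as pd

# Toy version of the training data from the table above
train = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "source": ["Online", "Post", "Online", "Online", "Post"],
    "Gender": ["M", "M", "M", "F", "F"],
    "Churn": [1, 1, 1, 0, 0],
})

# One-hot encode the two categorical columns.
# dtype=int yields 0/1 columns instead of booleans.
X_train = pd.get_dummies(train[["source", "Gender"]], dtype=int)

print(X_train.columns.tolist())
```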
Now let's say I scale the features with StandardScaler and then fit a model on this data.
Some days later I get new data from the same database & schema and I have to predict churn. Exact same variables, except this time it has only males, who have only signed up online.
ID source Gender
1 Online M
2 Online M
3 Online M
I apply get_dummies:
ID source_Online Gender_M
1 1 1
2 1 1
3 1 1
First of all, the StandardScaler with the settings learned from the training set doesn't work on this unseen data, because it's missing some dummy variables. And of course it doesn't work with the trained model, for the same reason.
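Here is a minimal reproduction of the mismatch, assuming scikit-learn's StandardScaler and the toy data from this post. The variable names (`train`, `new`, `missing`, `error`) are just placeholders for illustration; my real pipeline has many more columns.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Training data: both sources and both genders are present
train = pd.DataFrame({
    "source": ["Online", "Post", "Online", "Online", "Post"],
    "Gender": ["M", "M", "M", "F", "F"],
})

# New data: only males who signed up online
new = pd.DataFrame({
    "source": ["Online", "Online", "Online"],
    "Gender": ["M", "M", "M"],
})

X_train = pd.get_dummies(train, dtype=int)  # 4 dummy columns
X_new = pd.get_dummies(new, dtype=int)      # only 2 dummy columns

# The scaler learns statistics for the 4 training columns
scaler = StandardScaler().fit(X_train)

# Dummy columns seen in training but absent from the new data
missing = sorted(set(X_train.columns) - set(X_new.columns))
print("Missing columns:", missing)

# Applying the fitted scaler to the narrower frame fails
try:
    scaler.transform(X_new)
    error = None
except ValueError as exc:
    error = str(exc)
print("Error:", error)
```

The same column mismatch shows up again when I pass the new data to the fitted model.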
Is there any way around this?
I currently have hundreds of records with this problem because they are missing a single category within a variable that was present in the training set. (In this simplified example, we are missing females.)