Let's say I'm training a model to detect churn, and the dataset has the following features (very simplified). I have males and females, who have signed up either online or by post.

ID  source  Gender  Churn
1   Online  M       1
2   Post    M       1
3   Online  M       1
4   Online  F       0
5   Post    F       0

And I apply pandas get_dummies:

ID  source_Online  source_Post  Gender_M  Gender_F
1   1              0            1         0
2   0              1            1         0
3   1              0            1         0
4   1              0            0         1
5   0              1            0         1
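
For reference, here is a minimal sketch of that encoding step. The frame name `train` and the list `train_columns` are names made up for this illustration, and `ID` and `Churn` are assumed to be kept out of the feature matrix. Note that pandas sorts the categories within each variable, so the column order it produces may differ from the table above.

```python
import pandas as pd

# Toy training data copied from the table in the question.
train = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "source": ["Online", "Post", "Online", "Online", "Post"],
    "Gender": ["M", "M", "M", "F", "F"],
    "Churn": [1, 1, 1, 0, 0],
})

# One-hot encode only the categorical feature columns.
X_train = pd.get_dummies(train[["source", "Gender"]], dtype=int)

# Remember the exact column layout the scaler and model will be fitted on.
train_columns = list(X_train.columns)
print(train_columns)
```

Keeping `train_columns` around (for example, saved next to the fitted model) is what makes it possible to check new data against the training layout later.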

Now let's say I apply StandardScaler to this data and then train a model on it.

Some days later I get new data from the same database & schema and I have to predict churn. Exact same variables, except this time it has only males, who have only signed up online.

ID  source  Gender
1   Online  M
2   Online  M
3   Online  M

I apply get_dummies:

   ID  source_Online  Gender_M
0   1              1         1
1   2              1         1
2   3              1         1

First of all, the StandardScaler fitted on the training set doesn't work on this unseen data, because some of the dummy columns are missing. And of course the trained model doesn't work either, for the same reason.

Is there any way around this?

I currently have hundreds of records with this problem, because they are missing a single category within a variable that was present in the training set. (In this simplified example we are missing females.)

SCool
  • This looks like a duplicate of this question: https://stackoverflow.com/questions/41335718/keep-same-dummy-variable-in-training-and-testing-data – praneeth Nov 20 '19 at 17:56
  • Take a look at the answer I provided in this similar question: https://stackoverflow.com/questions/58799643/sklearn-logistic-regression-valueerror-x-has-42-features-per-sample-expecting/58799980#58799980 – Chris Nov 21 '19 at 04:30
  • Does this answer your question? [One hot encoding train with values not present on test](https://stackoverflow.com/questions/57946006/one-hot-encoding-train-with-values-not-present-on-test) – MaximeKan Nov 22 '19 at 01:10
  • @MaximeKan `OneHotEncoder` requires that I convert all my categorical to numbers first, such as Gender: M/F to Gender 1/0 etc etc. This isn't very convenient, and then I lose the column names after using `onehotencoder`, so there's another extra step involved to get the column names back. So I stayed with pandas `get_dummies` and used the suggestions in @Praneeth's link. – SCool Nov 22 '19 at 10:38
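
One common way to stay with `get_dummies` and still match the training layout is to reindex the new frame onto the training columns. The sketch below is only an illustration and is not necessarily what the linked answers do. `train_columns` is a hypothetical list assumed to have been saved when the training data was encoded, and `new` is a made-up name for the prediction data.

```python
import pandas as pd

# Column layout assumed to have been captured from the encoded training set.
train_columns = ["source_Online", "source_Post", "Gender_F", "Gender_M"]

# New data from the question: only males who signed up online.
new = pd.DataFrame({
    "ID": [1, 2, 3],
    "source": ["Online", "Online", "Online"],
    "Gender": ["M", "M", "M"],
})

X_new = pd.get_dummies(new[["source", "Gender"]], dtype=int)

# Align to the training layout: dummy columns missing here are added
# and filled with 0, columns not seen in training are dropped, and the
# training column order is restored.
X_new = X_new.reindex(columns=train_columns, fill_value=0)
print(X_new)
```

After this step `X_new` has the same columns as the training matrix, so it can be passed to the already fitted scaler's `transform` and then to the model's `predict`.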
