I have a training set that I'm using to build some machine learning models and I need to set up some code to predict on a test set (that I don't have access to).

For instance, if I have a DataFrame, train:

    car
0   Audi
1   BMW
2   Mazda

I can use pd.get_dummies to get:

   car_Audi car_BMW car_Mazda
0      1       0       0
1      0       1       0
2      0       0       1

Call the resulting DataFrame train_encoded.

Now, suppose my test DataFrame looks like:

    car
0   Mercedes

I can use:

pd.get_dummies(test).reindex(columns=train_encoded.columns, fill_value=0)

to get:

   car_Audi car_BMW car_Mazda
0      0       0       0

How can I treat NaNs the same as an unseen value in my car column? That is, if I encounter NaN in the car column of test, I want to get back:

   car_Audi car_BMW car_Mazda
0      0       0       0

Thanks!

anon_swe
  • `df.car=df.car.fillna('NAN'); pd.get_dummies(test).reindex(columns=train_encoded.columns)` – BENY Apr 29 '18 at 15:57
  • @Wen Won't I have an extra column in test after getting dummies if test has a NaN in `car` but train doesn't? – anon_swe Apr 29 '18 at 21:25

1 Answer

If you generate a string filler that does not appear in the car column, then, slightly modifying Wen's suggestion in the comments (to cover the case where 'NAN' is already a value in the column), you can use

test['car'] = test['car'].fillna(filler)
pd.get_dummies(test).reindex(columns=train_encoded.columns, fill_value=0)
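
To make the fill-then-reindex idea concrete, here is a minimal end-to-end sketch. The sample frames, the placeholder string `"__missing__"`, and the final `astype(int)` step are assumptions added for illustration, not part of the original answer. It also suggests that the extra dummy column created by the filler does not survive, because `reindex` keeps only the training columns.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data mirroring the question.
train = pd.DataFrame({"car": ["Audi", "BMW", "Mazda"]})
test = pd.DataFrame({"car": ["Mercedes", np.nan, "BMW"]})

train_encoded = pd.get_dummies(train)

# Assumed placeholder; any string that is not a training category works here.
filler = "__missing__"

# Replace NaN with the placeholder so it is encoded like any unseen value.
test_filled = test.assign(car=test["car"].fillna(filler))

# Keep only the training columns; columns absent from test are filled with 0,
# and columns that exist only in test (Mercedes, the filler) are dropped.
test_encoded = pd.get_dummies(test_filled).reindex(
    columns=train_encoded.columns, fill_value=0
)

# Normalize to integers, since recent pandas versions return boolean dummies.
test_encoded = test_encoded.astype(int)
print(test_encoded)
```

In this sketch both the unseen value and the NaN row come out as all zeros, while the known category BMW is encoded as usual.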

One way to define filler, if you have access to all of df.car in advance, is via

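filler = '_' + ''.join(df.car.dropna().unique())

because the result is at least one character longer than the longest string in the column, so it cannot equal any existing value. (The dropna() is needed because ''.join fails on NaN.) Another way is to use a random string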

import random, string
filler = ''.join(random.choice(string.ascii_lowercase) for _ in range(10))

The probability that this random filler collides with a value already in the column is at most len(df) / 26 ** 10.
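
To get a feel for how small that bound is, the arithmetic can be sketched as follows. The row count of one million is a hypothetical figure chosen for illustration, and the bound assumes the existing values are unrelated to the randomly drawn filler.

```python
# Hypothetical number of rows in the column (an assumption for illustration).
n_values = 1_000_000

alphabet_size = 26   # lowercase ASCII letters
length = 10          # length of the random filler

# Number of distinct lowercase strings of the given length.
n_strings = alphabet_size ** length

# Union bound on the chance that the filler equals one of the existing values.
bound = n_values / n_strings
print(f"collision probability is at most {bound:.2e}")
```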

Ami Tavory
  • If I have no NaNs in my `car` column for my training set but do have NaNs in my `car` column for test set, I'll have an extra column in my one-hot encoded test, right? – anon_swe Apr 29 '18 at 21:23