  • I have trained an ML model and saved it to a pickle file.
  • In my new script, I read new 'real world data' on which I want to make a prediction.

However, I am struggling. I have a column (containing string values), like:

Sex       
Male       
Female
# This is just an example; in reality the column has many more unique values

Now comes the issue: I received a new (unseen) value (e.g. 'Neutral' was added), and now I cannot make predictions anymore.

Since I transform the 'Sex' column into dummies, my model no longer accepts the input:

Number of features of the model must match the input. Model n_features is 2 and input n_features is 3

Therefore my question: is there a way to make my model robust and simply ignore this class, i.e. still make a prediction without that specific info?

What I have tried:

import pickle
import pandas as pd

df = pd.read_csv('dataset_that_i_want_to_predict.csv')
model = pickle.load(open("model_trained.sav", 'rb'))

# I have an 'example_df' containing just 1 row of training data (exactly the columns the model expects)
example_df = pd.read_csv('reading_one_row_of_trainings_data.csv')

# Checking for missing columns, and adding that to the new dataset 
missing_cols = set(example_df.columns) - set(df.columns)
for column in missing_cols:
    df[column] = 0  # add the missing columns with 0 values (which is OK, since everything is a dummy)

# make sure that we have the same order 
df = df[example_df.columns] 

# The prediction will lead to an error!
results = model.predict(df)

# ValueError: Number of features of the model must match the input. Model n_features is X and input n_features is Y

Note: I searched but could not find any helpful solution (not here, here, or here).

UPDATE

I also found this article, but it has the same issue: we can give the test set the same columns as the training set... but what about new real-world data (e.g. the new value 'Neutral')?
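For reference, the column-alignment loop from the question can be done in one step with `reindex`, which both adds missing dummy columns (filled with 0) and drops columns the model has never seen, such as a dummy for 'Neutral' (a sketch; the column names are hypothetical):

```python
import pandas as pd

# Dummy-encoded columns the model was trained on (illustrative names)
training_columns = ['Age', 'Sex_Female', 'Sex_Male']

# New real-world data containing the unseen category 'Neutral'
new_df = pd.DataFrame({'Age': [25, 30], 'Sex': ['Male', 'Neutral']})
new_dummies = pd.get_dummies(new_df, columns=['Sex'])

# Align with the training columns: missing dummies become 0,
# unseen dummies (Sex_Neutral) are dropped
aligned = new_dummies.reindex(columns=training_columns, fill_value=0)
print(list(aligned.columns))  # ['Age', 'Sex_Female', 'Sex_Male']
```

The 'Neutral' row then simply has 0 in both `Sex_Female` and `Sex_Male`, which is the same representation `OneHotEncoder(handle_unknown='ignore')` produces.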

  • If you filter out (remove) the entries with "Neutral", do the other entries generate predictions without error? – rickhg12hs Nov 19 '20 at 11:43
  • Hi Rick, yes. Since that column is transformed into dummy columns, we have columns called 'Sex_Male' and 'Sex_Female'. It looks like the model accepts a row where both values are 0. – R overflow Nov 19 '20 at 12:09
  • One quick fix for this (not really recommended, though) is to add another class "other" to your training data, perhaps generating some artificial data for the other features from your dataset. Then, whenever you get anything other than "male" or "female" in the feature "Sex", you preprocess it as "other" and feed it to the model. Still, this is not a good approach, as it wouldn't capture the intended concept widely enough and might hurt model performance. The easier and more reliable approach is to keep those nominal features fixed and not accept "other"; consider "gender". – null Nov 23 '20 at 08:31
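A minimal sketch of the "other"-bucket quick fix described in the last comment (hypothetical values; as the commenter notes, this only works if "other" was also present as a category at training time):

```python
import pandas as pd

# Categories that existed in the training data
known = {'Male', 'Female'}

df = pd.DataFrame({'Sex': ['Male', 'Neutral', 'Female', 'Unknown']})

# Collapse any value outside the known set into a single 'other' bucket
df['Sex'] = df['Sex'].where(df['Sex'].isin(known), 'other')
print(df['Sex'].tolist())  # ['Male', 'other', 'Female', 'other']
```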

1 Answer


Yes — although you can't add a new category or feature to the model after training is done, `OneHotEncoder` can handle new categories appearing in a feature in the test data. With `handle_unknown='ignore'`, it keeps the columns consistent between your training and test data with respect to the categorical variables.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import pandas as pd
from sklearn import set_config
set_config(print_changed_only=True)
df = pd.DataFrame({'feature_1': np.random.rand(20),
                   'feature_2': np.random.choice(['male', 'female'], (20,))})
target = pd.Series(np.random.choice(['yes', 'no'], (20,)))

model = Pipeline([('preprocess',
                   ColumnTransformer([('ohe',
                                       OneHotEncoder(handle_unknown='ignore'), [1])],
                                       remainder='passthrough')),
                  ('lr', LogisticRegression())])

model.fit(df, target)

# let us introduce new categories in feature_2 in test data
test_df = pd.DataFrame({'feature_1': np.random.rand(20),
                        'feature_2': np.random.choice(['male', 'female', 'neutral', 'unknown'], (20,))})
model.predict(test_df)  # works: unseen categories are encoded as all zeros
# example output (yours will vary, since the data is random):
# array(['yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
#        'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
#        'yes', 'yes'], dtype=object)
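Since the question loads the model from a pickle file, note that the entire pipeline (preprocessing plus estimator) can be pickled, so the fitted `OneHotEncoder` travels with the model and is applied to new data at prediction time. A sketch along the lines of the answer's pipeline (file name taken from the question; the data is illustrative):

```python
import pickle
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Fit the same kind of pipeline as in the answer
df = pd.DataFrame({'feature_1': np.random.rand(20),
                   'feature_2': np.random.choice(['male', 'female'], (20,))})
target = pd.Series(['yes', 'no'] * 10)
model = Pipeline([('preprocess',
                   ColumnTransformer([('ohe',
                                       OneHotEncoder(handle_unknown='ignore'), [1])],
                                     remainder='passthrough')),
                  ('lr', LogisticRegression())])
model.fit(df, target)

# Pickle the entire pipeline, not just the estimator
with open('model_trained.sav', 'wb') as f:
    pickle.dump(model, f)

# Later, in the prediction script: preprocessing travels with the model,
# and the unseen category 'neutral' is encoded as all zeros
with open('model_trained.sav', 'rb') as f:
    loaded = pickle.load(f)
preds = loaded.predict(pd.DataFrame({'feature_1': [0.5],
                                     'feature_2': ['neutral']}))
```

This way the prediction script no longer needs the `example_df` column-alignment workaround from the question.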
  • Thanks @Venkatachalam, would it be possible to explain a bit more what the `Pipeline` function is doing, especially in combination with 'preprocess', 'ohe', and the `OneHotEncoder`? Can I assume that it creates a pipeline that automatically transforms new data into dummies, and if a value is new, it will ignore it? My real data contains numeric values and categories. Can I also assume that this function will replace the whole `pd.get_dummies()` function (that I used for preprocessing the data)? Many thanks again! – R overflow Nov 23 '20 at 12:30
  • 2
    `pipeline` is just convenient object to place all the sequence of steps that we want to apply on our dataset before fitting the final model. Please read [here](https://scikit-learn.org/stable/modules/compose.html) and [here](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html) for more explanation. Yes, you can assume that this function will replace the whole `pd.get_dummies()`. – Venkatachalam Nov 23 '20 at 12:45
  • Update: you need to give `remainder='passthrough'` for allowing the other columns to be added the output of the columnTransformer.\ – Venkatachalam Dec 03 '20 at 07:46