
I have completed training the scikit-learn model and saved it as a pickle file. Now I want to load the model and run the prediction but I don't know how to preprocess the input data.

import pandas as pd

dataset = {
    'airline': ['SpiceJet', 'Indigo', 'Air_India']
}
df = pd.DataFrame.from_dict(dataset)

The `airline` column contains three airlines, which I turn into dummy columns with this code:

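def preprocessing(df):
    # Keep all three dummy columns and prefix them with "airline_",
    # so the result matches the schema shown below.
    dummies = pd.get_dummies(df["airline"], prefix="airline")
    return dummies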

The training dataset will have a schema like this:

| airline_SpiceJet | airline_Indigo | airline_Air_India |
| ---------------- | -------------- | ----------------- |

My question is: given the input below, how can I map it to the corresponding columns?

input = {
    'airline': ['SpiceJet']
}

The expected output for the dataset:

| airline_SpiceJet | airline_Indigo | airline_Air_India |
| ---------------- | -------------- | ----------------- |
|                1 |              0 |                 0 |
desertnaut
huy
  • Is the expected output for the dataset supposed to keep count of how often each airline is present in the input? So if the input had spicejet twice in there, the column value in the output would be 2 instead of 1? – Kim Tang Jul 26 '22 at 06:13
  • @KimTang No, it should be 1. Each element in the `airline` list is a row in the dataset. So if the input has SpiceJet twice, the dataset will have two rows. – huy Jul 26 '22 at 06:16
  • You should use `OneHotEncoder` instead of `get_dummies()`. `OneHotEncoder` lets you transform your input in exactly the same way as you did in the training phase. – Alex Serra Marrugat Jul 26 '22 at 07:05

1 Answer


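I think the problem with pandas' `get_dummies()` is that it derives the dummy columns from whichever categories happen to appear in the data it is given, as described in the question "Dummy variables when not all categories are present". An input that contains only some of the training categories therefore produces fewer columns than the model expects.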
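A minimal sketch of that mismatch, reusing the airline values from the question (the variable names `train`, `new`, `train_dummies`, and `new_dummies` are only illustrative):

```python
import pandas as pd

# Training data contains all three airlines.
train = pd.DataFrame({"airline": ["SpiceJet", "Indigo", "Air_India"]})
# Prediction-time input contains only one of them.
new = pd.DataFrame({"airline": ["SpiceJet"]})

train_dummies = pd.get_dummies(train["airline"])
new_dummies = pd.get_dummies(new["airline"])

# The training frame gets one column per airline ...
print(sorted(train_dummies.columns))
# ... but the single-row input only gets a column for the airline it contains.
print(sorted(new_dummies.columns))
```

The second frame is missing the `Indigo` and `Air_India` columns, so it cannot be passed to a model trained on the first one without further alignment.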

Based on the answers there, you can adjust your code to get dummies like this:

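import pandas as pd

dataset = {
    'airline': ['SpiceJet', 'Indigo', 'Air_India']
}

input = {
    'airline': ['SpiceJet']
}

# Categories seen during training; they fix the set and order of the dummy columns.
possible_categories = dataset["airline"]

# Cast the input to a categorical with the full category list before encoding,
# so that categories missing from the input still get an all-zero column.
dummy_input = pd.Series(input["airline"])
categorical_input = dummy_input.astype(pd.CategoricalDtype(categories=possible_categories))
# dtype=int yields 0/1 rather than booleans (the default in pandas 2.0+).
print(pd.get_dummies(categorical_input, dtype=int))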

Output:

SpiceJet Indigo Air_India
1 0 0

With more input data, it could look like this:

input_2 = {
    'airline': ['SpiceJet','Indigo','SpiceJet','Indigo','Air_India']
}

dummy_input_2 = pd.Series(input_2["airline"])
# dtype=int yields 0/1 rather than booleans (the default in pandas 2.0+).
print(pd.get_dummies(dummy_input_2.astype(pd.CategoricalDtype(categories=possible_categories)), dtype=int))

Output:

SpiceJet Indigo Air_India
1 0 0
0 1 0
1 0 0
0 1 0
0 0 1
Kim Tang