
I am trying to build an inference pipeline. It consists of two parts: monthly ML model training on tabular order metadata from previous years, and daily inference on the new orders taken that day. There are several string categorical columns I want to include in my model, which I converted to integers with LabelEncoder. How can I make sure the daily inference dataset is converted into the same categories during preprocessing? Should I save the LabelEncoder's mapping as a dictionary and apply it to my inference dataset? Thanks.

Lukasz Tracewski
larui529

1 Answer


Typically you'd serialise your fitted LabelEncoder during training and load it again at inference time. You can use the pickle or joblib modules (I'd advise the latter):

import joblib

joblib.dump(label_encoder, 'label_encoder.joblib')
label_encoder = joblib.load('label_encoder.joblib')
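To connect this to the monthly training / daily inference split, here is a minimal sketch of the round trip. The sample data, the temporary file path, and the way unseen categories are set aside are all assumptions for illustration. The one behaviour it relies on is that LabelEncoder.transform raises a ValueError for labels it did not see during fit, so new categories have to be handled before calling it.

```python
import os
import tempfile

import joblib
from sklearn.preprocessing import LabelEncoder

# Training time (monthly): fit the encoder and persist it.
train_pets = ["cat", "dog", "cat", "monkey"]  # made-up sample data
label_encoder = LabelEncoder().fit(train_pets)

# Hypothetical storage location; use whatever path your pipeline shares.
path = os.path.join(tempfile.mkdtemp(), "label_encoder.joblib")
joblib.dump(label_encoder, path)

# Inference time (daily): load the encoder and reuse the same mapping.
loaded = joblib.load(path)
daily_pets = ["dog", "cat", "parrot"]  # "parrot" was never seen in training

# transform() raises ValueError on unseen labels, so separate them first.
known = set(loaded.classes_)
seen = [p for p in daily_pets if p in known]
unseen = [p for p in daily_pets if p not in known]

encoded = loaded.transform(seen)
```

What to do with the unseen values (drop the rows, map them to a reserved "unknown" category, or retrain) depends on your model and is not something the encoder decides for you.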

Since you're asking about a dict, I presume you mean packing LabelEncoders into a dictionary, one per column, something I often do with dataframes. Take this example:

import pandas
from collections import defaultdict
from sklearn import preprocessing 

df = pandas.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'], 
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'], 
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego', 
                 'New_York']
})

d = defaultdict(preprocessing.LabelEncoder)
fit = df.apply(lambda x: d[x.name].fit_transform(x))

fit now holds the encoded data. We can reverse the encoding with:

fit.apply(lambda x: d[x.name].inverse_transform(x))

To serialise a dictionary of LabelEncoders, you'd follow the same route as with a single one:

joblib.dump(d, 'label_encoder_dict.joblib')
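At inference time the dictionary can be loaded back and applied column by column. The sketch below assumes the daily dataframe has the same column names as the training one and contains only categories seen during training; the sample data and the temporary path are made up for illustration. Note that it calls transform rather than fit_transform, so the training-time mapping is reused instead of being refitted.

```python
import os
import tempfile
from collections import defaultdict

import joblib
import pandas
from sklearn import preprocessing

# Training time: fit one LabelEncoder per column and save the dictionary.
train_df = pandas.DataFrame({
    "pets": ["cat", "dog", "cat", "monkey"],
    "location": ["San_Diego", "New_York", "New_York", "San_Diego"],
})
d = defaultdict(preprocessing.LabelEncoder)
train_df.apply(lambda x: d[x.name].fit_transform(x))

# Hypothetical storage location.
path = os.path.join(tempfile.mkdtemp(), "label_encoder_dict.joblib")
joblib.dump(d, path)

# Inference time: load the encoders and apply them to the new data.
encoders = joblib.load(path)
daily_df = pandas.DataFrame({
    "pets": ["dog", "monkey"],
    "location": ["New_York", "New_York"],
})

# transform (not fit_transform) keeps the mapping learned during training.
daily_encoded = daily_df.apply(lambda x: encoders[x.name].transform(x))
```

Because d is a defaultdict, looking up a column name that was never fitted silently creates a fresh, unfitted encoder; converting it with dict(d) before saving is one way to turn that into a KeyError instead.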
Lukasz Tracewski