Using Scikit's LabelEncoder correctly across multiple programs

Question

The basic task that I have at hand is

a) Read some tab separated data.

b) Do some basic preprocessing

c) For each categorical column, use LabelEncoder to create a mapping. This is done somewhat like this:

from sklearn import preprocessing

mapper = {}
# Converting categorical data: one LabelEncoder per column
for x in categorical_list:
    mapper[x] = preprocessing.LabelEncoder()

for x in categorical_list:
    df[x] = mapper[x].fit_transform(df[x])

where df is a pandas dataframe and categorical_list is a list of column headers that need to be transformed.

d) Train a classifier and save it to disk using pickle

e) Now in a different program, the model saved is loaded.

f) The test data is loaded and the same preprocessing is performed.

g) The LabelEncoders are used to convert the categorical data.

h) The model is used to predict.

Now the question that I have is, will the step g) work correctly?

As the documentation for LabelEncoder says

It can also be used to transform non-numerical labels (as long as 
they are hashable and comparable) to numerical labels.

So will each entry hash to the exact same value every time?

If not, what is a good way to go about this? Is there any way to retrieve the mappings of the encoder? Or is there an altogether different approach from LabelEncoder?

You could just try this, but yes the idea is that the hash will be the same for the same inputs — EdChum, Feb 22 '15 at 10:55
I tried...It just dumps {}...how do i get those key value pairs?? — alphacentauri, Feb 22 '15 at 14:09

Artem Sobolev · Answer 1 · 2023-03-30T18:24:33.500

65

According to the LabelEncoder implementation, the pipeline you've described will work correctly if and only if the LabelEncoders you fit at test time see data with exactly the same set of unique values.
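To make the condition concrete, here is a minimal sketch with made-up city names (not from the question). It relies on the fact that a fitted LabelEncoder stores the sorted unique values in classes_ and uses each value's index as its code, so refitting on a different set of values can shift the codes.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical data: same unique values as training, in a different order.
train = ["paris", "tokyo", "lima", "tokyo"]
test_same = ["tokyo", "lima", "paris"]
# Hypothetical data: one training value is missing.
test_subset = ["tokyo", "paris"]

enc_train = LabelEncoder().fit(train)
enc_same = LabelEncoder().fit(test_same)
enc_subset = LabelEncoder().fit(test_subset)

# classes_ holds the sorted unique values; the code is the index into it.
classes = enc_train.classes_.tolist()

code_train = int(enc_train.transform(["tokyo"])[0])
code_same = int(enc_same.transform(["tokyo"])[0])
code_subset = int(enc_subset.transform(["tokyo"])[0])

print(classes)                             # ['lima', 'paris', 'tokyo']
print(code_train, code_same, code_subset)  # 2 2 1
```

Refitting on the subset silently assigns "tokyo" a different code, which is why reusing the training encoder is the safer route.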

There's a somewhat hacky way to reuse the LabelEncoders you got during training. LabelEncoder has only one fitted attribute, namely classes_. You can save it and then restore it like this:

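Train:

import numpy
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(X)
numpy.save('classes.npy', encoder.classes_)

Test:

import numpy
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
# allow_pickle=True is needed when classes_ has object dtype
# (e.g. strings taken from a pandas column)
encoder.classes_ = numpy.load('classes.npy', allow_pickle=True)
# Now you should be able to use encoder
# as you would do after `fit`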

This seems more efficient than refitting it using the same data.

edited Mar 30 '23 at 18:24

answered Feb 22 '15 at 14:20

Artem Sobolev


That was the first solution I thought about too. The thing is, what if I have different values for a column that I encoded before? Those unique values will not be in LabelEncoder (and also in my models). What may be the solution here? – nope May 17 '17 at 05:50
@nope: I don't see any solutions other than to just ignore this feature, and hope the model's performance would not go down significantly. – Artem Sobolev May 22 '17 at 07:26
You can create a function with a recreate option. If the dataset changes, you recreate the `classes.npy` file. – ricoms Jun 07 '18 at 16:20
@nope: you can introduce an extra class to represent the unseen values for the mapping during training, and yes, that class will not be used anywhere during training. But once you start testing, you mostly likely get some unseen values. Your encoder will be able to handle that, and simply map it to class created earlier, namely, "unseen". – Uylenburgh Nov 26 '18 at 09:49
I managed to create the file, however, during load it comes as an empty array. Any solutions to that? – Daniel Vilas-Boas Dec 01 '19 at 18:37
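The "unseen" class idea from the comments above could look roughly like the sketch below. The sentinel string, the helper name transform_with_unseen and the sample values are all hypothetical; the only library calls used are LabelEncoder.fit and LabelEncoder.transform.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sentinel; it must not collide with a real category.
UNSEEN = "<unseen>"

# Fit on the training values plus the sentinel, so it gets its own code.
train_values = ["red", "green", "blue"]
encoder = LabelEncoder().fit(train_values + [UNSEEN])


def transform_with_unseen(encoder, values):
    """Replace values missing from encoder.classes_ with the sentinel, then encode."""
    known = set(encoder.classes_)
    cleaned = [v if v in known else UNSEEN for v in values]
    return encoder.transform(cleaned)


# "purple" was never seen during training and falls back to the sentinel's code.
codes = transform_with_unseen(encoder, ["green", "purple", "red"])
print(codes.tolist())  # [2, 0, 3]
```

The model never sees the sentinel code during training, so predictions for such rows are only as good as the model's behaviour on an unfamiliar feature value.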

score 20 · Answer 2 · answered Apr 29 '19 at 00:13

For me the easiest way was to export the LabelEncoder as a .pkl file for each column. You have to export the encoder for each column after calling fit_transform().

For example

from sklearn.preprocessing import LabelEncoder
import pickle
import pandas as pd
df_train = pd.read_csv('traing_data.csv')
le = LabelEncoder()
df_train['Departure'] = le.fit_transform(df_train['Departure'])
# export the Departure encoder
with open('Departure_encoder.pkl', 'wb') as output:
    pickle.dump(le, output)

Then, in the testing project, you can load the LabelEncoder object and apply its transform() function directly:

from sklearn.preprocessing import LabelEncoder
import pandas as pd
df_test = pd.read_csv('testing_data.csv')
# load the encoder file
import pickle
with open('Departure_encoder.pkl', 'rb') as pkl_file:
    le_departure = pickle.load(pkl_file)
df_test['Departure'] = le_departure.transform(df_test['Departure'])

AttributeError: 'LabelEncoder' object has no attribute 'classes_' — Arun George, Oct 31 '19 at 02:53
@ArunGeorge I believe that my solution doesn't contain any mention to `classes_` please try it again and tell me If I can help — Shady Mohamed Sherif, Oct 31 '19 at 03:58
Given that you might have multiple columns you would like to transform...can you also put all the variables in an sklearn pipeline and then just save 1 object? — JustCurious, Jan 07 '21 at 08:35

score 4 · Answer 3 · answered Mar 14 '22 at 05:53

from sklearn.preprocessing import LabelEncoder
import joblib
import pandas as pd

df_train = pd.read_csv('traing_data.csv')
le = LabelEncoder()    
df_train['Departure'] = le.fit_transform(df_train['Departure'])

# save the fitted encoder
joblib.dump(le, 'labelEncoder.joblib', compress=9)

# load it at test time, then call le.transform(...) on the test column
le = joblib.load('labelEncoder.joblib')

osama ayman


[A code-only answer is not high quality](//meta.stackoverflow.com/questions/392712/explaining-entirely-code-based-answers). While this code may be useful, you can improve it by saying why it works, how it works, when it should be used, and what its limitations are. Please [edit] your answer to include explanation and link to relevant documentation. – Muhammad Mohsin Khan Mar 15 '22 at 15:26

score 3 · Answer 4 · answered Sep 23 '16 at 11:05

What works for me is LabelEncoder().fit(X_train[col]), pickling these objects for each categorical column col and then reusing the same objects for transforming the same categorical column col in the validation dataset. Basically you have a label encoder object for each of your categorical columns.

So fit() on training data and pickle the objects/models corresponding to each column in the training dataframe X_train.
For each col in columns of validation set X_cv, load the corresponding object/model and apply the transformation by accessing the transform function as: transform(X_cv[col]).
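A minimal sketch of this per-column approach follows. The data frames, the column names and the file path are made up for illustration; the encoders are stored in one dictionary so a single pickle file covers all columns.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical frames standing in for X_train and X_cv.
X_train = pd.DataFrame({"city": ["lima", "paris", "tokyo"], "size": ["S", "M", "L"]})
X_cv = pd.DataFrame({"city": ["tokyo", "lima"], "size": ["L", "S"]})
categorical_cols = ["city", "size"]

# Training program: fit one encoder per column and pickle the whole dict.
encoders = {col: LabelEncoder().fit(X_train[col]) for col in categorical_cols}
path = os.path.join(tempfile.gettempdir(), "label_encoders.pkl")  # illustrative path
with open(path, "wb") as f:
    pickle.dump(encoders, f)

# Validation program: load the dict and call transform (never fit) per column.
with open(path, "rb") as f:
    loaded = pickle.load(f)
for col, enc in loaded.items():
    X_cv[col] = enc.transform(X_cv[col])

print(X_cv["city"].tolist(), X_cv["size"].tolist())  # [2, 0] [0, 2]
```

Note that transform raises a ValueError if the validation data contains a value the encoder did not see during fit.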

score 0 · Answer 5 · edited Oct 21 '20 at 16:46

As I found no other post about nominal/categorical encoding, I expand on the above-mentioned solutions and share mine, which uses OrdinalEncoder (and may have been what the author intended anyway).

I did the following with OrdinalEncoder (it should work with LabelEncoder as well). Note that I am using categories_ instead of classes_.

1. Create an encoder dictionary
2. Save it with numpy
3. Load it with numpy
4. Iterate over the dict and apply the transformation to each column

Note: np stands for numpy.

# ------- steps 1 and 2, in the file/cell where the encoding is exported
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

encoder_dict = dict()

for nom in nominal_columns:
    enc = OrdinalEncoder()  # a fresh encoder for each nominal column
    enc.fit(df[[nom]])
    df[[nom]] = enc.transform(df[[nom]])
    encoder_dict[nom] = [[str(cat) for cat in sublist] for sublist in enc.categories_]

np.save('FILE_NAME.npy', encoder_dict)

# ------------ steps 3 and 4, in the file where the encoding is imported
def apply_saved_encoding(df):  # wrapper name is illustrative
    enc = OrdinalEncoder()
    encoder_dict = np.load('FILE_NAME.npy', allow_pickle=True).tolist()

    for nom in encoder_dict:
        if nom in df.columns:
            # depending on the scikit-learn version, transform may need
            # further fitted attributes besides categories_
            enc.categories_ = encoder_dict[nom]
            df[[nom]] = enc.transform(df[[nom]])
    return df

I did this for OneHotEncoder but I have error: AttributeError: 'OneHotEncoder' object has no attribute 'drop_idx_' — nino, Aug 24 '21 at 10:20

score 0 · Answer 6 · answered Mar 11 '21 at 13:17

If you're already saving your model via pickle, I would do the same for the pre-processing tools.

One way to do it would be combining everything into a class:

class MyClassifier():
    def load_data(self):
        ...
    def fit(self):
        self.first_column_encoder = preprocessing.LabelEncoder()
        self.first_column_encoder.fit(...)
        ...
        self.second_column_encoder = preprocessing.LabelEncoder()
        self.second_column_encoder.fit(...)
        ...
        self.model = KNeighborsClassifier(...)
        self.model.fit(...)

my_classifier = MyClassifier()
my_classifier.fit()

pickle.dump(my_classifier, file)

Note: You may want to use OrdinalEncoder instead of LabelEncoder for input categories
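As an alternative sketch of the same "save everything together" idea, the encoders and the model can live in one scikit-learn Pipeline, so a single pickle carries both. This swaps in OrdinalEncoder inside a ColumnTransformer rather than LabelEncoder; the toy data, column names and n_neighbors=1 are assumptions made only for illustration.

```python
import pickle

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data; the column names are made up.
train = pd.DataFrame({
    "city": ["lima", "paris", "tokyo", "lima"],
    "size": ["S", "M", "L", "S"],
    "label": [0, 1, 1, 0],
})
categorical_cols = ["city", "size"]

# Encoding and classification are fitted as one object.
pipeline = Pipeline([
    ("encode", ColumnTransformer([("cat", OrdinalEncoder(), categorical_cols)])),
    ("model", KNeighborsClassifier(n_neighbors=1)),
])
pipeline.fit(train[categorical_cols], train["label"])

# One pickle holds the fitted encoders and the classifier.
# (pickle.dump to a file would be used between two real programs.)
blob = pickle.dumps(pipeline)

# Second program: unpickle and predict directly on raw categorical data.
restored = pickle.loads(blob)
test = pd.DataFrame({"city": ["tokyo"], "size": ["L"]})
prediction = restored.predict(test)
print(prediction.tolist())  # [1]
```

Because the pipeline applies the stored encoding inside predict, the second program never has to touch the encoders itself.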

score -2 · Answer 7 · answered May 20 '20 at 15:59

You can do this after you have encoded the values with the "le" object:

encoding = {}
for i in list(le.classes_):
    encoding[i]=le.transform([i])[0]

You will get the "encoding" dictionary with the encoding for later use. With pandas, for example, you can export this dictionary to a CSV file.
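To use such a dictionary from a different program it has to be written to disk and read back. The sketch below uses JSON instead of CSV and assumes the labels are strings; the sample labels are made up, and the JSON string stands in for a file on disk.

```python
import json

from sklearn.preprocessing import LabelEncoder

# Hypothetical fitted encoder standing in for the "le" object.
le = LabelEncoder().fit(["lima", "paris", "tokyo"])

# Build the value -> code mapping; cast NumPy scalars to plain Python types for JSON.
encoding = {str(label): int(code) for code, label in enumerate(le.classes_)}
serialized = json.dumps(encoding)  # in practice, write this string to a file

# Other program: read the mapping back and apply it without scikit-learn.
mapping = json.loads(serialized)
codes = [mapping[value] for value in ["tokyo", "lima"]]
print(codes)  # [2, 0]
```

A plain dictionary lookup raises KeyError on values that were not present at training time, so unseen categories still need separate handling.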

geniolius

This doesn't work because OP's step e) explicitly says "in a different program". – Steven Jan 04 '21 at 16:14
