35

The basic task that I have at hand is

a) Read some tab separated data.

b) Do some basic preprocessing

c) For each categorical column, use LabelEncoder to create a mapping. This is done roughly as follows:

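from sklearn import preprocessing

# One LabelEncoder per categorical column
mapper = {}
for x in categorical_list:
    mapper[x] = preprocessing.LabelEncoder()

# Fit each encoder on its column and replace the column with integer codes
for x in categorical_list:
    df[x] = mapper[x].fit_transform(df[x])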

where df is a pandas dataframe and categorical_list is a list of column headers that need to be transformed.

d) Train a classifier and save it to disk using pickle

e) Now in a different program, the model saved is loaded.

f) The test data is loaded and the same preprocessing is performed.

g) The LabelEncoders are used to convert the categorical data.

h) The model is used to predict.

Now the question that I have is, will the step g) work correctly?

As the documentation for LabelEncoder says

It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels.

So will each entry hash to the exact same value every time?

If not, what is a good way to go about this? Is there any way to retrieve the mappings of the encoder? Or is there an altogether different approach than LabelEncoder?

alphacentauri

7 Answers

65

According to the LabelEncoder implementation, the pipeline you've described will work correctly if and only if the LabelEncoders you fit at test time see data with exactly the same set of unique values as the training data.
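
To illustrate the point, here is a minimal sketch. It assumes the behaviour of current scikit-learn releases, where fit stores the sorted unique values in classes_ and each code is simply the position in that array, so no hashing is involved. The city names are made-up sample data.

```python
from sklearn.preprocessing import LabelEncoder

train_values = ["paris", "berlin", "rome"]
test_values = ["rome", "paris"]  # "berlin" does not occur at test time

# Fitting on the training data: classes_ holds the sorted unique values.
train_enc = LabelEncoder().fit(train_values)
train_classes = list(train_enc.classes_)
train_code_rome = int(train_enc.transform(["rome"])[0])

# Refitting on the test data alone produces a different mapping for "rome".
test_enc = LabelEncoder().fit(test_values)
refit_code_rome = int(test_enc.transform(["rome"])[0])

# Reusing the training encoder keeps the mapping stable.
reused_code_rome = int(train_enc.transform(test_values)[0])

print(train_classes, train_code_rome, refit_code_rome, reused_code_rome)
```

So the codes are deterministic for a given set of values, but they shift as soon as the set of unique values differs, which is why the fitted encoder should be reused rather than refitted.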

There's a somewhat hacky way to reuse the LabelEncoders you fitted during training. A fitted LabelEncoder has only one learned attribute, namely classes_. You can save it and then restore it like this:

Train:

import numpy
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(X)
numpy.save('classes.npy', encoder.classes_)

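Test:

import numpy
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
# allow_pickle=True is required when classes_ is an object array (e.g. strings from a pandas column)
encoder.classes_ = numpy.load('classes.npy', allow_pickle=True)
# Now you should be able to use encoder
# as you would do after `fit`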

This seems more efficient than refitting it using the same data.
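
To check the round trip, the following sketch saves and reloads classes_ in a temporary directory and confirms that both encoders agree. The sample values and the temporary path are illustrative only. As far as I know, allow_pickle=True is only needed when classes_ has object dtype, which is typical for string columns read with pandas.

```python
import os
import tempfile

import numpy as np
from sklearn.preprocessing import LabelEncoder

# Object dtype mimics a string column taken from a pandas DataFrame.
values = np.array(["red", "green", "blue", "green"], dtype=object)

encoder = LabelEncoder()
encoder.fit(values)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "classes.npy")
    np.save(path, encoder.classes_)
    # Restore into a fresh, unfitted encoder.
    restored = LabelEncoder()
    restored.classes_ = np.load(path, allow_pickle=True)

original_codes = encoder.transform(values).tolist()
restored_codes = restored.transform(values).tolist()
print(original_codes, restored_codes)
```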

Artem Sobolev
  • That was the first solution I thought about too. The thing is, what if I have different values for a column that I encoded before? Those unique values will not be in LabelEncoder (and also in my models). What may be the solution here? – nope May 17 '17 at 05:50
  • @nope: I don't see any solutions other than to just ignore this feature, and hope the model's performance would not go down significantly. – Artem Sobolev May 22 '17 at 07:26
  • You can create a function with a recreate option. If the dataset changes, you recreate the `classes.npy` file. – ricoms Jun 07 '18 at 16:20
  • @nope: you can introduce an extra class to represent unseen values in the mapping during training, and yes, that class will not be used anywhere during training. But once you start testing, you will most likely get some unseen values. Your encoder will be able to handle them and simply map them to the class created earlier, namely "unseen". – Uylenburgh Nov 26 '18 at 09:49
  • I managed to create the file, however, during load it comes as an empty array. Any solutions to that? – Daniel Vilas-Boas Dec 01 '19 at 18:37
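
The reserved-class idea from the comments could be sketched roughly as follows. This is one possible approach and not a feature of LabelEncoder itself: the placeholder string "<unseen>" and the helper transform_with_unseen are hypothetical names chosen for illustration.

```python
from sklearn.preprocessing import LabelEncoder

UNSEEN = "<unseen>"  # hypothetical placeholder; must not collide with real values


def transform_with_unseen(encoder, values):
    """Replace values missing from encoder.classes_ with UNSEEN, then encode."""
    known = set(encoder.classes_)
    cleaned = [v if v in known else UNSEEN for v in values]
    return encoder.transform(cleaned)


train_values = ["cat", "dog", "bird"]

# Fit on the training values plus the reserved placeholder class.
encoder = LabelEncoder().fit(train_values + [UNSEEN])

# "horse" was never seen during training, so it gets the placeholder's code.
codes = transform_with_unseen(encoder, ["dog", "horse", "cat"]).tolist()
print(codes)
```

The model never sees the placeholder code during training, so how well it copes with it at test time depends on the classifier.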
20

For me, the easiest way was to export the LabelEncoder for each column as a .pkl file. You have to export each column's encoder after calling fit_transform() on it.

For example:

from sklearn.preprocessing import LabelEncoder
import pickle
import pandas as pd

df_train = pd.read_csv('traing_data.csv')
le = LabelEncoder()
df_train['Departure'] = le.fit_transform(df_train['Departure'])

# Export the fitted Departure encoder
with open('Departure_encoder.pkl', 'wb') as output:
    pickle.dump(le, output)

Then, in the testing project, you can load the LabelEncoder object and apply its transform() function directly:

import pickle
import pandas as pd

df_test = pd.read_csv('testing_data.csv')

# Load the encoder that was fitted on the training data
with open('Departure_encoder.pkl', 'rb') as pkl_file:
    le_departure = pickle.load(pkl_file)

df_test['Departure'] = le_departure.transform(df_test['Departure'])

Shady Mohamed Sherif
4
from sklearn.preprocessing import LabelEncoder
import joblib
import pandas as pd

df_train = pd.read_csv('traing_data.csv')
le = LabelEncoder()    
df_train['Departure'] = le.fit_transform(df_train['Departure'])

# Save the fitted encoder
joblib.dump(le, 'labelEncoder.joblib', compress=9)

# Load it again at test time
le = joblib.load('labelEncoder.joblib')
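
To show the loaded encoder in use, here is a self-contained sketch that round-trips an encoder through joblib in a temporary directory and then calls transform and inverse_transform on it. The airport codes are invented sample data, and the temporary path is only there so the example cleans up after itself.

```python
import os
import tempfile

import joblib
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["JFK", "LAX", "ORD"])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "labelEncoder.joblib")
    joblib.dump(le, path, compress=9)
    loaded = joblib.load(path)

# The loaded encoder carries the training-time mapping, so no refit is needed.
codes = loaded.transform(["ORD", "JFK"]).tolist()
labels = loaded.inverse_transform(codes).tolist()
print(codes, labels)
```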
  • [A code-only answer is not high quality](//meta.stackoverflow.com/questions/392712/explaining-entirely-code-based-answers). While this code may be useful, you can improve it by saying why it works, how it works, when it should be used, and what its limitations are. Please [edit] your answer to include explanation and link to relevant documentation. – Muhammad Mohsin Khan Mar 15 '22 at 15:26
3

What works for me is LabelEncoder().fit(X_train[col]), pickling these objects for each categorical column col and then reusing the same objects for transforming the same categorical column col in the validation dataset. Basically you have a label encoder object for each of your categorical columns.

  1. So fit() on training data and pickle the objects/models corresponding to each column in the training dataframe X_train.
  2. For each col in columns of validation set X_cv, load the corresponding object/model and apply the transformation by accessing the transform function as: transform(X_cv[col]).
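
The two steps above can be sketched like this, assuming all encoders are kept in one dictionary keyed by column name. The small DataFrames and the column names are made up for illustration, and pickle.dumps/pickle.loads stand in for writing to and reading from a file.

```python
import pickle

import pandas as pd
from sklearn.preprocessing import LabelEncoder

X_train = pd.DataFrame({"color": ["red", "blue", "red"], "size": ["S", "M", "L"]})
X_cv = pd.DataFrame({"color": ["blue", "red"], "size": ["L", "S"]})
categorical_cols = ["color", "size"]

# Step 1: fit one encoder per column on the training data and serialize them.
encoders = {col: LabelEncoder().fit(X_train[col]) for col in categorical_cols}
payload = pickle.dumps(encoders)  # pickle.dump(encoders, f) would write to a file

# Step 2: restore the encoders and transform the validation columns.
restored = pickle.loads(payload)
for col in categorical_cols:
    X_cv[col] = restored[col].transform(X_cv[col])

print(X_cv)
```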
wannabe_nerd
0

As I found no other post about nominal/categorical encoding, I will expand on the solutions above and share my OrdinalEncoder approach (which may be what the author intended anyway).

I did the following with OrdinalEncoder (it should work with LabelEncoder as well). Note that I am using categories_ instead of classes_.

  1. Create an Encoder dictionary
  2. Save it with numpy
  3. Load it with numpy
  4. Iterate over the dict and apply the transformation on each column

Note: np stands for numpy.

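# ------- steps 1 and 2: in the file/cell where the encoding is exported

import numpy as np
from sklearn.preprocessing import OrdinalEncoder

encoder_dict = dict()

for nom in nominal_columns:
    enc = OrdinalEncoder()  # a fresh encoder for each column
    enc = enc.fit(df[[nom]])
    df[[nom]] = enc.transform(df[[nom]])
    # Store the learned categories as plain strings
    encoder_dict[nom] = [[str(cat) for cat in sublist] for sublist in enc.categories_]

np.save('FILE_NAME.npy', encoder_dict)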




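# ------- steps 3 and 4: in the file where the encoding is imported

import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# apply_saved_encoding is a placeholder name; the original shows only the function body
def apply_saved_encoding(df):
    enc = OrdinalEncoder()
    encoder_dict = np.load('FILE_NAME.npy', allow_pickle=True).tolist()

    for nom in encoder_dict:
        if nom in df.columns:
            # Setting categories_ by hand bypasses fit(); some scikit-learn
            # versions expect other fitted attributes too, so this may raise
            enc.categories_ = encoder_dict[nom]
            df[[nom]] = enc.transform(df[[nom]])
    return df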
Dharman
Createdd
  • I did this for OneHotEncoder but I have error: AttributeError: 'OneHotEncoder' object has no attribute 'drop_idx_' – nino Aug 24 '21 at 10:20
0

If you're already saving your model via pickle, I would do the same for the pre-processing tools.

One way to do it would be combining everything into a class:

import pickle
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier

class MyClassifier():
    def load_data(self):
        ...
    def fit(self):
        self.first_column_encoder = preprocessing.LabelEncoder()
        self.first_column_encoder.fit(...)
        ...
        self.second_column_encoder = preprocessing.LabelEncoder()
        self.second_column_encoder.fit(...)
        ...
        self.model = KNeighborsClassifier(...)
        self.model.fit(...)

my_classifier = MyClassifier()
my_classifier.fit()

# file must be a file object opened for binary writing
pickle.dump(my_classifier, file)

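Note: You may want to use OrdinalEncoder instead of LabelEncoder for input categories, since LabelEncoder is meant for target labels (y) while OrdinalEncoder is designed for feature columns.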
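
As a variation on the same idea, the sketch below swaps the hand-written class for scikit-learn's own Pipeline and ColumnTransformer, so that the fitted OrdinalEncoder and the classifier end up in a single picklable object. It assumes scikit-learn 0.24 or newer (for handle_unknown="use_encoded_value"); the toy data, column names and n_neighbors=1 are made up for illustration.

```python
import pickle

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({
    "color": ["red", "blue", "red", "blue"],
    "weight": [1.0, 5.0, 1.2, 4.8],
    "label": [0, 1, 0, 1],
})

# Encode the categorical column; unseen categories become -1 at predict time.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
preprocess = ColumnTransformer([("cat", encoder, ["color"])], remainder="passthrough")
model = Pipeline([("preprocess", preprocess), ("knn", KNeighborsClassifier(n_neighbors=1))])
model.fit(train[["color", "weight"]], train["label"])

# One pickle now carries both the fitted encoder and the classifier.
restored = pickle.loads(pickle.dumps(model))

# "green" never appeared in training and is encoded as -1.
test = pd.DataFrame({"color": ["red", "green"], "weight": [1.1, 5.1]})
predictions = restored.predict(test).tolist()
print(predictions)
```

The test program then only needs to unpickle one object and call predict on the raw columns.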

Marko Knöbl
-2

You can do this after you have encoded the values with the "le" object:

encoding = {}
for i in list(le.classes_):
    encoding[i]=le.transform([i])[0]

This gives you the "encoding" dictionary, which maps each label to its code for later use. With pandas you can, for example, export this dictionary to a CSV file.
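
A minimal sketch of that export, assuming the codes are simply the positions in the sorted classes_ array (which is how LabelEncoder assigns them, as far as I can tell). The labels are sample data, and an in-memory buffer stands in for a real CSV file; the column names "label" and "code" are my own choice.

```python
import io

import pandas as pd
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder().fit(["low", "high", "medium", "low"])

# classes_ is sorted, and each value's code is its position in that array.
encoding = {label: code for code, label in enumerate(le.classes_.tolist())}

# Write the mapping as CSV (to an in-memory buffer here; a file path also works).
buffer = io.StringIO()
pd.Series(encoding, name="code").rename_axis("label").to_csv(buffer)
csv_text = buffer.getvalue()

# Read it back and rebuild the dictionary.
restored = pd.read_csv(io.StringIO(csv_text), index_col="label")["code"].to_dict()
print(encoding, restored)
```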

geniolius