return the labels and their encoded values in sklearn LabelEncoder

Question

I'm using LabelEncoder and OneHotEncoder from sklearn in a machine learning project to encode the labels (country names) in the dataset. Everything works well and my model runs fine. The project is to classify whether a bank customer will stay with or leave the bank based on a number of features, including the customer's country.

My issue arises when I want to predict (classify) a single new customer. The data for the new customer has not been pre-processed yet (i.e., country names are not encoded). Something like the following:

new_customer = np.array([['France', 600, 'Male', 40, 3, 60000, 2, 1,1, 50000]])

In the online course where I am learning machine learning, the instructor opened the pre-processed dataset that included the encoded data, manually looked up the code for France, and substituted it into new_customer, as follows:

new_customer = np.array([[0, 0, 600, 'Male', 40, 3, 60000, 2, 1,1, 50000]])

I don't think this is practical. There must be a way to automatically encode France to the same code used in the original dataset, or at least a way to return a list of the countries and their encoded values. Manually encoding a label seems tedious and error-prone. So how can I automate this process, or generate the codes for the labels? Thanks in advance.

you may want to check [this answer](https://stackoverflow.com/a/30267328/5741205) — MaxU - stand with Ukraine, Feb 23 '18 at 00:15

Brad Solomon · Accepted Answer · 2020-01-24T12:17:31.897

It seems like you may be looking for the .transform() method of your estimator.

>>> from sklearn.preprocessing import LabelEncoder

>>> c = ['France', 'UK', 'US', 'US', 'UK', 'China', 'France']
>>> enc = LabelEncoder().fit(c)
>>> encoded = enc.transform(c)
>>> encoded
array([1, 2, 3, 3, 2, 0, 1])

>>> enc.transform(['France'])
array([1])

This takes the "mapping" that was learned when you called fit(c) and applies it to new data (in this case, a new label). You can see this mapping in reverse:

>>> enc.inverse_transform(encoded)
array(['France', 'UK', 'US', 'US', 'UK', 'China', 'France'], dtype='<U6')
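To list every label together with its code, one option is to read the fitted encoder's classes_ attribute. This is a minimal sketch: the names countries and mapping are made up for illustration, and it relies on LabelEncoder storing the distinct labels in sorted order, so that each label's position in classes_ is its integer code.

```python
from sklearn.preprocessing import LabelEncoder

countries = ['France', 'UK', 'US', 'US', 'UK', 'China', 'France']
enc = LabelEncoder().fit(countries)

# classes_ holds the distinct labels in sorted order; the index of each
# label in that array is the integer code that transform() returns for it.
mapping = {str(label): code for code, label in enumerate(enc.classes_)}

print(mapping)
# {'China': 0, 'France': 1, 'UK': 2, 'US': 3}
```

The same dictionary can be used to check by eye that a new label such as 'France' gets the code used in the original dataset.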

As mentioned by the answer here, if you want to do this between Python sessions, you could serialize the estimator to disk like this:

import pickle

with open('enc.pickle', 'wb') as file:
    pickle.dump(enc, file, pickle.HIGHEST_PROTOCOL)

Then load this in a new session and transform incoming data with it.
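A sketch of the full round trip, for illustration only: the temporary directory and the name loaded_enc are assumptions, and in practice the dump and the load would live in two separate scripts.

```python
import os
import pickle
import tempfile

from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder().fit(['France', 'UK', 'US', 'China'])

# Session 1: write the fitted encoder to disk (a temp dir is used here
# only so the example does not leave files behind in the working directory).
path = os.path.join(tempfile.mkdtemp(), 'enc.pickle')
with open(path, 'wb') as file:
    pickle.dump(enc, file, pickle.HIGHEST_PROTOCOL)

# Session 2: load the encoder and apply the already-learned mapping.
with open(path, 'rb') as file:
    loaded_enc = pickle.load(file)

codes = loaded_enc.transform(['France', 'US'])
print(codes)
# [1 3]
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.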

Note that the call should be enc.transform(['France']), with "enc" rather than "encoded": "encoded" is an array and has no transform method. — Idodo, Jun 03 '21 at 12:51

Learning is a mess · Answer 2 · 2018-02-23T01:28:45.013

In machine learning it is customary to keep the fitted preprocessing pipeline in memory so that, after picking the hyperparameters and training the model, you can apply the same preprocessing to the test data.

If everything runs in the same Python process, as is common for small and medium-sized projects, this simply means keeping a reference to your fitted LabelEncoder so that it is not garbage collected. If training and testing run in different processes, I think the easiest solution is to store the encoder on disk and load it in the testing script.
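One way to keep preprocessing and model together is scikit-learn's Pipeline with a ColumnTransformer. The sketch below is an illustration under assumptions, not the asker's actual setup: the toy data, the column names country and age, and the choice of LogisticRegression are all made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy training data (hypothetical columns and values).
train = pd.DataFrame({
    "country": ["France", "Spain", "Germany", "France", "Spain", "Germany"],
    "age": [40, 35, 50, 28, 61, 45],
})
exited = [0, 1, 1, 0, 1, 0]

# One-hot encode the country column and pass the other columns through.
preprocess = ColumnTransformer(
    [("country", OneHotEncoder(handle_unknown="ignore"), ["country"])],
    remainder="passthrough",
)
model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])
model.fit(train, exited)

# The new customer is passed in raw form; the fitted pipeline applies the
# same encoding that was learned from the training data.
new_customer = pd.DataFrame({"country": ["France"], "age": [40]})
encoded_row = model.named_steps["preprocess"].transform(new_customer)
prediction = model.predict(new_customer)
```

With this arrangement the country name never has to be looked up or encoded by hand, because the encoder travels with the model.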

I advise you to use pickle. Here is an example.

score 0 · Answer 3 · answered Nov 07 '21 at 15:02

The problem is you didn't encode the country attribute of your dataset.

from numpy import array
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
# define example
data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']
values = array(data)
print(values)
# integer encode: classes are sorted, so cold=0, hot=1, warm=2
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
print(integer_encoded)
# binary (one-hot) encode; scikit-learn < 1.2 names this parameter sparse=False
onehot_encoder = OneHotEncoder(sparse_output=False)
# OneHotEncoder expects a 2-D array, so reshape to a single column
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded)

Output:

['cold' 'cold' 'warm' 'cold' 'hot' 'hot' 'warm' 'cold' 'warm' 'hot']
[0 0 2 0 1 1 2 0 2 1]
[[1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]

For your problem, replace data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot'] with your dataset's country column. Then choose either the integer or the binary (one-hot) encoding and continue with training.
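Once the encoders are fitted, a new value can be encoded by calling transform on the same objects rather than fitting again. This is a rough sketch using the weather labels from the example above; the names new_integer and new_onehot are made up, and OneHotEncoder is left at its default sparse output so the snippet does not depend on a version-specific parameter name.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

data = np.array(['cold', 'cold', 'warm', 'cold', 'hot',
                 'hot', 'warm', 'cold', 'warm', 'hot'])

# Fit both encoders once, on the original data.
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(data)
onehot_encoder = OneHotEncoder()
onehot_encoder.fit(integer_encoded.reshape(-1, 1))

# Encode a new value with transform() only, reusing the learned mapping.
new_integer = label_encoder.transform(['hot'])
new_onehot = onehot_encoder.transform(new_integer.reshape(-1, 1)).toarray()

print(new_integer)  # [1]
print(new_onehot)   # [[0. 1. 0.]]
```

Calling fit_transform on the new value instead would learn a fresh mapping from that single label and would not match the codes used in training.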
