label-encoder encoding missing values

Question

I am using the label encoder to convert categorical data into numeric values.

How does LabelEncoder handle missing values?

from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
a = pd.DataFrame(['A','B','C',np.nan,'D','A'])
le = LabelEncoder()
le.fit_transform(a)

Output:

array([1, 2, 3, 0, 4, 1])

For the above example, label encoder changed NaN values to a category. How would I know which category represents missing values?

https://stackoverflow.com/a/60186800/10375049 – Marco Cerliani Mar 10 '21 at 10:38

score 23 · Accepted Answer · answered Apr 23 '16 at 17:52

Don't use LabelEncoder with missing values. I don't know which version of scikit-learn you're using, but in 0.17.1 your code raises TypeError: unorderable types: str() > float().

As you can see in the source, it calls numpy.unique on the data to encode, which raises a TypeError if missing values are found. If you want to encode missing values, first convert them to a string:

a[pd.isnull(a)]  = 'NaN'
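
As a minimal sketch of this approach (the Series `s` and the variable names are illustrative, not taken from the question), you can fill the gaps with a placeholder string and then ask the fitted encoder which code that placeholder received. That also shows how to tell which category stands for the missing values:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Example data with one missing value (illustrative)
s = pd.Series(['A', 'B', 'C', np.nan, 'D', 'A'])

# Replace missing values with a placeholder string before encoding
filled = s.fillna('NaN')

le = LabelEncoder()
codes = le.fit_transform(filled)

# Look up the code that was assigned to the placeholder
nan_code = le.transform(['NaN'])[0]

print(list(le.classes_))  # ['A', 'B', 'C', 'D', 'NaN']
print(list(codes))        # [0, 1, 2, 4, 3, 0]
print(nan_code)           # 4
```

Because `classes_` is sorted, the placeholder's code depends on the other labels in the column, so it is safer to look it up than to assume a fixed value.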

dukebody

So you would be coding 'NaN' as a dummy value? I have the same issue but want to use the imputed value for linear regression. – Scott Davis Nov 05 '16 at 22:16
The model treats a missing value (NaN) and the string "NaN" differently. A workaround is to use LabelEncoder on the non-missing values only and leave the NaN values untouched: df['col'] = df['col'].map(lambda x: le.transform([x])[0] if type(x)==str else x) – Chau Pham Apr 16 '19 at 09:28

score 12 · Answer 2 · edited Aug 16 '19 at 11:13

You can also use a mask to restore the missing values from the original data frame after labelling:

df = pd.DataFrame({'A': ['x', np.nan, 'z'], 'B': [1, 6, 9], 'C': [2, 1, np.nan]})

    A   B   C
0   x   1   2.0
1   NaN 6   1.0
2   z   9   NaN

original = df
mask = df.isnull()

       A      B      C
0  False  False  False
1   True  False  False
2  False  False   True

df = df.astype(str).apply(LabelEncoder().fit_transform)
df.where(~mask, original)

     A  B    C
0  1.0  0  1.0
1  NaN  1  0.0
2  2.0  2  NaN

score 6 · Answer 3 · edited Dec 08 '17 at 04:29

Here is a little computational hack I used in my own work:

from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
a = pd.DataFrame(['A','B','C',np.nan,'D','A'])
le = LabelEncoder()
### fit with the desired col, col in position 0 for this example
fit_by = pd.Series([i for i in a.iloc[:,0].unique() if type(i) == str])
le.fit(fit_by)
### Set transformed col leaving np.NaN as they are
a["transformed"] = a.iloc[:,0].apply(lambda x: le.transform([x])[0] if type(x) == str else x)

answered May 10 '17 at 18:01

Kerem T

`fit_by` is a list, lists don't have an `.apply()` method, please correct – gboffi Dec 07 '17 at 12:42
Maybe just a typo in his answer. Use apply( function, axis=1) or map. ex: df['col'] = df['col'].map(lambda x: le.transform([x])[0] if type(x)==str else x) – Chau Pham Apr 16 '19 at 09:29

score 5 · Answer 4 · edited Jun 07 '20 at 00:10

This is my solution, because I was not pleased with the solutions posted here. I needed a LabelEncoder that keeps my missing values as NaN to use an Imputer afterwards. So I have written my own LabelEncoder class. It works with DataFrames.

from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin
from sklearn.preprocessing import LabelEncoder

class LabelEncoderByCol(BaseEstimator, TransformerMixin):
    def __init__(self,col):
        #List of column names in the DataFrame that should be encoded
        self.col = col
        #Dictionary storing a LabelEncoder for each column
        self.le_dic = {}
        for el in self.col:
            self.le_dic[el] = LabelEncoder()

    def fit(self,x,y=None):
        #Fill missing values with the string 'NaN'
        x[self.col] = x[self.col].fillna('NaN')
        for el in self.col:
            #Only use the values that are not 'NaN' to fit the Encoder
            a = x[el][x[el]!='NaN']
            self.le_dic[el].fit(a)
        return self

    def transform(self,x,y=None):
        #Fill missing values with the string 'NaN'
        x[self.col] = x[self.col].fillna('NaN')
        for el in self.col:
            #Only transform the values that are not 'NaN'
            a = x[el][x[el]!='NaN']
            #Store an ndarray of the current column
            b = x[el].to_numpy()
            #Replace the elements in the ndarray that are not 'NaN'
            #using the transformer
            b[b!='NaN'] = self.le_dic[el].transform(a)
            #Overwrite the column in the DataFrame
            x[el]=b
        #return the transformed DataFrame
        return x

You can pass in a DataFrame, not only a 1-dimensional Series. With col you can choose the columns that should be encoded.

I would like to hear some feedback.

I used `newdf = LabelEncoderByCol(df)` - now how do I convert it to pandas? — Vasim, Jan 31 '19 at 05:31
@whatsnext It has the same behaviour as the original LabelEncoder with unseen values. You might need some extra lines to achieve decent behaviour. Refer to: https://stackoverflow.com/questions/21057621/sklearn-labelencoder-with-never-seen-before-values If you edit my answer to improve it, I will approve. — Niclas von Caprivi, Feb 18 '20 at 10:18
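
Regarding the comment asking how to get a DataFrame back: the encoder has to be constructed with the column names and then fitted and applied to the DataFrame, rather than being called on the DataFrame directly. The sketch below illustrates that call pattern with a compact, hypothetical variant of the class (`NanPreservingEncoder` and the sample data are assumptions for illustration, not the author's code). Unlike the class above, it works on a copy and writes real NaN values back:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder


class NanPreservingEncoder:
    """Hypothetical compact variant of a per-column label encoder."""

    def __init__(self, cols):
        self.cols = cols      # names of the columns to encode
        self.encoders = {}    # one fitted LabelEncoder per column

    def fit(self, X, y=None):
        for c in self.cols:
            # Fit only on the values that are actually present
            self.encoders[c] = LabelEncoder().fit(X[c].dropna())
        return self

    def transform(self, X):
        out = X.copy()  # leave the caller's DataFrame untouched
        for c in self.cols:
            present = out[c].notna()
            codes = self.encoders[c].transform(out.loc[present, c])
            # Start from an all-NaN float column and fill in the codes
            encoded = pd.Series(np.nan, index=out.index, dtype=float)
            encoded[present] = codes
            out[c] = encoded
        return out


df = pd.DataFrame({"city": ["Rome", np.nan, "Oslo", "Rome"], "n": [1, 2, 3, 4]})

encoder = NanPreservingEncoder(cols=["city"]).fit(df)
newdf = encoder.transform(df)   # newdf is a regular pandas DataFrame
print(newdf)
```

The result of `transform` is already a DataFrame, so no further conversion is needed before passing it to an imputer.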

score 3 · Answer 5 · edited Dec 19 '20 at 09:41

I want to share my solution with you.
I created a module that takes a mixed dataset and converts it from categorical to numerical, and back again.

This module is also available on my GitHub, well organized and with an example.
Please upvote if you like my solution.

Thanks, Idan

import numpy as np
import pandas as pd


class label_encoder_contain_missing_values:

    def categorical_to_numeric(self, dataset):
        self.dataset = dataset
        self.summary = None
        self.table_encoder = {}

        for index in self.dataset.columns:
            # Only encode columns that hold strings/objects
            if self.dataset[index].dtypes == 'object':
                column_data_frame = pd.Series(self.dataset[index], name='column').to_frame()
                unique_values = self.dataset[index].unique()
                # Lookup table: each unique value gets its position as its code,
                # missing values keep np.nan as their code
                # (DataFrame.append was removed in pandas 2.0, so build the rows first)
                rows = [{'value_name': value, 'Encode': np.nan if pd.isnull(value) else i}
                        for i, value in enumerate(unique_values)]
                label_encoder = pd.DataFrame(rows, columns=['value_name', 'Encode'])

                output = pd.merge(left=column_data_frame, right=label_encoder, how='left',
                                  left_on='column', right_on='value_name')
                self.summary = output[['column', 'Encode']].drop_duplicates().reset_index(drop=True)
                self.dataset[index] = output.Encode
                self.table_encoder.update({index: self.summary})

        # ---- Explain how to use the encode table ---- #
        print('''\nLabel encoding completed successfully.\n
                 Next steps:\n
                 1. To view table_encoder, execute the following:\n
                    for index in table_encoder:
                        print(f'\\n{index} \\n', table_encoder[index])

                 2. For the inverse, execute the following:\n
                    df = label_encoder_contain_missing_values().
                         inverse_numeric_to_categorical(table_encoder, df) ''')

        return self.table_encoder, self.dataset

    def inverse_numeric_to_categorical(self, table_encoder, df):
        for column in table_encoder.keys():
            df_column = df[column].to_frame()
            # Map every code back to its original value through the lookup table
            output = pd.merge(left=df_column, right=table_encoder[column], how='left',
                              left_on=column, right_on='Encode')
            df[column] = output['column']
        print('\nInverse label encoding, from numerical to categorical, completed successfully.\n')
        return df
            
Command to convert from categorical to numerical:
table_encoder, df = label_encoder_contain_missing_values().categorical_to_numeric(df)

Command to convert from numerical to categorical:
df = label_encoder_contain_missing_values().inverse_numeric_to_categorical(table_encoder, df)

score 2 · Answer 6 · answered Feb 26 '19 at 12:41

An easy way is the following, using the Titanic dataset as an example:

from sklearn.preprocessing import LabelEncoder

LABEL_COL = ["Sex", "Embarked"]

def label(df):
    _df = df.copy()
    le = LabelEncoder()
    for col in LABEL_COL:
        # Not NaN index
        idx = ~_df[col].isna()
        _df.loc[idx, col] \
            = le.fit(_df.loc[idx, col]).transform(_df.loc[idx, col])
    return _df

chankane

Nice, you can use fit_transform at once: _df.loc[idx, col] = le.fit_transform(_df.loc[idx, col]) – user1925772 Jan 05 '22 at 23:59

score 2 · Answer 7 · answered Aug 16 '19 at 16:16

The most voted answer by @Kerem has typos, therefore I am posting the corrected and improved answer here:

from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
a = pd.DataFrame(['A','B','C',np.nan,'D','A'])
for j in a.columns.values:
    le = LabelEncoder()
    ### fit with the desired col, col in position 0 for this example
    fit_by = pd.Series([i for i in a[j].unique() if type(i) == str])
    le.fit(fit_by)
    ### Set transformed col leaving np.NaN as they are
    a["transformed"] = a[j].apply(lambda x: le.transform([x])[0] if type(x) == str else x)

score 2 · Answer 8 · answered Sep 26 '19 at 03:10

You can handle missing values by replacing them with the string 'NaN'. The category can then be obtained with le.transform().

le.fit_transform(a.fillna('NaN'))
category = le.transform(['NaN'])

Another solution is to cast the data to string, so that missing values become the string 'nan' and are encoded as a category of their own.

a = le.fit_transform(a.astype(str))

score 1 · Answer 9 · edited Mar 13 '17 at 08:45

You can fill the NAs with some value and then change the DataFrame column type to string to make things work.

from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
a = pd.DataFrame(['A','B','C',np.nan,'D','A'])
a = a.fillna(99)  # fillna returns a new DataFrame, so assign the result
le = LabelEncoder()
le.fit_transform(a.astype(str))

edited Mar 13 '17 at 08:45

Alexander Farber

answered Mar 13 '17 at 08:18

raghu nanden

score 1 · Answer 10 · answered Oct 02 '18 at 05:32

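The following encoder handles None values in each column.

from collections import defaultdict

import pandas as pd
from sklearn import preprocessing

class MultiColumnLabelEncoder:
    def __init__(self):
        self.columns = None
        # One LabelEncoder per column, created on first access
        self.led = defaultdict(preprocessing.LabelEncoder)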

    def fit(self, X):
        self.columns = X.columns
        for col in self.columns:
            cat = X[col].unique()
            cat = [x if x is not None else "None" for x in cat]
            self.led[col].fit(cat)
        return self

    def fit_transform(self, X):
        if self.columns is None:
            self.fit(X)
        return self.transform(X)

    def transform(self, X):
        return X.apply(lambda x:  self.led[x.name].transform(x.apply(lambda e: e if e is not None else "None")))

    def inverse_transform(self, X):
        return X.apply(lambda x: self.led[x.name].inverse_transform(x))

Usage example

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', None, 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego',
                 None]
})


print(df)

   location     owner    pets
0  San_Diego     Champ     cat
1   New_York       Ron     dog
2   New_York     Brick     cat
3  San_Diego      None  monkey
4  San_Diego  Veronica     dog
5       None       Ron     dog

le = MultiColumnLabelEncoder()
le.fit(df)

transformed = le.transform(df)
print(transformed)

   location  owner  pets
0         2      1     0
1         0      3     1
2         0      0     0
3         2      2     2
4         2      4     1
5         1      3     1

inverted = le.inverse_transform(transformed)
print(inverted)

    location     owner    pets
0  San_Diego     Champ     cat
1   New_York       Ron     dog
2   New_York     Brick     cat
3  San_Diego      None  monkey
4  San_Diego  Veronica     dog
5       None       Ron     dog

score 1 · Answer 11 · answered Sep 08 '20 at 18:36

This function takes a column from a DataFrame and returns the column with only the non-NaN values label encoded; the rest remain untouched.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

def label_encode_column(col):
    nans = col.isnull()
    nan_lst = []
    nan_idx_lst = []
    label_lst = []
    label_idx_lst = []

    for idx, nan in enumerate(nans):
        if nan:
            nan_lst.append(col[idx])
            nan_idx_lst.append(idx)
        else:
            label_lst.append(col[idx])
            label_idx_lst.append(idx)

    nan_df = pd.DataFrame(nan_lst, index=nan_idx_lst)
    label_df = pd.DataFrame(label_lst, index=label_idx_lst) 

    label_encoder = LabelEncoder()
    label_df = label_encoder.fit_transform(label_df.astype(str))
    label_df = pd.DataFrame(label_df, index=label_idx_lst)
    final_col = pd.concat([label_df, nan_df])
    
    return final_col.sort_index()

score 0 · Answer 12 · answered Oct 07 '18 at 03:49

This is how I did it:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

UNKNOWN_TOKEN = '<unknown>'
a = pd.Series(['A','B','C', 'D','A'], dtype=str).unique().tolist()
a.append(UNKNOWN_TOKEN)
le = LabelEncoder()
le.fit_transform(a)
embedding_map = dict(zip(le.classes_, le.transform(le.classes_)))

and when applying to new test data:

test_df = test_df.apply(lambda x: x if x in embedding_map else UNKNOWN_TOKEN)
le.transform(test_df)
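
To make the idea above concrete, here is a self-contained sketch in which both missing values and labels never seen during fitting fall back to the unknown token. The `train` and `test` Series are made-up data for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

UNKNOWN_TOKEN = "<unknown>"

# Fit on the known training labels plus the fallback token
train = pd.Series(["A", "B", "C", "D", "A"])
le = LabelEncoder()
le.fit(train.unique().tolist() + [UNKNOWN_TOKEN])
known = set(le.classes_)

# Test data with a missing value and a label not seen during fitting
test = pd.Series(["A", np.nan, "Z", "D"])

# Anything that is not a known label (including NaN) becomes the token
cleaned = test.map(lambda x: x if x in known else UNKNOWN_TOKEN)
codes = le.transform(cleaned)

print(list(codes))  # [1, 0, 0, 4]
```

Note that this lumps missing values and unseen labels into one category; if the two need to stay distinguishable, a second placeholder token would be required.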

score 0 · Answer 13 · answered May 10 '19 at 08:23

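I also wanted to contribute my workaround, as I found the others a bit tedious when working with categorical data that contains missing values.

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Create a random dataframe
foo = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))

# Randomly intersperse column 'A' with missing data (NaN);
# cast to float first, since NaN cannot be stored in an integer column
foo['A'] = foo['A'].astype(float)
foo.loc[np.random.randint(0,len(foo), size=20), 'A'] = np.nan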

# Convert this series to string, to simulate our problem
series = foo['A'].astype(str)

# np.nan are converted to the string "nan", mask these out
mask = (series == "nan")

# Apply the LabelEncoder to the unmasked series, replace the masked series with np.nan
series[~mask] = LabelEncoder().fit_transform(series[~mask])
series[mask] = np.nan

foo['A'] = series

score 0 · Answer 14 · edited Sep 03 '21 at 09:53

This is my attempt!

import numpy as np
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Encode the incomplete Cabin feature; astype(str) turns NaN into the string 'nan'
titanic_train_le['Cabin'] = le.fit_transform(titanic_train_le['Cabin'].astype(str))
# Get the code assigned to 'nan' for the Cabin feature
cabin_nan_code = le.transform(['nan'])[0]
# Restore the missing values in the encoded data
titanic_train_le['Cabin'] = titanic_train_le['Cabin'].replace(cabin_nan_code, np.nan)

answered Sep 03 '21 at 04:57

Baligh

Please provide additional details in your answer. As it's currently written, it's hard to understand your solution. – Community Sep 03 '21 at 05:06
Please add further details to expand on your answer, such as working code or documentation citations. – Community Sep 03 '21 at 10:23

score 0 · Answer 15 · answered Jan 09 '23 at 23:37

I just created my own encoder, which can encode a whole dataframe at once. With this class, None is encoded as 0. It can be handy when trying to build a sparse matrix. Note that the input dataframe must contain categorical columns only.

import pandas as pd

class DF_encoder():
    def __init__(self):
        # None is always mapped to 0
        self.mapping = {None : 0}
        self.inverse_mapping = {0 : None}
        self.all_keys =[]

    def fit(self,df:pd.DataFrame):
        # Collect the unique values of every column into one shared vocabulary
        for col in df.columns:
            keys = list(df[col].unique())
            self.all_keys += keys
        self.all_keys = list(set(self.all_keys))
        for i , item in enumerate(start=1 ,iterable=self.all_keys):
            if item not in self.mapping.keys():
                self.mapping[item] = i
                self.inverse_mapping[i] = item

    def transform(self,df):
        temp_df = pd.DataFrame()
        for col in df.columns:
            temp_df[col] = df[col].map(self.mapping)
        return temp_df

    def inverse_transform(self,df):
        temp_df = pd.DataFrame()
        for col in df.columns:
            temp_df[col] = df[col].map(self.inverse_mapping)
        return temp_df

score -1 · Answer 16 · answered Aug 29 '16 at 15:11

I faced the same problem, but none of the above worked for me. So I added a new row to the training data consisting only of "nan".
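
One possible reading of this idea, sketched under assumptions (the `train` and `test` Series and the choice of `fillna` are illustrative, not from the answer): fit the encoder on the training labels plus one extra "nan" entry, so that missing values in later data can be mapped to that category instead of raising an error for an unseen label.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Training labels without missing values, test labels with one
train = pd.Series(["A", "B", "C"])
test = pd.Series(["B", np.nan])

# Add one extra "nan" row so the encoder knows this category
train_with_nan = pd.concat([train, pd.Series(["nan"])], ignore_index=True)
le = LabelEncoder().fit(train_with_nan)

# Turn missing values into the same "nan" string before transforming
test_codes = le.transform(test.fillna("nan"))

print(list(le.classes_))  # ['A', 'B', 'C', 'nan']
print(list(test_codes))   # [1, 3]
```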

silent_dev
