
Hi,

My dataset has a column "City" containing many different city names, which I would like to encode using LabelEncoder(). However, the result contains negative values, which I did not expect:

df['city_enc'] = LabelEncoder().fit_transform(df['City']).astype('int8')

The new city_enc column contains values from -128 to 127. Why does LabelEncoder().fit_transform give me negative values? I expected values from 0 to n-1. Can anyone explain this?

Best regards, Lan Nguyen

  • How many unique values are in the `City` column? If you have more than 128, then you will obtain negative values. Simply drop the `astype('int8')` conversion or use a larger datatype, such as `int16` or `int32`. – Alexandru Dinu Jul 01 '21 at 12:04
  • Thanks so much, @AlexandruDinu. I have changed it to int16. City name has 1801 unique values. Now it works well. – Nguyen Ngoc Lan Jul 01 '21 at 12:25

2 Answers

Most likely this is because you are encoding more than 128 different cities, which is as many as fit into the non-negative int8 range 0 ... 127 (you can check the count with len(df['City'].unique())).

LabelEncoder itself always returns labels from 0 to n-1. The negative values come from the forced conversion to int8, which can represent only 256 different values (-128 ... 127): any label above 127 overflows and wraps around. For example, if you encode 129 different values and cast to int8, the labels 0 ... 127 are unchanged and the label 128 becomes -128. With more than 256 different values, several cities end up sharing the same label.
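To see what the int8 conversion does on its own, here is a minimal sketch using only NumPy. The array `labels` is an assumed stand-in for what LabelEncoder would return for 1801 distinct cities (the count mentioned in the comments), not output from the asker's actual data.

```python
import numpy as np

# Stand-in for LabelEncoder output with 1801 classes: labels 0 ... 1800.
labels = np.arange(1801)

# Same cast as in the question: values above 127 overflow and wrap around.
wrapped = labels.astype(np.int8)

print(wrapped[127], wrapped[128], wrapped[256])  # 127 -128 0
print(wrapped.min(), wrapped.max())              # -128 127
print(len(np.unique(wrapped)))                   # 256 distinct values remain
```

Only 256 distinct values survive the cast, so with 1801 cities many of them would share the same encoded label.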

One simple solution is to just drop the astype('int8') conversion:

df['city_enc'] = LabelEncoder().fit_transform(df['City']) # keeps the default integer dtype (int64 on most platforms)
Alexandru Dinu

Your problem is the conversion to type int8, which can only represent values from -128 to 127. Consider this example:

import pandas as pd
from sklearn.preprocessing import LabelEncoder


df = pd.DataFrame({
    # 129 distinct values: one more than fits into the int8 range 0 ... 127
    'City': list(range(129))
})

le = LabelEncoder()

Case 1:

df['City_enc1'] = le.fit_transform(df['City'])
print(df['City_enc1'])

0        0
1        1
2        2
3        3
4        4
      ... 
124    124
125    125
126    126
127    127
128    128
Name: City_enc1, Length: 129, dtype: int64

Case 2:

df['City_enc2'] = le.fit_transform(df['City']).astype('int8')
print(df['City_enc2'])

0        0
1        1
2        2
3        3
4        4
      ... 
124    124
125    125
126    126
127    127
128   -128
Name: City_enc2, Length: 129, dtype: int8

Note that in the second case the last label is -128 instead of 128. LabelEncoder still produces the labels 0 ... 128, but the conversion to int8 overflows and wraps the value 128 around to -128.

It is better not to convert at all, or to choose int16 or a larger type.
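If the goal of the cast was to save memory, one option is to pick the smallest signed integer type that can still hold the largest label. The sketch below assumes non-negative integer labels and a non-empty array; `downcast_labels` is a hypothetical helper written for this answer, not part of scikit-learn or NumPy, and `codes` is a stand-in for the encoder output.

```python
import numpy as np

def downcast_labels(codes):
    """Hypothetical helper: cast non-negative integer labels to the
    smallest signed integer dtype that can hold the largest label."""
    codes = np.asarray(codes)
    largest = codes.max()
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        if largest <= np.iinfo(dtype).max:
            return codes.astype(dtype)
    raise OverflowError("labels do not fit into int64")

# Stand-in for LabelEncoder().fit_transform(...) with 1801 classes.
codes = np.arange(1801)
small = downcast_labels(codes)

print(small.dtype, small.min(), small.max())  # int16 0 1800
```

With 128 or fewer classes this would still return int8, so the memory saving is kept whenever it is safe.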

afsharov