Label encoding multiple columns with the same category

Question

Consider the following dataframe:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame(data=[["France", "Italy", "Belgium"], ["Italy", "France", "Belgium"]], columns=["a", "b", "c"])
df = df.apply(LabelEncoder().fit_transform)
print(df)

It currently outputs:

   a  b  c
0  0  1  0
1  1  0  0

My goal is to make it output something like this by passing in the columns that should share categorical values:

   a  b  c
0  0  1  2
1  1  0  2

unutbu · Accepted Answer · 2018-02-04T22:06:57.827

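Pass axis=1 to call LabelEncoder().fit_transform once for each row. (By default, df.apply(func) calls func once for each column.) Note that each row is then encoded independently, so the codes are only consistent across rows when every row contains the same set of values.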

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame(data=[["France", "Italy", "Belgium"], 
                        ["Italy", "France", "Belgium"]], columns=["a", "b", "c"])

encoder = LabelEncoder()

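# result_type="broadcast" keeps the original shape and column labels; since
# pandas 0.23 an array-returning function otherwise yields a Series of arrays
df = df.apply(encoder.fit_transform, axis=1, result_type="broadcast")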
print(df)

yields

   a  b  c
0  1  2  0
1  2  1  0

Alternatively, you could convert the data to category dtype and use the category codes as labels:

import pandas as pd

df = pd.DataFrame(data=[["France", "Italy", "Belgium"], 
                        ["Italy", "France", "Belgium"]], columns=["a", "b", "c"])

stacked = df.stack().astype('category')
result = stacked.cat.codes.unstack()
print(result)

also yields

   a  b  c
0  1  2  0
1  2  1  0

This should be significantly faster since it does not require calling encoder.fit_transform once for each row (which might give terrible performance if you have lots of rows).
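The two approaches only agree when every row holds the same set of values. Here is a minimal sketch of the difference, assuming "Italy" is swapped for "Sweden" in the first row (the case raised in the comments below). To keep the example free of scikit-learn, pd.Categorical(...).codes stands in for LabelEncoder().fit_transform; both assign codes in sorted label order for strings.

```python
import pandas as pd

# Same frame as above, but with "Italy" swapped for "Sweden" in the first row
df = pd.DataFrame(
    data=[["France", "Sweden", "Belgium"],
          ["Italy", "France", "Belgium"]],
    columns=["a", "b", "c"],
)

# Row-wise: each row gets its own sorted label set, so "Sweden" (row 0)
# and "Italy" (row 1) both end up with code 2
row_wise = pd.DataFrame(
    [pd.Categorical(row).codes for row in df.to_numpy()],
    index=df.index,
    columns=df.columns,
)

# Shared: one label set for the whole frame, so every label has a unique code
shared = df.stack().astype("category").cat.codes.unstack()

print(row_wise)
print(shared)
```

If the codes have to mean the same thing in every row, the stacked version is the one to use.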

cs95 · Answer 2 · 2018-06-22T15:39:18.097

2

You can do this with pd.factorize.

# keep the original df intact; work on the stacked Series instead
stacked = df.stack()
stacked[:] = pd.factorize(stacked)[0]
stacked.unstack()

   a  b  c
0  0  1  2
1  1  0  2

To encode only some of the columns in the dataframe:

temp = df[['a', 'b']].stack()
temp[:] = temp.factorize()[0]
df[['a', 'b']] = temp.unstack()

   a  b        c
0  0  1  Belgium
1  1  0  Belgium
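pd.factorize returns the unique labels along with the codes, so the code-to-label lookup comes for free. A minimal sketch on the question's frame; the names encoded and mapping are only illustrative, and flattening with to_numpy().ravel() is my own choice here instead of stack().

```python
import pandas as pd

df = pd.DataFrame(
    data=[["France", "Italy", "Belgium"],
          ["Italy", "France", "Belgium"]],
    columns=["a", "b", "c"],
)

# Flatten row by row, then factorize: codes follow the order of first appearance
codes, uniques = pd.factorize(df.to_numpy().ravel())

# Put the codes back into the original shape and labels
encoded = pd.DataFrame(codes.reshape(df.shape), index=df.index, columns=df.columns)

# code -> label lookup; a label's position in `uniques` is its code
mapping = dict(enumerate(uniques))

print(encoded)
print(mapping)
```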

edited Jun 22 '18 at 15:39

answered Feb 04 '18 at 22:09

cs95


Didn't work for me. It gives me 0 a 0 b 1 c 2 1 a 3 b 0 c 2 – Martin Feb 04 '18 at 22:21
@Martin: Check that you don't have a typo in your original `df` -- in particular "Italy" on the second row. – unutbu Feb 04 '18 at 22:26
You're right, my bad. I forgot I tried swapping out Italy with Sweden in the first row. What I meant with this question was it should apply for all rows, which the second method in the first answer does, although I can see the ambiguity in my definition now. – Martin Feb 04 '18 at 22:33

score 1 · Answer 3 · answered Feb 04 '18 at 22:40

If the encoding order doesn't matter, you can do:

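# encode every cell with one shared encoder, then restore the original shape
encoded = LabelEncoder().fit_transform(df.values.flatten()).reshape(df.shape)
df_new = pd.DataFrame(encoded, index=df.index, columns=df.columns)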

df_new
Out[27]: 
   a  b  c
0  1  2  0
1  2  1  0
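If you also need the mapping, keep a reference to the fitted encoder instead of creating it inline. A rough sketch, assuming scikit-learn is installed; the variable names are illustrative. classes_ holds the sorted labels, and inverse_transform turns codes back into labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame(
    data=[["France", "Italy", "Belgium"],
          ["Italy", "France", "Belgium"]],
    columns=["a", "b", "c"],
)

# Fit one encoder on all cells at once
encoder = LabelEncoder()
flat = encoder.fit_transform(df.to_numpy().ravel())
df_new = pd.DataFrame(flat.reshape(df.shape), index=df.index, columns=df.columns)

# classes_ is sorted; the position of a label in it is that label's code
mapping = {label: code for code, label in enumerate(encoder.classes_)}

# inverse_transform maps the codes back to the original labels
decoded = encoder.inverse_transform(flat).reshape(df.shape)

print(mapping)
```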

Allen Qin


Is there a way to get the mapping here? And: I actually think you solved a problem [here](https://stackoverflow.com/posts/51030443/edit) – Christopher Jun 25 '18 at 19:23

score 0 · Answer 4 · answered Jun 22 '18 at 15:18

Here's an alternative solution using categorical data. It is similar to @unutbu's but preserves the order of first appearance, so the first value found gets code 0 (values are scanned column by column because of the transpose).

df = pd.DataFrame(data=[["France", "Italy", "Belgium"],
                        ["Italy", "France", "Belgium"]],
                  columns=["a", "b", "c"])

# get unique values in order
vals = df.T.stack().unique()

# convert to categories and then extract codes
for col in df:
    df[col] = pd.Categorical(df[col], categories=vals)
    df[col] = df[col].cat.codes

print(df)

   a  b  c
0  0  1  2
1  1  0  2
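The same idea can be wrapped so that you pass in the columns that should share categories, as the question asks. This is only a sketch: encode_shared is a made-up helper name, and it scans the selected columns row by row for first appearance, unlike the column-by-column scan above.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def encode_shared(df, columns):
    """Return a copy of df where `columns` hold codes from one shared category set."""
    # Categories in order of first appearance, scanning the selected columns row by row
    categories = pd.unique(df[columns].to_numpy().ravel())
    shared_dtype = CategoricalDtype(categories=categories)

    out = df.copy()
    for col in columns:
        # Same dtype for every column, so equal labels get equal codes
        out[col] = df[col].astype(shared_dtype).cat.codes
    return out


df = pd.DataFrame(
    data=[["France", "Italy", "Belgium"],
          ["Italy", "France", "Belgium"]],
    columns=["a", "b", "c"],
)

result = encode_shared(df, ["a", "b"])
print(result)
```

Columns that are not listed (here "c") are left untouched, and the input frame is not modified.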

jpp


`pd.factorize()` does the exact same thing na ? – Bharath M Shetty Jun 22 '18 at 15:22
@Dark, That's true. The only reason you'd choose this solution over factorize is you intend to use categorical features (e.g. validation) and code view is just an alternative representation. – jpp Jun 22 '18 at 15:25
