0

I am trying to find an elegant way to label encode multiple columns in a Pandas dataframe with the same encoder. Suppose I have two columns that correspond to two-sport teams playing a match.

The solution that works for me is taking those two columns, converting to numpy array, flattening them, encoding, and then unflattening them and sticking them back into the dataframe. However, this seems to be more complicated than it should be, especially using the pandas+scikitlearn combination.

.

I will be happy with all the solutions, thanks in advance!

P.S. I want something similar to the same shared encoder for columns, but I don't want to install an external library just for this.

EDIT UPDATE: I have tried this example suggested by Ben Reiniger, but it is not what I want. More specifically:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
data = pd.DataFrame({'Name':['Tom', 'Jack', 'Steve', 'Ricky'],
'Name1':['Mary','Tom', 'Ricky', 'Steve'],})

oe = OrdinalEncoder()

t_data = oe.fit_transform(data)

The result I get is the following:

[[3. 0.]
 [0. 3.]
 [2. 1.]
 [1. 2.]]

For the data:

<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th></th>\n      <th>Name</th>\n      <th>Name1</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>Tom</td>\n      <td>Mary</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>Jack</td>\n      <td>Tom</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>Steve</td>\n      <td>Ricky</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>Ricky</td>\n      <td>Steve</td>\n    </tr>\n  </tbody>\n</table>'

So as you can see Jack and Mary are encoded as the number - "0.0" It means that the encoder takes all the columns individually.

The result I would like to have for this example is encoding all the columns as if they were one:

Jack - 0 
Mary - 1 
Ricky - 2 
Steve - 3 
Tom - 4

Even though Jack is only encountered in the first column and Mary in the second.

The resulting array should look like the following:

[[4. 1.]
 [0. 4.]
 [3. 2.]
 [2. 3.]]

Ideally, it should stay the pandas dataframe.

EDIT UPDATE 2: My current code that works for me and I want to optimize is the following:

# data is Pandas dataframe
data = data.to_numpy()
X = data[:, [3, 4]] # I choose the columns that I want to encode
X = X.flatten()
label_encoder = LabelEncoder()
X = label_encoder.fit_transform(X)
data[:, [3, 4]] = np.reshape(X, (-1, 2))
data = pd.DataFrame(data=data, columns=names)
beapea
  • 5
  • 4

1 Answers1

0

You need to fit a LabelEncoder on the set of unique values, which you can find by finding each column's unique values and concatenating them:

name_uniques = data.Name.unique()
name1_uniques = data.Name1.unique()

uniques = np.unique(np.concatenate((name_uniques,name1_uniques),0))

Then you can fit your encoder on the uniques and transform your data:

label_encoder = LabelEncoder()
label_encoder.fit(uniques)

data.Name = label_encoder.transform(data.Name)
data.Name1 = label_encoder.transform(data.Name1)

Note that if it's actually sports teams, this method likely won't give you great results, because label encoding is giving the values ordinality where there isn't any.

Chris Schmitz
  • 618
  • 1
  • 6
  • 16