How do I perform One Hot Encoding on lists in a pandas column?

Question

Suppose I have a dataframe in which one column contains lists (of unknown values and length), for example:

df = pd.DataFrame(
 {'messageLabels': [['Good', 'Other', 'Bad'],['Bad','Terrible']]}
)

I came across this solution, but it isn't what I am looking for: How best to extract a Pandas column containing lists or tuples into multiple columns

In theory, the resulting df would look like:

messageLabels             | Good| Other| Bad| Terrible
--------------------------------------------------------
['Good', 'Other', 'Bad']  | True| True |True| False
--------------------------------------------------------
['Bad','Terrible']        |False|False |True| True


piRSquared · Answer 1 · 2019-05-24T20:56:25.540

Succinct

df.join(df.messageLabels.str.join('|').str.get_dummies().astype(bool))

        messageLabels   Bad   Good  Other  Terrible
0  [Good, Other, Bad]  True   True   True     False
1     [Bad, Terrible]  True  False  False      True
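
For anyone who wants to see what the one-liner is doing, here is a minimal step-by-step sketch of the same chain. The intermediate names (`joined`, `indicators`, `result`) are mine, not part of the answer, and it assumes no label contains the `|` character, since that is what the strings are split on.

```python
import pandas as pd

df = pd.DataFrame(
    {'messageLabels': [['Good', 'Other', 'Bad'], ['Bad', 'Terrible']]}
)

# Step 1: collapse each list into one delimited string, e.g. 'Good|Other|Bad'.
joined = df['messageLabels'].str.join('|')

# Step 2: split on the delimiter and build 0/1 indicator columns.
# '|' is also the default separator of str.get_dummies.
indicators = joined.str.get_dummies(sep='|')

# Step 3: turn the 0/1 integers into booleans and attach them to the original frame.
result = df.join(indicators.astype(bool))
print(result)
```

If your labels might contain `|`, pick another separator that cannot appear in the data and pass it to both `str.join` and `str.get_dummies`.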

`sklearn`

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
dum = mlb.fit_transform(df.messageLabels)

df.join(pd.DataFrame(dum.astype(bool), df.index, mlb.classes_))

        messageLabels   Bad   Good  Other  Terrible
0  [Good, Other, Bad]  True   True   True     False
1     [Bad, Terrible]  True  False  False      True

Overdone

n = len(df)
i = np.arange(n)
l = [*map(len, df.messageLabels)]
j, u = pd.factorize(np.concatenate(df.messageLabels))

o = np.zeros((n, len(u)), bool)
o[i.repeat(l), j] = True

df.join(pd.DataFrame(o, df.index, u))

        messageLabels   Good  Other   Bad  Terrible
0  [Good, Other, Bad]   True   True  True     False
1     [Bad, Terrible]  False  False  True      True
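
The NumPy version is terse, so here is the same idea with descriptive names and comments, as a rough sketch. The variable names (`lengths`, `flat`, `codes`, `uniques`, `rows`, `onehot`) are my own; the logic is meant to mirror the snippet above: flatten the labels, factorize them into column positions, and set `True` at each (row, column) pair.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'messageLabels': [['Good', 'Other', 'Bad'], ['Bad', 'Terrible']]}
)

# Number of labels in each row.
lengths = df['messageLabels'].str.len().to_numpy()

# All labels flattened into one 1-D array.
flat = np.concatenate(df['messageLabels'].to_list())

# codes: integer column position for every flattened label;
# uniques: the distinct labels in order of first appearance.
codes, uniques = pd.factorize(flat)

# Row position for every flattened label: row i repeated lengths[i] times.
rows = np.arange(len(df)).repeat(lengths)

# Start from all False and switch on each (row, column) pair.
onehot = np.zeros((len(df), len(uniques)), dtype=bool)
onehot[rows, codes] = True

result = df.join(pd.DataFrame(onehot, index=df.index, columns=uniques))
print(result)
```

Because `pd.factorize` keeps order of first appearance, the columns come out as Good, Other, Bad, Terrible rather than sorted.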

Messing around

And inspired by Andy

df.join(pd.DataFrame([dict.fromkeys(x, True) for x in df.messageLabels]).fillna(False))

        messageLabels   Bad   Good  Other  Terrible
0  [Good, Other, Bad]  True   True   True     False
1     [Bad, Terrible]  True  False  False      True

score 4 · Accepted Answer · answered May 24 '19 at 20:28

Another way is to use the apply and the Series constructor:

In [11]: pd.get_dummies(df.messageLabels.apply(lambda x: pd.Series(1, x)) == 1)
Out[11]:
    Good  Other   Bad  Terrible
0   True   True  True     False
1  False  False  True      True

where

In [12]: df.messageLabels.apply(lambda x: pd.Series(1, x))
Out[12]:
   Good  Other  Bad  Terrible
0   1.0    1.0  1.0       NaN
1   NaN    NaN  1.0       1.0

To get your desired output:

In [21]: res = pd.get_dummies(df.messageLabels.apply(lambda x: pd.Series(1, x)) == 1)

In [22]: df[res.columns] = res

In [23]: df
Out[23]:
        messageLabels   Good  Other   Bad  Terrible
0  [Good, Other, Bad]   True   True  True     False
1     [Bad, Terrible]  False  False  True      True
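
To make the Series-constructor trick a bit more explicit, here is a small sketch of what happens per row. The names `one_row`, `wide`, and `flags` are hypothetical, and using `notna()` in place of the `== 1` comparison is my substitution, not what the answer does. It assumes labels are unique within each list, since repeated labels would give a Series with a duplicated index.

```python
import pandas as pd

df = pd.DataFrame(
    {'messageLabels': [['Good', 'Other', 'Bad'], ['Bad', 'Terrible']]}
)

# pd.Series(1, index=labels) broadcasts the scalar 1 across the given labels.
one_row = pd.Series(1, index=df['messageLabels'].iloc[0])

# apply() aligns the per-row Series on the union of labels;
# labels missing from a row show up as NaN.
wide = df['messageLabels'].apply(lambda labels: pd.Series(1, index=labels))

# notna() maps "label present" to True and the NaN gaps to False.
flags = wide.notna()
print(flags)
```

Building one Series per row is convenient but tends to be slow on large frames, so this is more of an explanation than a recommendation.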

answered May 24 '19 at 20:28

Andy Hayden


Heh! and explain this `pd.get_dummies(df.messageLabels.apply(lambda x: pd.Series(True, x)))` – piRSquared May 24 '19 at 20:51
@piRSquared so you can do this without the get_dummies using `df.messageLabels.apply(lambda x: pd.Series(True, x)).fillna(False)` – Andy Hayden May 24 '19 at 20:58
I added a version of that in my post. `dict.fromkeys` – piRSquared May 24 '19 at 21:01

cs95 · Answer 3 · 2019-05-24T20:09:26.750

I would do this using get_dummies and max (sum works just as well, since the result is cast to bool):

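tmp = pd.DataFrame(df['messageLabels'].tolist())
# max(level=0, axis=1) was removed in pandas 2.0; collapse the duplicate column labels via a transpose instead
pd.get_dummies(tmp, prefix='', prefix_sep='').T.groupby(level=0).max().T.astype(bool)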

    Bad   Good  Other  Terrible
0  True   True   True     False
1  True  False  False      True

You can combine this with df using join:

df.join(pd.get_dummies(tmp, prefix='', prefix_sep='')
          .T.groupby(level=0).max().T   # replaces max(level=0, axis=1), removed in pandas 2.0
          .astype(bool))

        messageLabels   Bad   Good  Other  Terrible
0  [Good, Other, Bad]  True   True   True     False
1     [Bad, Terrible]  True  False  False      True

You can also stack and pivot_table:

(pd.DataFrame(df['messageLabels'].tolist())
   .stack()
   .reset_index()
   .pivot_table(index='level_0', columns=0, aggfunc='size', fill_value=0)
   .astype(bool))

0         Bad   Good  Other  Terrible
level_0                              
0        True   True   True     False
1        True  False  False      True
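
A similar reshape-then-aggregate idea can be sketched with `Series.explode`, which is not used in the answer above and needs pandas 0.25 or later. Treat this as an alternative under that assumption; the names `long`, `onehot`, and `result` are mine.

```python
import pandas as pd

df = pd.DataFrame(
    {'messageLabels': [['Good', 'Other', 'Bad'], ['Bad', 'Terrible']]}
)

# One row per (original index, label); the original index value is repeated.
long = df['messageLabels'].explode()

# Indicator columns per label, then collapse back to one row per original index.
onehot = pd.get_dummies(long).groupby(level=0).max().astype(bool)

result = df.join(onehot)
print(result)
```

Grouping on `level=0` relies on `explode` keeping the original index, so this works with a non-default index as long as its values are unique.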
