How to one-hot-encode from a pandas column composed of a list of space containing strings?

Question

I am trying to convert a single pandas column that is made up of a list of strings (containing spaces) into a one-hot-encoded columns (inputted as a string as a result of reading from excel). I have tried the approaches discussed in the solutions to How to one-hot-encode from a pandas column containing a list? but they don't work in this scenario- the individual letters are broken into columns. The other condition is that sometimes the lists can be empty as well.

More concretely, for dataframe df

Col1   Col2         Col3
 C      4     "['Chocolate cake', 'White wine', 'Peanuts']"
 A      1.7   "[]"
 B      1     "['Chocolate cake', 'Salmon']"

I would like to create:

new_df

Col1   Col2   Chocolate cake White wine Peanuts Salmon
 C      4             1           1        1      0
 A      1.7           0           0        0      0
 B      1             1           0        0      1

What is a pythonic way to create this dataframe?

EDIT:

Code for generating the pandas dataframe.

pd.DataFrame.from_dict({'Col1':['C', 'A', 'B'], 'Col2':[4, 1.7, 1], 'Col3':["['Chocolate cake', 'White wine', 'Peanuts']", \
                                                     "[]",  "['Chocolate cake', 'Salmon']" ]})

Can you code to reproduce the sample data? – Divakar Jun 11 '20 at 05:50 — Divakar, Jun 11 '20 at 05:50

Divakar · Accepted Answer · 2020-06-11T06:06:09.060

Here's one way -

import ast
dfC3 = [ast.literal_eval(i) for i in df.Col3]
ids,U = pd.factorize(np.concatenate(dfC3))
df_out = pd.DataFrame([np.isin(U,i) for i in dfC3], columns=U).astype(int)

Sample o/p -

In [50]: df_out
Out[50]: 
   Chocolate cake  Peanuts  Salmon  White wine
0               1        1       0           1
1               0        0       0           0
2               1        0       1           0

If you need it to be concatenated with input df, use pd.concat([df,df_out],axis=1).

More performant with array-assignment

We can use array-assignment to hopefully get more performance, if needed for large datasets (re-using ids,U from earlier metrhod) -

lens = list(map(len,dfC3))
mask = np.zeros((len(lens),len(U)), dtype=int)
mask[np.repeat(range(len(lens)),lens), ids] = 1
df_out = pd.DataFrame(mask, columns=U)

Thanks Divakar. I made a critical change to the question which prevents this from working for my use case, apologies for that. The input is encoded as a string because of how pd.read_excel works, unfortunately. — Melsauce, Jun 11 '20 at 05:49

score 1 · Answer 2 · answered Jun 11 '20 at 05:50

One way using pd.Series.str.findall, join and get_dummies:

new_df = df.pop("Col3").str.findall("'(.+?)'").str.join("|").str.get_dummies()
new_df = pd.concat([df, new_df], 1)
print(new_df)

Output:

  Col1  Col2  Chocolate cake  Peanuts  Salmon  White wine
0    C   4.0               1        1       0           1
1    A   1.7               0        0       0           0
2    B   1.0               1        0       1           0

score 1 · Answer 3 · answered Jun 11 '20 at 05:52

You just need to convert Col3 to series of list by using ast.literal_eval

import ast
from sklearn.preprocessing import MultiLabelBinarizer

s = df['Col3'].map(ast.literal_eval)
mlb = MultiLabelBinarizer()

df_mlb = pd.DataFrame(mlb.fit_transform(s), columns=mlb.classes_, index=df.index)

Out[29]:
   Chocolate cake  Peanuts  Salmon  White wine
0               1        1       0           1
1               0        0       0           0
2               1        0       1           0

How to one-hot-encode from a pandas column composed of a list of space containing strings?

3 Answers3