0

I am trying to convert a single pandas column that is made up of a list of strings (containing spaces) into a one-hot-encoded columns (inputted as a string as a result of reading from excel). I have tried the approaches discussed in the solutions to How to one-hot-encode from a pandas column containing a list? but they don't work in this scenario- the individual letters are broken into columns. The other condition is that sometimes the lists can be empty as well.

More concretely, for dataframe df

Col1   Col2         Col3
 C      4     "['Chocolate cake', 'White wine', 'Peanuts']"
 A      1.7   "[]"
 B      1     "['Chocolate cake', 'Salmon']" 

I would like to create:

new_df

Col1   Col2   Chocolate cake White wine Peanuts Salmon
 C      4             1           1        1      0
 A      1.7           0           0        0      0
 B      1             1           0        0      1

What is a pythonic way to create this dataframe?

EDIT:

Code for generating the pandas dataframe.

pd.DataFrame.from_dict({'Col1':['C', 'A', 'B'], 'Col2':[4, 1.7, 1], 'Col3':["['Chocolate cake', 'White wine', 'Peanuts']", \
                                                     "[]",  "['Chocolate cake', 'Salmon']" ]})
Melsauce
  • 2,535
  • 2
  • 19
  • 39

3 Answers3

1

Here's one way -

import ast
dfC3 = [ast.literal_eval(i) for i in df.Col3]
ids,U = pd.factorize(np.concatenate(dfC3))
df_out = pd.DataFrame([np.isin(U,i) for i in dfC3], columns=U).astype(int)

Sample o/p -

In [50]: df_out
Out[50]: 
   Chocolate cake  Peanuts  Salmon  White wine
0               1        1       0           1
1               0        0       0           0
2               1        0       1           0

If you need it to be concatenated with input df, use pd.concat([df,df_out],axis=1).


More performant with array-assignment

We can use array-assignment to hopefully get more performance, if needed for large datasets (re-using ids,U from earlier metrhod) -

lens = list(map(len,dfC3))
mask = np.zeros((len(lens),len(U)), dtype=int)
mask[np.repeat(range(len(lens)),lens), ids] = 1
df_out = pd.DataFrame(mask, columns=U)
Divakar
  • 218,885
  • 19
  • 262
  • 358
  • Thanks Divakar. I made a critical change to the question which prevents this from working for my use case, apologies for that. The input is encoded as a string because of how pd.read_excel works, unfortunately. – Melsauce Jun 11 '20 at 05:49
1

One way using pd.Series.str.findall, join and get_dummies:

new_df = df.pop("Col3").str.findall("'(.+?)'").str.join("|").str.get_dummies()
new_df = pd.concat([df, new_df], 1)
print(new_df)

Output:

  Col1  Col2  Chocolate cake  Peanuts  Salmon  White wine
0    C   4.0               1        1       0           1
1    A   1.7               0        0       0           0
2    B   1.0               1        0       1           0
Chris
  • 29,127
  • 3
  • 28
  • 51
1

You just need to convert Col3 to series of list by using ast.literal_eval

import ast
from sklearn.preprocessing import MultiLabelBinarizer

s = df['Col3'].map(ast.literal_eval)
mlb = MultiLabelBinarizer()

df_mlb = pd.DataFrame(mlb.fit_transform(s), columns=mlb.classes_, index=df.index)

Out[29]:
   Chocolate cake  Peanuts  Salmon  White wine
0               1        1       0           1
1               0        0       0           0
2               1        0       1           0
Andy L.
  • 24,909
  • 4
  • 17
  • 29