Python - Attempting to create binary features from a column with lists of strings

Question

It was hard for me to come up with a clear title, but an example should make things clearer.

Index C1
1     [dinner]
2     [brunch, food]
3     [dinner, fancy]

Now, I'd like to create a set of binary features for each of the unique values in this column.

The example above would turn into:

Index C1               dinner  brunch  fancy food
1     [dinner]         1       0       0     0
2     [brunch, food]   0       1       0     1
3     [dinner, fancy]  1       0       1     0

Any help would be much appreciated.

Possible duplicate of [Pandas convert a column of list to dummies](https://stackoverflow.com/questions/29034928/pandas-convert-a-column-of-list-to-dummies) — Lev Zakharov, Aug 13 '18 at 00:50
Look up creating dummy variables in python. Plenty of material out there on this already. https://stackoverflow.com/questions/11587782/creating-dummy-variables-in-pandas-for-python — Eric, Aug 13 '18 at 00:51
Possible duplicate of [Creating dummy variables in pandas for python](https://stackoverflow.com/questions/11587782/creating-dummy-variables-in-pandas-for-python) — Eric, Aug 13 '18 at 00:51

score 2 · Answer 1 · answered Aug 13 '18 at 00:55

For a performant solution, I recommend creating a new DataFrame by listifying your column.

pd.get_dummies(pd.DataFrame(df.C1.tolist()), prefix='', prefix_sep='')

   brunch  dinner  fancy  food
0       0       1      0     0
1       1       0      0     1
2       0       1      1     0

This is going to be much faster than df.C1.apply(pd.Series), which has to build a separate Series for every row.
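For reference, here is a minimal self-contained sketch of the approach above, rebuilding the example frame from the question. The names positional, dummies and result are only illustrative. It passes dtype=int because recent pandas versions return boolean dummies by default, and it joins the dummies back onto the original frame to get the layout asked for in the question.

```python
import pandas as pd

# Example data from the question; "Index" is assumed to be the DataFrame index.
df = pd.DataFrame(
    {"C1": [["dinner"], ["brunch", "food"], ["dinner", "fancy"]]},
    index=pd.Index([1, 2, 3], name="Index"),
)

# One column per list position; shorter lists are padded with missing values.
positional = pd.DataFrame(df["C1"].tolist(), index=df.index)

# Empty prefix and separator so the dummy columns are named after the labels.
dummies = pd.get_dummies(positional, prefix="", prefix_sep="", dtype=int)

# Attach the indicator columns to the original frame.
result = df.join(dummies)
print(result)
```

Passing index=df.index when building positional keeps the row labels aligned, so the join does not depend on the default 0..n-1 index.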

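This works as long as no value shows up in more than one list position, whether repeated within a single list (e.g., ['dinner', ..., 'dinner']) or sitting at different positions in different lists. Otherwise get_dummies produces several columns with the same name, and you'll need an extra groupby step to merge them: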

(pd.get_dummies(
    pd.DataFrame(df.C1.tolist()), prefix='', prefix_sep='')
   .groupby(level=0, axis=1)
   .sum())

Bear in mind that if a value repeats within a single list, the summed result holds counts rather than 0/1 flags, so what you get isn't strictly "binary" anymore.
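As far as I know, the axis=1 argument to groupby is deprecated on recent pandas releases, so here is a sketch of the same merge step done by transposing instead. The sample data is made up to include both a repeated value and a value at different list positions, and clip is just one way to get back to 0/1 indicators.

```python
import pandas as pd

# Hypothetical data: "dinner" repeats within the last list and also
# appears at different positions across lists.
df = pd.DataFrame(
    {"C1": [["dinner"], ["brunch", "dinner"], ["dinner", "fancy", "dinner"]]}
)

positional = pd.DataFrame(df["C1"].tolist(), index=df.index)
dummies = pd.get_dummies(positional, prefix="", prefix_sep="", dtype=int)

# dummies now has several columns named "dinner". Transpose so the labels
# become the index, sum rows sharing a label, then transpose back.
counts = dummies.T.groupby(level=0).sum().T

# counts can exceed 1 for repeated values; cap at 1 for binary features.
binary = counts.clip(upper=1)
print(binary)
```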

score 2 · Answer 2 · answered Aug 13 '18 at 01:41

You could also use MultiLabelBinarizer from scikit-learn:

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
# assumes pandas is imported as pd and that Index is a regular column of df
pd.DataFrame(mlb.fit_transform(df.C1), columns=mlb.classes_, index=df.Index).reset_index()
Out[970]: 
   Index  brunch  dinner  fancy  food
0      1       0       1      0     0
1      2       1       0      0     1
2      3       0       1      1     0
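
A self-contained sketch of the same idea, assuming scikit-learn is installed and that the row labels live in the DataFrame index rather than in a separate Index column (the names encoded and result are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Example data from the question, with "Index" as the DataFrame index.
df = pd.DataFrame(
    {"C1": [["dinner"], ["brunch", "food"], ["dinner", "fancy"]]},
    index=pd.Index([1, 2, 3], name="Index"),
)

mlb = MultiLabelBinarizer()

# fit_transform returns a 0/1 array with one column per label in mlb.classes_.
encoded = pd.DataFrame(
    mlb.fit_transform(df["C1"]),
    columns=mlb.classes_,
    index=df.index,
)

# Keep the original list column next to the indicator columns.
result = df.join(encoded)
print(result)
```

Since the binarizer only records whether a label is present, a value repeated within one list should still come out as 1, so no extra groupby step is needed here.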
