
It was hard for me to come up with a clear title, but an example should make things clearer.

Index C1
1     [dinner]
2     [brunch, food]
3     [dinner, fancy]

Now, I'd like to create a set of binary features for each of the unique values in this column.

The example above would turn into:

Index C1               dinner  brunch  fancy food
1     [dinner]         1       0       0     0
2     [brunch, food]   0       1       0     1
3     [dinner, fancy]  1       0       1     0

Any help would be much appreciated.

madsthaks
  • 3
    Possible duplicate of [Pandas convert a column of list to dummies](https://stackoverflow.com/questions/29034928/pandas-convert-a-column-of-list-to-dummies) – Lev Zakharov Aug 13 '18 at 00:50
  • 3
    Look up creating dummy variables in python. Plenty of material out there on this already. https://stackoverflow.com/questions/11587782/creating-dummy-variables-in-pandas-for-python – Eric Aug 13 '18 at 00:51
  • 3
    Possible duplicate of [Creating dummy variables in pandas for python](https://stackoverflow.com/questions/11587782/creating-dummy-variables-in-pandas-for-python) – Eric Aug 13 '18 at 00:51

2 Answers


For a performant solution, I recommend creating a new DataFrame by listifying your column.

pd.get_dummies(pd.DataFrame(df.C1.tolist()), prefix='', prefix_sep='')

   brunch  dinner  fancy  food
0       0       1      0     0
1       1       0      0     1
2       0       1      1     0

This is going to be much faster than expanding the column with apply(pd.Series), which builds a separate Series object for every row.
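To check the speed claim on your own data, here is a minimal sketch. It rebuilds the question's frame (the construction and the names `big`, `t_tolist`, and `t_apply` are illustrative assumptions, not from the original post), passes `dtype=int` so the output stays 0/1 on pandas versions that default to booleans, and times both ways of expanding the lists. Timings vary by machine and pandas version, so no numbers are shown.

```python
import timeit

import pandas as pd

# Sample frame from the question (constructed here for illustration).
df = pd.DataFrame({
    "Index": [1, 2, 3],
    "C1": [["dinner"], ["brunch", "food"], ["dinner", "fancy"]],
})

# Both expressions expand the lists into one column per list position.
via_tolist = pd.DataFrame(df.C1.tolist())
via_apply = df.C1.apply(pd.Series)

# dtype=int keeps 0/1 output on pandas versions that default to booleans.
dummies = pd.get_dummies(via_tolist, prefix="", prefix_sep="", dtype=int)
print(dummies)

# Rough timing on a repeated copy of the column (hypothetical size).
big = pd.concat([df.C1] * 500, ignore_index=True)
t_tolist = timeit.timeit(lambda: pd.DataFrame(big.tolist()), number=1)
t_apply = timeit.timeit(lambda: big.apply(pd.Series), number=1)
print(f"tolist: {t_tolist:.4f}s  apply(pd.Series): {t_apply:.4f}s")
```

The gap usually widens as the number of rows grows, since the per-row overhead of apply(pd.Series) scales with the row count.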

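This works as long as each value always lands in the same list position, as in the example. If a value can appear at different positions across rows, or more than once in the same list (e.g., ['dinner', ..., 'dinner']), get_dummies emits several columns with the same name, and you'll need an extra groupby step to merge them: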

(pd.get_dummies(
    pd.DataFrame(df.C1.tolist()), prefix='', prefix_sep='')
   .T.groupby(level=0)  # transpose first: groupby(axis=1) is deprecated in recent pandas
   .sum().T)

Note that if a list does contain the same value more than once, the summed result holds counts, so it isn't strictly "binary" anymore.
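A small sketch of the repeated-value case, on made-up data (the frame below and the names `wide`, `counts`, and `binary` are hypothetical). It shows the same-named columns being merged into counts, and one possible way to cap them at 1 if you still want indicators. The `clip` step is an addition, not part of the answer above.

```python
import pandas as pd

# Hypothetical data: 'dinner' occurs at different positions and twice in one list.
df = pd.DataFrame({"C1": [["dinner", "dinner"], ["brunch", "dinner"], ["food"]]})

# One dummy column per (list position, value) pair, so 'dinner' shows up twice.
wide = pd.get_dummies(pd.DataFrame(df.C1.tolist()), prefix="", prefix_sep="", dtype=int)

# Merge same-named columns by summing; transposing avoids groupby(axis=1).
counts = wide.T.groupby(level=0).sum().T

# Cap each count at 1 to recover binary indicators.
binary = counts.clip(upper=1)

print(counts)
print(binary)
```

Here `counts` records how often each value occurs per row, while `binary` only records whether it occurs at all.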

cs95

Another option is scikit-learn's MultiLabelBinarizer:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# Assumes 'Index' is a regular column of df, as in the question's example.
pd.DataFrame(mlb.fit_transform(df.C1), columns=mlb.classes_, index=df.Index).reset_index()
   Index  brunch  dinner  fancy  food
0      1       0       1      0     0
1      2       1       0      0     1
2      3       0       1      1     0
BENY