How to create dummies from list with multiple values and predefined categories?

Question

I'd like to transform this:

In [4]: df
Out[4]:
      label
0     (a, e)
1     (a, d)
2       (b,)
3     (d, e)

into this:

   a  b  c  d  e
0  1  0  0  0  1
1  1  0  0  1  0
2  0  1  0  0  0
3  0  0  0  1  1

As you can see, the columns 'a', 'b', 'c', 'd', 'e' are predefined: 'c' is all zeros but must still exist in the output.
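For anyone who wants to reproduce the example, here is a minimal sketch of the input. It assumes that each entry in `label` is a tuple of strings, which is what the printed frame suggests but is not confirmed in the question. The name `categories` is only an illustrative variable for the predefined columns.

```python
import pandas as pd

# Assumption: every entry of `label` is a tuple of strings, e.g. ('a', 'e').
df = pd.DataFrame({"label": [("a", "e"), ("a", "d"), ("b",), ("d", "e")]})

# The predefined columns, including 'c', which never occurs in the data.
categories = list("abcde")

print(df)
```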

I tried several things, such as df['label'].str.join('|').str.get_dummies(), at first without the full set of columns, just to get dummies from input with multiple values. Now I want to add the predefined columns to the result.

Thank you for your help !

Is your `label` column a series of tuples or of strings? – Quang Hoang, Jul 08 '19 at 17:01

ALollz · Answer 1 · 2019-07-08T17:09:10.957

3

Create a new DataFrame from the tuples, then stack it and call get_dummies. Finally, take any over the original index level to collapse the dummies back to one row per label.

# .any(level=0) was removed in pandas 2.0; groupby(level=0).any() is the equivalent
pd.get_dummies(pd.DataFrame([*df.label], index=df.index).stack()).groupby(level=0).any().astype(int)

   a  b  d  e
0  1  0  0  1
1  1  0  1  0
2  0  1  0  0
3  0  0  1  1

Because you have predefined columns, we can reindex on them and fill the missing ones with 0.

# groupby(level=0).any() replaces .any(level=0), which was removed in pandas 2.0
res = pd.get_dummies(pd.DataFrame([*df.label], index=df.index).stack()).groupby(level=0).any()
res = res.reindex(list('abcde'), axis=1).fillna(0).astype(int)

#   a  b  c  d  e
#0  1  0  0  0  1
#1  1  0  0  1  0
#2  0  1  0  0  0
#3  0  0  0  1  1

edited Jul 08 '19 at 17:09

answered Jul 08 '19 at 17:05

ALollz

1

Thank you !! It worked when I used df.str.join('|').str.get_dummies() and then df.reindex(columns = ['a','b','c','d','e'], fill_value=0) – Bilal Alauddin Jul 09 '19 at 08:24
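The approach described in this comment could look roughly like the sketch below. It assumes `label` holds tuples of strings and that no label contains the '|' character, since that is the separator used for joining and splitting. The names `dummies` and `result` are illustrative.

```python
import pandas as pd

# Assumption: `label` is a column of tuples of strings.
df = pd.DataFrame({"label": [("a", "e"), ("a", "d"), ("b",), ("d", "e")]})
categories = ["a", "b", "c", "d", "e"]

# Join each tuple into one 'a|e'-style string, then split it back into dummies.
# '|' is the default separator of Series.str.get_dummies.
dummies = df["label"].str.join("|").str.get_dummies()

# Add the predefined columns that never occur in the data, filled with 0.
result = dummies.reindex(columns=categories, fill_value=0)

print(result)
```

Passing fill_value=0 to reindex keeps the new column as integers, so no fillna or astype step is needed afterwards.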

score 3 · Answer 2 · answered Jul 08 '19 at 17:08

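A good practice is to use sklearn's MultiLabelBinarizer:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

# one indicator column per label seen in the data
print(pd.DataFrame(mlb.fit_transform(df['label']), columns=mlb.classes_, index=df.index))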
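As written, the binarizer only learns the labels that actually occur, so the 'c' column from the question would be missing. One way to cover the predefined categories is the `classes` parameter of MultiLabelBinarizer, as in the sketch below. It assumes `label` holds tuples of strings, and the name `encoded` is illustrative.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Assumption: `label` is a column of tuples of strings.
df = pd.DataFrame({"label": [("a", "e"), ("a", "d"), ("b",), ("d", "e")]})

# Fix the set and order of output columns up front, including the unused 'c'.
mlb = MultiLabelBinarizer(classes=list("abcde"))

encoded = pd.DataFrame(
    mlb.fit_transform(df["label"]),
    columns=mlb.classes_,
    index=df.index,
)

print(encoded)
```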

BENY

I found the answer, thank you very much. I'll keep that in mind for next time. – Bilal Alauddin Jul 09 '19 at 08:26

score 2 · Answer 3 · answered Jul 08 '19 at 17:04

Try this:

df['label'].str.join(sep='*').str.get_dummies(sep='*')

Ankit Agrawal

Thank you! :) The only problem is that it won't create the 'c' column shown in my example. I found the solution; it's in my comment on the first answer. – Bilal Alauddin Jul 09 '19 at 08:25