2

I have a dataframe of users who purchased various items. I want to breakout that list of values into separate columns and have a binary flag for users who purchased that item.

Input:

       A           B
0  James  [123, 456]
1   Mary       [123]
2   John  [456, 789]

Expected Output:

       A           B  123  456  789
0  James  [123, 456]    1    1    0
1   Mary       [123]    1    0    0
2   John  [456, 789]    0    1    1

What I've tried (step by step)

df['B'].explode() is my first step:

0    123
0    456
1    123
2    456
2    789

The using get_dummies() pd.get_dummies(df['B'].explode()):

   123  456  789
0    1    0    0
0    0    1    0
1    1    0    0
2    0    1    0
2    0    0    1

Join it together on index df.join(pd.get_dummies(df['B'].explode())):

       A           B  123  456  789
0  James  [123, 456]    1    0    0
0  James  [123, 456]    0    1    0
1   Mary       [123]    1    0    0
2   John  [456, 789]    0    1    0
2   John  [456, 789]    0    0    1

Problem:

Now I just need to groupby and combine. However, with millions and millions of rows and customers buying 100s of products, this method of joining/combining is highly inefficient. Is there a more "pandas-friendly" or built in function that does this?

MattR
  • 4,887
  • 9
  • 40
  • 67
  • for performance you can also try using [this solution](https://stackoverflow.com/a/51420716/9840637) using `MultiLabelBinarizer` – anky Apr 07 '20 at 14:19

1 Answers1

7

you can replace pd.get_dummies(df['B'].explode() with pd.get_dummies(df.B.explode()).sum(level=0) and join.

Quang Hoang
  • 146,074
  • 10
  • 56
  • 74