
I have a dataframe where one column is a list of groups each of my users belongs to. Something like:

index groups  
0     ['a','b','c']
1     ['c']
2     ['b','c','e']
3     ['a','c']
4     ['b','e']

And what I would like to do is create a series of dummy columns to identify which groups each user belongs to, in order to run some analyses:

index  a   b   c   d   e
0      1   1   1   0   0
1      0   0   1   0   0
2      0   1   1   0   1
3      1   0   1   0   0
4      0   1   0   0   1


pd.get_dummies(df['groups'])

won't work because that just returns a column for each different list in my column.

The solution needs to be efficient as the dataframe will contain 500,000+ rows.

Sunderam Dubey
user2900369

6 Answers


Using s for your df['groups']:

In [21]: s = pd.Series({0: ['a', 'b', 'c'], 1:['c'], 2: ['b', 'c', 'e'], 3: ['a', 'c'], 4: ['b', 'e'] })

In [22]: s
Out[22]:
0    [a, b, c]
1          [c]
2    [b, c, e]
3       [a, c]
4       [b, e]
dtype: object

This is a possible solution:

In [23]: pd.get_dummies(s.explode()).groupby(level=0).sum()
Out[23]:
   a  b  c  e
0  1  1  1  0
1  0  0  1  0
2  0  1  1  1
3  1  0  1  0
4  0  1  0  1

The logic of this is:

  • .explode() flattens the series of lists into a series of single values (with the index keeping track of the original row number)
  • pd.get_dummies() creates the dummy columns
  • .groupby(level=0).sum() combines the rows that belong to the same user, by summing within groups defined by the index (level=0, i.e. the original row number)

Whether this will be efficient enough, I don't know, but in any case, if performance is important, storing lists in a dataframe is not a very good idea.
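Each step of that chain can be inspected on its own; a minimal sketch of the intermediates:

```python
import pandas as pd

s = pd.Series({0: ['a', 'b', 'c'], 1: ['c'], 2: ['b', 'c', 'e'],
               3: ['a', 'c'], 4: ['b', 'e']})

exploded = s.explode()                   # one row per (user, group) pair
dummies = pd.get_dummies(exploded)       # one indicator column per group
result = dummies.groupby(level=0).sum()  # collapse back to one row per user
```

Printing `exploded` shows the duplicated index entries (0, 0, 0, 1, 2, ...) that `groupby(level=0)` later uses to fold the dummy rows back together.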

Updates since original answer

  • Since version 0.25, s.explode() can be used to flatten the Series of lists, instead of the original s.apply(pd.Series).stack()

  • Since version 1.3.0, using the level keyword in aggregations is deprecated and will be removed in future versions, so it is recommended to use df.groupby(level=0).sum() instead of df.sum(level=0)

joris
  • What version of Pandas are you using? – Alex Mar 13 '15 at 15:48
  • @joris your probably meant this: `pd.get_dummies(s.apply(pd.Series), prefix='', prefix_sep='').sum(level=0, axis=1)` as your code outputs a series with sums not a dataframe. – Primer Mar 13 '15 at 15:49
  • Ah, sorry, the bracket was in the wrong place (the stack should be within the get_dummies). I am using pandas 0.15.2. @Primer Yes, I wrote that first, but I found it with stack a bit cleaner (shorter), but it gives exactly the same output. – joris Mar 13 '15 at 15:57
  • @Alex, you started with a different input (a string that formats as a list, I start from a list), but I am not sure what the OP wants. Apart from that, you did the `get_dummies` within the apply (so for each row instead of once on all), which made it slower than the approach above. – joris Mar 13 '15 at 16:15
  • @joris True - actually the quotes around the characters in OPs post make me think this may be the case... Undeleted. – Alex Mar 13 '15 at 16:23
  • @joris Also, agreed this is probably faster, but less memory efficient (as all the zeros must be stored for each of the indices before they are summed). – Alex Mar 13 '15 at 16:25

Very fast solution in case you have a large dataframe

Using sklearn.preprocessing.MultiLabelBinarizer

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame(
    {'groups':
        [['a','b','c'],
        ['c'],
        ['b','c','e'],
        ['a','c'],
        ['b','e']]
    }, columns=['groups'])

s = df['groups']

mlb = MultiLabelBinarizer()

pd.DataFrame(mlb.fit_transform(s), columns=mlb.classes_, index=df.index)

Result:

    a   b   c   e
0   1   1   1   0
1   0   0   1   0
2   0   1   1   1
3   1   0   1   0
4   0   1   0   1

Worked for me and was also suggested here and here
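As a side note, when the number of distinct groups is large, MultiLabelBinarizer can emit a SciPy sparse matrix instead of a dense array, which pandas can wrap without materializing all the zeros; a sketch, assuming scikit-learn (and its SciPy dependency) is installed:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({'groups': [['a', 'b', 'c'], ['c'], ['b', 'c', 'e'],
                              ['a', 'c'], ['b', 'e']]})

# sparse_output=True avoids building the dense 0/1 matrix up front,
# which helps when there are many distinct groups and 500,000+ rows
mlb = MultiLabelBinarizer(sparse_output=True)
matrix = mlb.fit_transform(df['groups'])

# wrap the sparse matrix in a DataFrame with sparse columns
dummies = pd.DataFrame.sparse.from_spmatrix(
    matrix, columns=mlb.classes_, index=df.index)
```

The resulting columns use pandas' sparse dtype, so memory scales with the number of 1s rather than with rows × groups.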

Teoretic

This is even faster: pd.get_dummies(df['groups'].explode()).sum(level=0)

Using .explode() instead of .apply(pd.Series).stack()

Comparing with the other solutions:

import timeit
import pandas as pd
setup = '''
import time
import pandas as pd
s = pd.Series({0:['a','b','c'],1:['c'],2:['b','c','e'],3:['a','c'],4:['b','e']})
df = s.rename('groups').to_frame()
'''
m1 = "pd.get_dummies(s.apply(pd.Series).stack()).sum(level=0)"
m2 = "df.groups.apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')"
m3 = "pd.get_dummies(df['groups'].explode()).sum(level=0)"
times = {f"m{i+1}":min(timeit.Timer(m, setup=setup).repeat(7, 1000)) for i, m in enumerate([m1, m2, m3])}
pd.DataFrame([times],index=['ms'])
#           m1        m2        m3
# ms  5.586517  3.821662  2.547167
RBA

Even though this question was answered, I have a faster solution:

df.groups.apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')

And, in case you have empty groups or NaN, you could just:

df.loc[df.groups.str.len() > 0, 'groups'].apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')

How it works

Inside the lambda, x is your list, for example ['a', 'b', 'c']. So pd.Series will be as follows:

In [2]: pd.Series([1, 1, 1], index=['a', 'b', 'c'])
Out[2]: 
a    1
b    1
c    1
dtype: int64

When all the pd.Series come together, they become a pd.DataFrame and their indexes become the columns; a missing index becomes a column filled with NaN, as you can see next:

In [4]: a = pd.Series([1, 1, 1], index=['a', 'b', 'c'])
In [5]: b = pd.Series([1, 1, 1], index=['a', 'b', 'd'])
In [6]: pd.DataFrame([a, b])
Out[6]: 
     a    b    c    d
0  1.0  1.0  1.0  NaN
1  1.0  1.0  NaN  1.0

Now fillna fills those NaN with 0:

In [7]: pd.DataFrame([a, b]).fillna(0)
Out[7]: 
     a    b    c    d
0  1.0  1.0  1.0  0.0
1  1.0  1.0  0.0  1.0

And downcast='infer' downcasts the result from float to int:

In [11]: pd.DataFrame([a, b]).fillna(0, downcast='infer')
Out[11]: 
   a  b  c  d
0  1  1  1  0
1  1  1  0  1

P.S.: The use of .fillna(0, downcast='infer') is optional.
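Putting the pieces together on the example frame; note that on recent pandas versions the downcast= argument of fillna is deprecated, so this sketch uses .astype(int) to get the same integer 0/1 columns:

```python
import pandas as pd

df = pd.DataFrame({'groups': [['a', 'b', 'c'], ['c'], ['b', 'c', 'e'],
                              ['a', 'c'], ['b', 'e']]})

# each list becomes a Series of 1s indexed by group name; stacking the
# Series into a DataFrame leaves NaN where a user lacks a group
dummies = (df.groups
             .apply(lambda x: pd.Series([1] * len(x), index=x))
             .fillna(0)
             .astype(int))
```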

Paulo Alves
  • I have tested your solution: it works like a charm. Would you mind commenting it further to explain exactly how it works ? – Mike Jun 12 '17 at 18:40
  • And to add a prefix to the columns use: `dummies.columns = ['D_'+col_name for col_name in dummies.columns]` – Ufos Nov 12 '17 at 23:06
  • @Ufos, you could just `.add_prefix('D_')` – Paulo Alves Nov 13 '17 at 10:19
  • @PauloAlves, ouch! – Ufos Nov 13 '17 at 12:57
  • @PauloAlves I tried your solution because the other one is too slow for my dataset, but I keep getting the following error: "InvalidIndexError: Reindexing only valid with uniquely valued Index objects". Do you have any idea of where that could come from ? In the case it comes from the index of the original dataframe, I already checked `df.index.is_unique` which outputs `True`. – atonnerre Nov 29 '17 at 12:46
  • @atonnerre I could reproduce the error as follows: `c = pd.Series([1, 1, 1], index=['a', 'a', 'd']); pd.DataFrame([a, b, c])`. How did you create your DataFrame? In my case `c.index.is_unique` is `True` and I have no idea why. – Paulo Alves Nov 29 '17 at 17:06
  • PS.: My bad, `c.index.is_unique` is `True` for `c = pd.Series([1, 1, 1], index=['a', 'c', 'd'])`, which is right. Sorry. – Paulo Alves Nov 29 '17 at 17:13
  • @PauloAlves Thanks for helping! I created this dataframe using `.groupby(groups)`. – atonnerre Nov 29 '17 at 17:47
  • @PauloAlves More specifically, I did `df = pd.DataFrame(prev_df.groupby("ID")["Group"].apply(list))`. Regarding `prev_df`, if ID1 belongs to Group1 and Group2, I have two rows, one where ID = ID1 and Group = Group1, and another where ID = ID1 and Group = Group2. Which is why I grouped them by IDs and listed the Groups to which each ID belongs. There certainly is a better way though. – atonnerre Nov 29 '17 at 17:55
  • @atonnerre Does it work for you `pd.DataFrame(df.groupby('ID')['Group'].apply(list)).Group.apply(pd.Series).fillna(0, downcast='infer')`? Where `df = pd.DataFrame(np.random.randint(0, 5, 30).reshape(15, 2), columns=['ID', 'Group'])`. – Paulo Alves Nov 30 '17 at 19:46

You can use str.join to join all the elements of each list in the series into a single delimited string, and then use str.get_dummies:

out = df.join(df['groups'].str.join('|').str.get_dummies())
print(out)

      groups  a  b  c  e
0  [a, b, c]  1  1  1  0
1        [c]  0  0  1  0
2  [b, c, e]  0  1  1  1
3     [a, c]  1  0  1  0
4     [b, e]  0  1  0  1
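The intermediate joined strings can be inspected directly; a quick sketch (note this assumes group names never contain the '|' separator):

```python
import pandas as pd

df = pd.DataFrame({'groups': [['a', 'b', 'c'], ['c'], ['b', 'c', 'e'],
                              ['a', 'c'], ['b', 'e']]})

joined = df['groups'].str.join('|')   # e.g. 'a|b|c' for the first row
dummies = joined.str.get_dummies()    # splits on '|' by default
```

If a group name could contain '|', pick a separator that cannot occur in the names and pass it to both str.join and str.get_dummies.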
Ynjxsjmh

You can use explode and crosstab:

s = pd.Series([['a', 'b', 'c'], ['c'], ['b', 'c', 'e'], ['a', 'c'], ['b', 'e']])

s = s.explode()
pd.crosstab(s.index, s)

Output:

col_0  a  b  c  e
row_0            
0      1  1  1  0
1      0  0  1  0
2      0  1  1  1
3      1  0  1  0
4      0  1  0  1
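If the col_0/row_0 axis labels in the output are unwanted, they can be dropped with rename_axis; a small sketch:

```python
import pandas as pd

s = pd.Series([['a', 'b', 'c'], ['c'], ['b', 'c', 'e'],
               ['a', 'c'], ['b', 'e']])

e = s.explode()
# crosstab counts (index, value) pairs; rename_axis clears the
# autogenerated row_0/col_0 axis names
out = pd.crosstab(e.index, e).rename_axis(index=None, columns=None)
```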
Mykola Zotko