
I have a dataframe where one column is a list of groups each of my users belongs to. Something like:

index groups  
0     ['a','b','c']
1     ['c']
2     ['b','c','e']
3     ['a','c']
4     ['b','e']

And what I would like to do is create a series of dummy columns to identify which groups each user belongs to, in order to run some analyses:

index  a   b   c   d   e
0      1   1   1   0   0
1      0   0   1   0   0
2      0   1   1   0   1
3      1   0   1   0   0
4      0   1   0   0   1


pd.get_dummies(df['groups'])

won't work because that just returns a column for each different list in my column.

The solution needs to be efficient as the dataframe will contain 500,000+ rows.

Sunderam Dubey
user2900369

6 Answers


Using s for your df['groups']:

In [21]: s = pd.Series({0: ['a', 'b', 'c'], 1:['c'], 2: ['b', 'c', 'e'], 3: ['a', 'c'], 4: ['b', 'e'] })

In [22]: s
Out[22]:
0    [a, b, c]
1          [c]
2    [b, c, e]
3       [a, c]
4       [b, e]
dtype: object

This is a possible solution:

In [23]: pd.get_dummies(s.explode()).groupby(level=0).sum()
Out[23]:
   a  b  c  e
0  1  1  1  0
1  0  0  1  0
2  0  1  1  1
3  1  0  1  0
4  0  1  0  1

The logic of this is:

  • .explode() flattens the series of lists into a series of single values (with the index keeping track of the original row number)
  • pd.get_dummies() creates the dummy columns
  • .groupby(level=0).sum() combines the rows that belong to the same user, by summing within groups defined by the index (level=0, i.e. the original row number)

Whether this will be efficient enough, I don't know, but in any case, if performance is important, storing lists in a dataframe is not a very good idea.
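Each step of that chain can be inspected on its own; a minimal sketch of the intermediates:

```python
import pandas as pd

s = pd.Series({0: ['a', 'b', 'c'], 1: ['c'], 2: ['b', 'c', 'e'],
               3: ['a', 'c'], 4: ['b', 'e']})

exploded = s.explode()                   # one row per (user, group) pair
dummies = pd.get_dummies(exploded)       # one indicator column per group
result = dummies.groupby(level=0).sum()  # collapse back to one row per user
```

Printing `exploded` shows the duplicated index entries (0, 0, 0, 1, 2, ...) that `groupby(level=0)` later uses to fold the dummy rows back together.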

Updates since original answer

  • Since version 0.25, s.explode() can be used to flatten the Series of lists, instead of the original s.apply(pd.Series).stack()

  • Since version 1.3.0, using the level keyword in aggregations is deprecated and will be removed in future versions, so it is recommended to use df.groupby(level=0).sum() instead of df.sum(level=0)

joris
  • What version of Pandas are you using? – Alex Mar 13 '15 at 15:48
  • @joris your probably meant this: `pd.get_dummies(s.apply(pd.Series), prefix='', prefix_sep='').sum(level=0, axis=1)` as your code outputs a series with sums not a dataframe. – Primer Mar 13 '15 at 15:49
  • Ah, sorry, the bracket was in the wrong place (the stack should be within the get_dummies). I am using pandas 0.15.2. @Primer Yes, I wrote that first, but I found it with stack a bit cleaner (shorter), but it gives exactly the same output. – joris Mar 13 '15 at 15:57
  • @Alex, you started with a different input (a string that formats as a list, I start from a list), but I am not sure what the OP wants. Apart from that, you did the `get_dummies` within the apply (so for each row instead of once on all), which made it slower than the approach above. – joris Mar 13 '15 at 16:15
  • @joris True - actually the quotes around the characters in OPs post make me think this may be the case... Undeleted. – Alex Mar 13 '15 at 16:23
  • @joris Also, agreed this is probably faster, but less memory efficient (as all the zeros must be stored for each of the indices before they are summed). – Alex Mar 13 '15 at 16:25

Very fast solution in case you have a large dataframe

Using sklearn.preprocessing.MultiLabelBinarizer

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame(
    {'groups':
        [['a','b','c'],
        ['c'],
        ['b','c','e'],
        ['a','c'],
        ['b','e']]
    }, columns=['groups'])

s = df['groups']

mlb = MultiLabelBinarizer()

pd.DataFrame(mlb.fit_transform(s), columns=mlb.classes_, index=df.index)

Result:

    a   b   c   e
0   1   1   1   0
1   0   0   1   0
2   0   1   1   1
3   1   0   1   0
4   0   1   0   1

Worked for me and was also suggested here and here
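As a side note, when the number of distinct groups is large, MultiLabelBinarizer can emit a SciPy sparse matrix instead of a dense array, which pandas can wrap without materializing all the zeros; a sketch, assuming scikit-learn (and its SciPy dependency) is installed:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({'groups': [['a', 'b', 'c'], ['c'], ['b', 'c', 'e'],
                              ['a', 'c'], ['b', 'e']]})

# sparse_output=True avoids building the dense 0/1 matrix up front,
# which helps when there are many distinct groups and 500,000+ rows
mlb = MultiLabelBinarizer(sparse_output=True)
matrix = mlb.fit_transform(df['groups'])

# wrap the sparse matrix in a DataFrame with sparse columns
dummies = pd.DataFrame.sparse.from_spmatrix(
    matrix, columns=mlb.classes_, index=df.index)
```

The resulting columns use pandas' sparse dtype, so memory scales with the number of 1s rather than with rows × groups.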

Teoretic

This is even faster: pd.get_dummies(df['groups'].explode()).sum(level=0)

Using .explode() instead of .apply(pd.Series).stack()

Comparing with the other solutions:

import timeit
import pandas as pd
setup = '''
import time
import pandas as pd
s = pd.Series({0:['a','b','c'],1:['c'],2:['b','c','e'],3:['a','c'],4:['b','e']})
df = s.rename('groups').to_frame()
'''
m1 = "pd.get_dummies(s.apply(pd.Series).stack()).sum(level=0)"
m2 = "df.groups.apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')"
m3 = "pd.get_dummies(df['groups'].explode()).sum(level=0)"
times = {f"m{i+1}":min(timeit.Timer(m, setup=setup).repeat(7, 1000)) for i, m in enumerate([m1, m2, m3])}
pd.DataFrame([times],index=['ms'])
#           m1        m2        m3
# ms  5.586517  3.821662  2.547167
RBA

Even though this question was answered, I have a faster solution:

df.groups.apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')

And, in case you have empty groups or NaN, you could just:

df.loc[df.groups.str.len() > 0, 'groups'].apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')

How it works

Inside the lambda, x is your list, for example ['a', 'b', 'c']. So pd.Series will be as follows:

In [2]: pd.Series([1, 1, 1], index=['a', 'b', 'c'])
Out[2]: 
a    1
b    1
c    1
dtype: int64

When all the pd.Series come together, they become a pd.DataFrame and their indexes become the columns; a missing index becomes a column filled with NaN, as you can see next:

In [4]: a = pd.Series([1, 1, 1], index=['a', 'b', 'c'])
In [5]: b = pd.Series([1, 1, 1], index=['a', 'b', 'd'])
In [6]: pd.DataFrame([a, b])
Out[6]: 
     a    b    c    d
0  1.0  1.0  1.0  NaN
1  1.0  1.0  NaN  1.0

Now fillna fills those NaN with 0:

In [7]: pd.DataFrame([a, b]).fillna(0)
Out[7]: 
     a    b    c    d
0  1.0  1.0  1.0  0.0
1  1.0  1.0  0.0  1.0

And downcast='infer' downcasts the result from float to int:

In [11]: pd.DataFrame([a, b]).fillna(0, downcast='infer')
Out[11]: 
   a  b  c  d
0  1  1  1  0
1  1  1  0  1

P.S.: The use of .fillna(0, downcast='infer') is optional.
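Putting the pieces together on the example frame; note that on recent pandas versions the downcast= argument of fillna is deprecated, so this sketch uses .astype(int) to get the same integer 0/1 columns:

```python
import pandas as pd

df = pd.DataFrame({'groups': [['a', 'b', 'c'], ['c'], ['b', 'c', 'e'],
                              ['a', 'c'], ['b', 'e']]})

# each list becomes a Series of 1s indexed by group name; stacking the
# Series into a DataFrame leaves NaN where a user lacks a group
dummies = (df.groups
             .apply(lambda x: pd.Series([1] * len(x), index=x))
             .fillna(0)
             .astype(int))
```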

Paulo Alves
  • I have tested your solution: it works like a charm. Would you mind commenting it further to explain exactly how it works ? – Mike Jun 12 '17 at 18:40
  • And to add a prefix to the columns use: `dummies.columns = ['D_'+col_name for col_name in dummies.columns]` – Ufos Nov 12 '17 at 23:06
  • @Ufos, you could just `.add_prefix('D_')` – Paulo Alves Nov 13 '17 at 10:19
  • @PauloAlves, ouch! – Ufos Nov 13 '17 at 12:57
  • @PauloAlves I tried your solution because the other one is too slow for my dataset, but I keep getting the following error: "InvalidIndexError: Reindexing only valid with uniquely valued Index objects". Do you have any idea of where that could come from ? In the case it comes from the index of the original dataframe, I already checked `df.index.is_unique` which outputs `True`. – atonnerre Nov 29 '17 at 12:46
  • @atonnerre I could reproduce the error as follows: `c = pd.Series([1, 1, 1], index=['a', 'a', 'd']); pd.DataFrame([a, b, c])`. How did you create your DataFrame? In my case `c.index.is_unique` is `True` and I have no idea why. – Paulo Alves Nov 29 '17 at 17:06
  • PS.: My bad, `c.index.is_unique` is `True` for `c = pd.Series([1, 1, 1], index=['a', 'c', 'd'])`, which is right. Sorry. – Paulo Alves Nov 29 '17 at 17:13
  • @PauloAlves Thanks for helping! I created this dataframe using `.groupby(groups)`. – atonnerre Nov 29 '17 at 17:47
  • @PauloAlves More specifically, I did `df = pd.DataFrame(prev_df.groupby("ID")["Group"].apply(list))`. Regarding `prev_df`, if ID1 belongs to Group1 and Group2, I have two rows, one where ID = ID1 and Group = Group1, and another where ID = ID1 and Group = Group2. Which is why I grouped them by IDs and listed the Groups to which each ID belongs. There certainly is a better way though. – atonnerre Nov 29 '17 at 17:55
  • @atonnerre Does it work for you `pd.DataFrame(df.groupby('ID')['Group'].apply(list)).Group.apply(pd.Series).fillna(0, downcast='infer')`? Where `df = pd.DataFrame(np.random.randint(0, 5, 30).reshape(15, 2), columns=['ID', 'Group'])`. – Paulo Alves Nov 30 '17 at 19:46

You can use str.join to join all the elements of each list in the series into a single delimited string, and then use str.get_dummies:

out = df.join(df['groups'].str.join('|').str.get_dummies())
print(out)

      groups  a  b  c  e
0  [a, b, c]  1  1  1  0
1        [c]  0  0  1  0
2  [b, c, e]  0  1  1  1
3     [a, c]  1  0  1  0
4     [b, e]  0  1  0  1
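The intermediate joined strings can be inspected directly; a quick sketch (note this assumes group names never contain the '|' separator):

```python
import pandas as pd

df = pd.DataFrame({'groups': [['a', 'b', 'c'], ['c'], ['b', 'c', 'e'],
                              ['a', 'c'], ['b', 'e']]})

joined = df['groups'].str.join('|')   # e.g. 'a|b|c' for the first row
dummies = joined.str.get_dummies()    # splits on '|' by default
```

If a group name could contain '|', pick a separator that cannot occur in the names and pass it to both str.join and str.get_dummies.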
Ynjxsjmh

You can use explode and crosstab:

s = pd.Series([['a', 'b', 'c'], ['c'], ['b', 'c', 'e'], ['a', 'c'], ['b', 'e']])

s = s.explode()
pd.crosstab(s.index, s)

Output:

col_0  a  b  c  e
row_0            
0      1  1  1  0
1      0  0  1  0
2      0  1  1  1
3      1  0  1  0
4      0  1  0  1
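If the col_0/row_0 axis labels in the output are unwanted, they can be dropped with rename_axis; a small sketch:

```python
import pandas as pd

s = pd.Series([['a', 'b', 'c'], ['c'], ['b', 'c', 'e'],
               ['a', 'c'], ['b', 'e']])

e = s.explode()
# crosstab counts (index, value) pairs; rename_axis clears the
# autogenerated row_0/col_0 axis names
out = pd.crosstab(e.index, e).rename_axis(index=None, columns=None)
```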
Mykola Zotko