Split pandas column and create new columns that count the split values

Question

I have a goofy data where one column contains multiple values slammed together with a comma:

In [62]: df = pd.DataFrame({'U': ['foo', 'bar', 'baz'], 'V': ['a,b,a,c,d', 'a,b,c', 'd,e']})                                     

In [63]: df                                                                                                                      
Out[63]: 
     U          V
0  foo  a,b,a,c,d
1  bar      a,b,c
2  baz        d,e

Now I want to split column V, drop it, and add columns a through e. Columns a through e should contains the count of the occurrences of that letter in that row:

In [62]: df = pd.DataFrame({'U': ['foo', 'bar', 'baz'], 'V': ['a,b,a,c,d', 'a,b,c', 'd,e']})                                     

In [63]: df                                                                                                                      
Out[63]: 
     U  a  b  c  d  e
0  foo  2  1  1  1  0
1  bar  1  1  1  0  0
2  baz  0  0  0  1  1

Maybe some combination of df['V'].str.split(',') and pandas.get_dummies but I can't quite work it out.

Edit: apparently I have to justify why my question is not a duplicate. I think why is intuitively obvious to the most casual observer.

Possible duplicate of [Pandas split column of lists into multiple columns](https://stackoverflow.com/questions/35491274/pandas-split-column-of-lists-into-multiple-columns) — G. Anderson, Jul 16 '19 at 21:56

BENY · Accepted Answer · 2019-07-16T22:14:20.340

5

This is str.get_dummies

pd.concat([df,df.pop('V').str.split(',',expand=True).stack().str.get_dummies().sum(level=0)],1)
Out[602]: 
     U  a  b  c  d  e
0  foo  2  1  1  1  0
1  bar  1  1  1  0  0
2  baz  0  0  0  1  1

edited Jul 16 '19 at 22:14

answered Jul 16 '19 at 21:56

BENY

317,841
20
164
234

1

Once again this answer shows your deep knowledge of pandas. Great answer! Cant upvote this enough – Erfan Jul 16 '19 at 22:03
You can add a `.drop("V",1)` to match the desired output exactly. – pault Jul 16 '19 at 22:05
@pault using pop :-) – BENY Jul 16 '19 at 22:14
1

All in one line. Ouch! – Paul Coccoli Jul 17 '19 at 03:06

benvc · Answer 2 · 2019-07-17T16:25:45.720

1

You could just use pandas.Series.str.count. For example:

import pandas as pd

df = pd.DataFrame({'U': ['foo', 'bar', 'baz'], 'V': ['a,b,a,c,d', 'a,b,c', 'd,e']})

columns = ['a', 'b', 'c', 'd', 'e']
# If unknown or large set of columns, then replace above with:
# columns = sorted(set(df['V'].str.split(',').sum()))

for column in columns:
    df[column] = df['V'].str.count(column)

print(df)
#      U          V  a  b  c  d  e
# 0  foo  a,b,a,c,d  2  1  1  1  0
# 1  bar      a,b,c  1  1  1  0  0
# 2  baz        d,e  0  0  0  1  1

edited Jul 17 '19 at 16:25

answered Jul 16 '19 at 22:03

benvc

14,448
4
33
54

Only works if you know the possible values in advance – pault Jul 16 '19 at 22:06
@pault - you are right. If column values are unknown, then you could replace the list in the for loop with the result of something like `sorted(set(df['V'].str.split(',').sum()))` instead. – benvc Jul 16 '19 at 22:15
I do happen to know all possible values, but that seems tedious. The version in the comments is nice, though! – Paul Coccoli Jul 17 '19 at 03:08

Split pandas column and create new columns that count the split values

2 Answers2

Linked