0

I'm wondering if there's a better way to split a string column into multiple rows by 'comma' and send the column from long to wide. My current approach is only working to some extent.

data = {'id':["ab3e3", "psdds2", "pas13", "ccdf2", "dsda1"],
        'fruit':["apple, organge", "others", "dragon fruit, organge", "watermelon", "others"]}

df = pd.DataFrame(data)

lst_col = 'fruit'
x = df.assign(**{lst_col:df[lst_col].str.split(',')})
dt = x.explode('fruit')
dt['counts'] = 1
dt.pivot(index='id', columns='fruit', values = 'counts')

       id                  fruit
0   ab3e3         apple, organge
1  psdds2                 others
2   pas13  dragon fruit, organge
3   ccdf2             watermelon
4   dsda1                 others

Expected output:

id        organge  apple  dragon fruit  others  watermelon
ab3e3         1      1             0       0           0 
ccdf2         0      0             0       0           1
dsda1         0      0             0       1           0 
pas13         1      0             1       0           0 
psdds2        0      0             0       1           0 
codedancer
  • 1,504
  • 9
  • 20

2 Answers2

1

You can use pd.Series.str.split and crosstab:

s = df.assign(fruit=df.fruit.str.split(", ")).explode("fruit")
print (pd.crosstab(s.id, s.fruit))

fruit   apple  dragon fruit  organge  others  watermelon
id                                                      
ab3e3       1             0        1       0           0
ccdf2       0             0        0       0           1
dsda1       0             0        0       1           0
pas13       0             1        1       0           0
psdds2      0             0        0       1           0
Henry Yik
  • 22,275
  • 4
  • 18
  • 40
1

Use .str.get_dummies() to get a dummy table from the fruit column. Then, join df with this dummy table, as follows:

df.drop('fruit', axis=1).join(df['fruit'].str.get_dummies(', '))

Result:

       id  apple  dragon fruit  organge  others  watermelon
0   ab3e3      1             0        1       0           0
1  psdds2      0             0        0       1           0
2   pas13      0             1        1       0           0
3   ccdf2      0             0        0       0           1
4   dsda1      0             0        0       1           0
SeaBean
  • 22,547
  • 3
  • 13
  • 25
  • Thanks @SeaBean. Always like your approach. I was going to ask you again about a solution you submitted in my question a while ago. There're some edge cases I cannot overcome and I'm wondering if you see there's a way to overcome. https://stackoverflow.com/questions/69604830/join-string-columns-and-remove-duplicates-and-others-in-pandas – codedancer Oct 17 '21 at 13:35
  • @codedancer I've posted an answer to your new post. Take a look :-) – SeaBean Oct 17 '21 at 14:15