panda group on set column and get corresponding max value from another column

Question

My data is as follow :

 set_name |  cases | label

0 {a,b,c}    7        'a b c'
1 {j,i,a}    2        'j i a'
2 {a,c}      1        'a c'
3 {b,c,a}    2        'b c a'

i would like to groupby on set_name with sum on cases . but - i need the label to correspond to the row with max cases . so the output will look like :

 set_name |  cases | label

0 {a,b,c}    9        'a b c'
1 {j,i,a}    2        'j i a'
2 {a,c}      1        'a c'

Since both {a,b,c} and {b,c,a} are equal (they are set) i want to sum them up. i've tried using the solution suggested here but since groupby cant work on sets i cant proceed with it

jezrael · Accepted Answer · 2020-04-08T08:02:14.087

2

Becuase sets are unhashable one possible solution is use frozensets and aggregate by GroupBy.agg:

df1 = (df.sort_values(by='cases', ascending=False)
         .groupby(df['set_name'].apply(frozenset))
         .agg({'set_name': 'first',
               'cases':'sum',
               'label':'first'})
         .reset_index(drop=True))
print (df1)
    set_name  cases  label
0  {a, c, b}      9  a b c
1  {i, j, a}      2  j i a
2     {a, c}      1    a c

edited Apr 08 '20 at 08:02

answered Apr 08 '20 at 07:43

jezrael

822,522
95
1,334
1,252

1

i think i need to add df.sort_values(by='cases', ascending=False) before so the 'first" in agg will take the label correspond to the one with the max cases – Latent Apr 08 '20 at 08:01

score 1 · Answer 2 · answered Apr 08 '20 at 07:53

1

(
    df.sort_values(by='cases', ascending=False)
    .groupby(df.set_name.apply(sorted).apply(str))
    .agg(set_name=('set_name', 'first'),
         cases=('cases', 'sum'),
         lalbel=('label', 'first'),
        )
    .reset_index(drop=True)
)

answered Apr 08 '20 at 07:53

Allen Qin

19,507
8
51
67

panda group on set column and get corresponding max value from another column

2 Answers2