1

My data is as follow :

 set_name |  cases | label

0 {a,b,c}    7        'a b c'
1 {j,i,a}    2        'j i a'
2 {a,c}      1        'a c'
3 {b,c,a}    2        'b c a'

i would like to groupby on set_name with sum on cases . but - i need the label to correspond to the row with max cases . so the output will look like :

 set_name |  cases | label

0 {a,b,c}    9        'a b c'
1 {j,i,a}    2        'j i a'
2 {a,c}      1        'a c'

Since both {a,b,c} and {b,c,a} are equal (they are set) i want to sum them up. i've tried using the solution suggested here but since groupby cant work on sets i cant proceed with it

Latent
  • 556
  • 1
  • 9
  • 23

2 Answers2

2

Becuase sets are unhashable one possible solution is use frozensets and aggregate by GroupBy.agg:

df1 = (df.sort_values(by='cases', ascending=False)
         .groupby(df['set_name'].apply(frozenset))
         .agg({'set_name': 'first',
               'cases':'sum',
               'label':'first'})
         .reset_index(drop=True))
print (df1)
    set_name  cases  label
0  {a, c, b}      9  a b c
1  {i, j, a}      2  j i a
2     {a, c}      1    a c
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • 1
    i think i need to add df.sort_values(by='cases', ascending=False) before so the 'first" in agg will take the label correspond to the one with the max cases – Latent Apr 08 '20 at 08:01
1
(
    df.sort_values(by='cases', ascending=False)
    .groupby(df.set_name.apply(sorted).apply(str))
    .agg(set_name=('set_name', 'first'),
         cases=('cases', 'sum'),
         lalbel=('label', 'first'),
        )
    .reset_index(drop=True)
)
Allen Qin
  • 19,507
  • 8
  • 51
  • 67