0

The data has duplicated values in column H1 & H2. Based on the combinations, I want to remove the duplicates but aggregate the values in column V1 (sum) & V2 (mean).

df_p = pd.DataFrame({'H1':['A1','A1','C1','B1','A1','C1'],
                     'H2':['X1','Y1','Z1','X1','Y1','Z1'],
                     'V1':[11,7,8,4,12,6],
                     'V2':[110,70,80,40,120,60]})
df_p.drop_duplicates(['H1','H2'])

The output is as follows, which ofcourse doesn't aggregate the values as I want.

    H1  H2  V1  V2
0   A1  X1  11  110
1   A1  Y1  7   70
2   C1  Z1  8   80
3   B1  X1  4   40

What I want is

    H1  H2  V1  V2
0   A1  X1  11  110
1   A1  Y1  19  95
2   C1  Z1  14  70
3   B1  X1  4   40

I know groupby/ pivottable could be options. But I have 50+ columns with 20+ str combinations & 30+ value columns that need to be aggregated. I don't know an effective way to process that without selectively going by each of them.

  • If you're aggregating, then you're not dropping duplicates. Without more details about what you want to do, we can't be sure of giving it, but this probably helps: https://stackoverflow.com/questions/14529838/apply-multiple-functions-to-multiple-groupby-columns – Acccumulation Dec 21 '22 at 03:41

1 Answers1

0

You can combine groupby and aggregate. aggregate lets you do aggregation on multiple columns.

df_p.groupby(['H1', 'H2']).agg({'V1':'sum', 'V2':'mean'}).reset_index()

This returns the output you want.

Edit:

use fromkeys to pass a list of columns to aggregation

cols = ['V1', 'V2']
d = dict.fromkeys(cols,'sum') 
df.groupby(['H1', 'H2']).agg(d)
Osca
  • 1,588
  • 2
  • 20
  • 41
  • I know that's what I can do. But as I said, imagine in my data there are 20+ columns like H1 & 30+ columns like V1. It will be quite a task to define each of them . Is there a way I can pass a list in the key of the aggregating dictionary? like this: df_p.groupby(['H1','H2']).agg({['V1',"V2"]:'sum'}). That could make my job easier. – Rajib Lochan Sarkar Dec 21 '22 at 03:51
  • 1
    @RajibLochanSarkar The similar question was asked before https://stackoverflow.com/questions/67717440/use-a-list-of-column-names-in-groupby-agg – Osca Dec 21 '22 at 22:51
  • 1
    Thanks for the link - quite useful. However, your formkeys have not been applied correctly. As d was never called in the final aggregation & the dictionary couldn't capture the 'sum' value. It should have been like this: d = dict.fromkeys(cols,'sum') df_p.groupby(['H1', 'H2']).agg(d) – Rajib Lochan Sarkar Dec 22 '22 at 02:05
  • @RajibLochanSarkar good catch, I missed that bit – Osca Dec 22 '22 at 02:31