Drop duplicates on string based columns but aggregate relevant columns with values

Question

The data has duplicated values in column H1 & H2. Based on the combinations, I want to remove the duplicates but aggregate the values in column V1 (sum) & V2 (mean).

df_p = pd.DataFrame({'H1':['A1','A1','C1','B1','A1','C1'],
                     'H2':['X1','Y1','Z1','X1','Y1','Z1'],
                     'V1':[11,7,8,4,12,6],
                     'V2':[110,70,80,40,120,60]})
df_p.drop_duplicates(['H1','H2'])

The output is as follows, which ofcourse doesn't aggregate the values as I want.

    H1  H2  V1  V2
0   A1  X1  11  110
1   A1  Y1  7   70
2   C1  Z1  8   80
3   B1  X1  4   40

What I want is

    H1  H2  V1  V2
0   A1  X1  11  110
1   A1  Y1  19  95
2   C1  Z1  14  70
3   B1  X1  4   40

I know groupby/ pivottable could be options. But I have 50+ columns with 20+ str combinations & 30+ value columns that need to be aggregated. I don't know an effective way to process that without selectively going by each of them.

If you're aggregating, then you're not dropping duplicates. Without more details about what you want to do, we can't be sure of giving it, but this probably helps: https://stackoverflow.com/questions/14529838/apply-multiple-functions-to-multiple-groupby-columns — Acccumulation, Dec 21 '22 at 03:41

Osca · Answer 1 · 2022-12-22T02:30:27.810

0

You can combine groupby and aggregate. aggregate lets you do aggregation on multiple columns.

df_p.groupby(['H1', 'H2']).agg({'V1':'sum', 'V2':'mean'}).reset_index()

This returns the output you want.

Edit:

use fromkeys to pass a list of columns to aggregation

cols = ['V1', 'V2']
d = dict.fromkeys(cols,'sum') 
df.groupby(['H1', 'H2']).agg(d)

edited Dec 22 '22 at 02:30

answered Dec 21 '22 at 03:45

Osca

1,588
2
20
41

I know that's what I can do. But as I said, imagine in my data there are 20+ columns like H1 & 30+ columns like V1. It will be quite a task to define each of them . Is there a way I can pass a list in the key of the aggregating dictionary? like this: df_p.groupby(['H1','H2']).agg({['V1',"V2"]:'sum'}). That could make my job easier. – Rajib Lochan Sarkar Dec 21 '22 at 03:51
1

@RajibLochanSarkar The similar question was asked before https://stackoverflow.com/questions/67717440/use-a-list-of-column-names-in-groupby-agg – Osca Dec 21 '22 at 22:51
1

Thanks for the link - quite useful. However, your formkeys have not been applied correctly. As d was never called in the final aggregation & the dictionary couldn't capture the 'sum' value. It should have been like this: d = dict.fromkeys(cols,'sum') df_p.groupby(['H1', 'H2']).agg(d) – Rajib Lochan Sarkar Dec 22 '22 at 02:05
@RajibLochanSarkar good catch, I missed that bit – Osca Dec 22 '22 at 02:31

Drop duplicates on string based columns but aggregate relevant columns with values

1 Answers1