
I have a large pandas DataFrame and I want to aggregate the columns differently: 24 columns (one per hour of the day) should be summed, while for all other columns I just want to take the maximum.

I know that I can write out the required aggregations manually like this:

df_agg = df.groupby('user_id').agg({'hour_0': 'sum',
                                    'hour_1': 'sum',
                                    ...
                                    'hour_23': 'sum',
                                    'all other columns': 'max'})

but I was wondering whether a more elegant solution exists, along the lines of:

df_agg = df.groupby('user_id').agg({'hour_*': 'sum',
                                    'all other columns != hour_*': 'max'})
mannaroth

1 Answer


You can build a dictionary mapping all the columns that start with hour to 'sum', map all the remaining columns to 'max' in a second dictionary, merge the two, and finally pass the result to agg:

c1 = df.columns[df.columns.str.startswith('hour')].tolist()
# also exclude the user_id grouping column, to avoid aggregating it with `max`
c2 = df.columns.difference(c1 + ['user_id'])
# merging dictionaries: https://stackoverflow.com/a/26853961
d = {**dict.fromkeys(c1, 'sum'), **dict.fromkeys(c2, 'max')}

df_agg = df.groupby('user_id').agg(d)
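A quick sanity check on hypothetical data (the column names `hour_0`, `hour_1`, and `clicks` here are made up for illustration; your frame would have all 24 hour columns):

```python
import pandas as pd

# Hypothetical example frame: two hour columns plus one "other" column
df = pd.DataFrame({
    'user_id': [1, 1, 2],
    'hour_0': [1, 2, 3],
    'hour_1': [4, 5, 6],
    'clicks': [10, 20, 30],
})

c1 = df.columns[df.columns.str.startswith('hour')].tolist()
c2 = df.columns.difference(c1 + ['user_id'])
d = {**dict.fromkeys(c1, 'sum'), **dict.fromkeys(c2, 'max')}

df_agg = df.groupby('user_id').agg(d)
# hour columns are summed per user, clicks takes the per-user maximum:
df_agg.loc[1, 'hour_0']  # → 3   (1 + 2)
df_agg.loc[1, 'hour_1']  # → 9   (4 + 5)
df_agg.loc[1, 'clicks']  # → 20  (max of 10, 20)
```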

Or you can call groupby twice and join the results with concat:

df_agg = pd.concat([df.groupby('user_id')[c1].sum(), 
                    df.groupby('user_id')[c2].max()], axis=1)
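On the same hypothetical data, the two approaches produce identical results (again, the column names are invented for the example):

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 2],
    'hour_0': [1, 2, 3],
    'hour_1': [4, 5, 6],
    'clicks': [10, 20, 30],
})

c1 = df.columns[df.columns.str.startswith('hour')].tolist()
c2 = df.columns.difference(c1 + ['user_id'])

# approach 1: one agg call with a merged dictionary
via_dict = df.groupby('user_id').agg(
    {**dict.fromkeys(c1, 'sum'), **dict.fromkeys(c2, 'max')})

# approach 2: two groupby calls, joined column-wise
via_concat = pd.concat([df.groupby('user_id')[c1].sum(),
                        df.groupby('user_id')[c2].max()], axis=1)

via_concat.equals(via_dict)  # → True
```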
jezrael
    This is exactly what I was looking for! Especially the second solution seems brief and very easy to understand at a quick glance. You're amazing as usual! :) – mannaroth Jul 20 '20 at 09:12