
I have a large pandas DataFrame and I want to aggregate the columns differently: 24 columns (one per hour of the day) should be summed, while for all other columns I just want to take the maximum.

I know that I can write out the required aggregations manually like this:

df_agg = df.groupby('user_id').agg({'hour_0': 'sum',
                                    'hour_1': 'sum',
                                    ...
                                    'hour_23': 'sum',
                                    'all other columns': 'max'})

but I was wondering whether a more elegant solution exists, along the lines of:

df_agg = df.groupby('user_id').agg({'hour_*': 'sum',
                                    'all other columns != hour_*': 'max'})
mannaroth

1 Answer


You can build a dictionary mapping all the columns that start with hour to 'sum', map all the remaining columns to 'max' in a second dictionary, merge the two, and finally pass the result to agg:

c1 = df.columns[df.columns.str.startswith('hour')].tolist()
# also exclude the user_id grouping column, to avoid aggregating it with `max`
c2 = df.columns.difference(c1 + ['user_id'])
# merging dictionaries: https://stackoverflow.com/a/26853961
d = {**dict.fromkeys(c1, 'sum'), **dict.fromkeys(c2, 'max')}

df_agg = df.groupby('user_id').agg(d)
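A quick sanity check on hypothetical data (the column names `hour_0`, `hour_1`, and `clicks` here are made up for illustration; your frame would have all 24 hour columns):

```python
import pandas as pd

# Hypothetical example frame: two hour columns plus one "other" column
df = pd.DataFrame({
    'user_id': [1, 1, 2],
    'hour_0': [1, 2, 3],
    'hour_1': [4, 5, 6],
    'clicks': [10, 20, 30],
})

c1 = df.columns[df.columns.str.startswith('hour')].tolist()
c2 = df.columns.difference(c1 + ['user_id'])
d = {**dict.fromkeys(c1, 'sum'), **dict.fromkeys(c2, 'max')}

df_agg = df.groupby('user_id').agg(d)
# hour columns are summed per user, clicks takes the per-user maximum:
df_agg.loc[1, 'hour_0']  # → 3   (1 + 2)
df_agg.loc[1, 'hour_1']  # → 9   (4 + 5)
df_agg.loc[1, 'clicks']  # → 20  (max of 10, 20)
```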

Or you can call groupby twice and join the results with concat:

df_agg = pd.concat([df.groupby('user_id')[c1].sum(), 
                    df.groupby('user_id')[c2].max()], axis=1)
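On the same hypothetical data, the two approaches produce identical results (again, the column names are invented for the example):

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 2],
    'hour_0': [1, 2, 3],
    'hour_1': [4, 5, 6],
    'clicks': [10, 20, 30],
})

c1 = df.columns[df.columns.str.startswith('hour')].tolist()
c2 = df.columns.difference(c1 + ['user_id'])

# approach 1: one agg call with a merged dictionary
via_dict = df.groupby('user_id').agg(
    {**dict.fromkeys(c1, 'sum'), **dict.fromkeys(c2, 'max')})

# approach 2: two groupby calls, joined column-wise
via_concat = pd.concat([df.groupby('user_id')[c1].sum(),
                        df.groupby('user_id')[c2].max()], axis=1)

via_concat.equals(via_dict)  # → True
```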
jezrael
    This is exactly what I was looking for! Especially the second solution seems brief and very easy to understand at a quick glance. You're amazing as usual! :) – mannaroth Jul 20 '20 at 09:12