My issue
In SQL it is very easy to apply different aggregate functions to different columns, e.g.:
select item, sum(a) as [sum of a], avg(b) as [avg of b], min(c) as [min of c] from some_table group by item
In pandas, not so much. The solution provided here became deprecated:
df.groupby('qtr').agg({"realgdp": {"mean_gdp": "mean", "std_gdp": "std"},
                       "unemp": {"mean_unemp": "mean"}})
My solution
The least-bad solution I have managed to find, mostly based on other Stack Overflow questions I can no longer locate, is something like the toy example at the bottom, where I:
- define a function with all the calculations I need
- calculate each output column separately, then put them together in a Series (one per group; the Series index becomes the column names of the final dataframe)
- apply the function to each group via a lambda inside groupby().apply()
What I would like to improve: naming columns
If you only have 2 or 3 columns to create, this solution is great.
However, if you have many columns to calculate, naming them becomes fiddly and very error-prone: I have to create a list of column names and pass it as the index of the Series returned by the function.
Now imagine I already have 12 columns and need to add 3 more: there is a real chance I will get confused and put the corresponding column names in the wrong order (see the sketch below).
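To make the hazard concrete, this is the kind of two-place edit the current approach forces on me every time I add a statistic (a sketch based on the toy example at the bottom; the extra 'avg of d' column is just for illustration):

# inside func(): adding one more statistic means editing two lists in lock-step
b7 = x['d'].mean()                       # 1) add the calculation here...
cols = ['sum of a', '% of a', 'avg of b',
        'weighted avg of a, weighted by b',
        'sum of c', 'sum of d',
        'avg of d']                      # 2) ...and remember to add the matching name here, in the same position
return pd.Series([b1, b2, b3, b4, b5, b6, b7], index=cols)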
Compare this with SQL, where you assign the name right after defining the calculation - the difference is night and day.
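Ideally, I would be able to write something along these lines, with the name attached directly to each calculation, SQL-style (purely a sketch of the syntax I am after, not something I know to exist):

out = df.groupby('city').agg(
    sum_of_a=('a', 'sum'),    # name on the left, calculation on the right, defined together
    avg_of_b=('b', 'mean'),
)

Even then, I am not sure how a calculation that needs two columns at once, like the weighted average in the toy example below, would fit into such a scheme.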
Is there a better way? E.g. a way to assign the name of the column at the same time I define the calculation?
Why this is not a duplicate question
The focus of this question is specifically on how to name the columns so as to minimise the risk of errors and confusion. There are somewhat similar questions, but they rely on now-deprecated pandas functionality or their answers only name the columns automatically; to my knowledge, no question focuses on this very point.
Toy example
import pandas as pd
import numpy as np
df = pd.DataFrame(columns=['a', 'b', 'c', 'd'], data=np.random.rand(300, 4))
df['city'] = np.repeat(['London','New York','Buenos Aires'], 100)
def func(x, df):
    # func() gets called within a lambda; x is the sub-dataframe of one group (one city),
    # while df is the entire table, needed to compute each group's share of the overall total of 'a'
    b1 = x['a'].sum()
    b2 = x['a'].sum() / df['a'].sum() if df['a'].sum() != 0 else np.nan
    b3 = x['b'].mean()
    b4 = (x['a'] * x['b']).sum() / x['b'].sum() if x['b'].sum() > 0 else np.nan
    b5 = x['c'].sum()
    b6 = x['d'].sum()
    cols = ['sum of a',
            '% of a',
            'avg of b',
            'weighted avg of a, weighted by b',
            'sum of c',
            'sum of d']
    return pd.Series([b1, b2, b3, b4, b5, b6], index=cols)

out = df.groupby('city').apply(lambda x: func(x, df))
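For reference, the result is a dataframe indexed by city, with one column per name in cols (the values change on every run because the data is random):

print(out.columns.tolist())
# ['sum of a', '% of a', 'avg of b', 'weighted avg of a, weighted by b', 'sum of c', 'sum of d']
print(out.index.tolist())
# ['Buenos Aires', 'London', 'New York']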