Complex / multi-column operations on groupby

Question

I would like to do multicolumn operations (ie correlate below) as well as operations that use results on previous calculations (ie diff calculation below) without using a for loop and using native pandas functions like groupby and agg. Is this possible?

import pandas as pd
import datetime
import numpy as np

np.random.seed(0)
df = pd.DataFrame({'date': [datetime.datetime(2010,1,1)+datetime.timedelta(days=i*15) 
                            for i in range(0,100)],
                   'invested': np.random.random(100)*1e6,
                   'return': np.random.random(100),
                   'side': np.random.choice([-1, 1], 100)})

df['year'] = df['date'].apply(lambda x: x.year)

# want to get rid of the for loop below
ret_year = []
for year in list(list(df['year'].unique())):
    df_this_year = df[df['year'] == year]
    min_short = df_this_year[df_this_year['side'] == -1]['return'].max()
    min_long = df_this_year[df_this_year['side'] == -1]['return'].min()
    min_diff = min_long - min_short
    avg_inv = df_this_year['invested'].mean()
    corr = np.correlate(df_this_year['invested'], df_this_year['return'])[0]
    ret_year.append({'year': year, 'min_short': min_short, 'min_long': min_long,
                     'min_diff': min_diff, 'avg_inv': avg_inv, 'corr': corr})

print(pd.DataFrame(ret_year))

Result:

         avg_inv          corr  min_diff  min_long  min_short  year
0  590766.254452  8.821215e+06 -0.664752  0.297437   0.962189  2010
1  490224.532564  6.122306e+06 -0.900289  0.019193   0.919483  2011
2  438330.806563  4.768964e+06 -0.929680  0.069167   0.998847  2012
3  373038.880789  4.677380e+06 -0.779678  0.164694   0.944372  2013
4  416817.752705  5.014249e+04  0.000000  0.434417   0.434417  2014

Here are some similar questions but not quite the same:

score 3 · Accepted Answer · answered Jan 19 '17 at 15:27

Instead of iterating through a for loop, take advantage of pandas groupby + apply. You can make things easier on yourself by putting the date column into the index and group by year with pd.TimeGrouper('A') - 'A' is the pandas date offset alias for year.

def calculate(x):
    min_short = x.loc[x['side'] == -1, 'return'].max()
    min_long = x.loc[x['side'] == -1, 'return'].min()
    min_diff = min_long - min_short
    avg_inv = x['invested'].mean()
    corr = np.correlate(x['invested'], x['return'])[0]
    return pd.Series([avg_inv, corr, min_diff, min_long, min_short], 
                     index=['avg_inv','corr','min_diff','min_long','min_short'])

df.groupby(pd.TimeGrouper('A')).apply(calculate).to_period('A')


            avg_inv          corr  min_diff  min_long  min_short
date                                                            
2010  590766.254452  8.821215e+06 -0.664752  0.297437   0.962189
2011  490224.532564  6.122306e+06 -0.900289  0.019193   0.919483
2012  438330.806563  4.768964e+06 -0.929680  0.069167   0.998847
2013  373038.880789  4.677380e+06 -0.779678  0.164694   0.944372
2014  416817.752705  5.014249e+04  0.000000  0.434417   0.434417

Nice, thanks for the answer. So `agg` would not be sufficient / relevant here because I'm using multiple columns in some of the calcs? — ashishsingal, Jan 19 '17 at 15:38
Yes, correct. `agg` only operates on one column at a time. It sends a Series to the calling function. — Ted Petrou, Jan 19 '17 at 15:39

Complex / multi-column operations on groupby

1 Answers1