
I have a dataframe where I take the subset of numeric columns, calculate the 5-day rolling average for each numeric column, and add it as a new column to the df.

This approach works but currently takes quite a long time (8 seconds per column). I'm wondering if there is a better way to do this.

A working toy example of what I'm doing currently:

import pandas as pd

data = {'Group': ['A','A','A','A','A','A','B','B','B','B','B','C','C','C','C','C'],
        'Year' : ['2017', '2017', '2017', '2018', '2018', '2018', '2017', '2017', '2018', '2018', '2017', '2017', '2017', '2017', '2018', '2018'],
        'Score 1' : [1,2,3,4,5,6,1,2,3,4,5,1,2,3,4,5],
        'Score 2': [1,4,5,7,5,5,6,1,4,5,6,7,4,6,4,6]}
df = pd.DataFrame(data)

# For each score column: within each (Year, Group), take the rolling mean
# (window 2, min_periods 1), shift it down one row, and backfill the leading NaN
for col in ['Score 1', 'Score 2']:
    df[col + '_avg'] = df.groupby(['Year', 'Group'])[col].apply(lambda x: x.rolling(2, 1).mean().shift().bfill())
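
On the real data the score columns aren't hardcoded; something like `select_dtypes` can build that list (a small sketch, with `num_cols` as an illustrative name):

num_cols = df.select_dtypes(include='number').columns   # the numeric-only subset mentioned above

and then loop over `num_cols` instead of the hardcoded list.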
    The `groupby` function is very "expensive". Try to use [numpy for that](https://stackoverflow.com/questions/38013778/is-there-any-numpy-group-by-function). – DavidDr90 Feb 18 '19 at 14:44
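
For context, a common numpy group-by pattern (roughly what the linked question covers) is to sort by the group keys and reduce over contiguous slices. A minimal sketch of a plain per-group mean along those lines, not the rolling version, assuming the rows are sorted by group first:

import numpy as np

# Per-group mean of 'Score 1' using numpy on a frame sorted by the group keys
sorted_df = df.sort_values(['Year', 'Group'])
keys = (sorted_df['Year'] + '_' + sorted_df['Group']).to_numpy()
starts = np.flatnonzero(np.r_[True, keys[1:] != keys[:-1]])   # first row of each group
sums = np.add.reduceat(sorted_df['Score 1'].to_numpy(), starts)
counts = np.diff(np.r_[starts, len(keys)])
group_means = sums / counts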

1 Answer


For anyone who lands on this, I was able to speed this up significantly by sorting first and avoiding the lambda function:

return_df[col + '_avg'] = df.sort_values(['Group', 'Year']).groupby(['Group'])[col].rolling(2,1).mean().shift().values
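
A runnable sketch of that idea against the toy frame above (keeping everything on a sorted copy called `return_df` and shifting within each group are my assumptions about how the one-liner is used):

# Sort once, then use the built-in groupby rolling instead of apply/lambda
return_df = df.sort_values(['Group', 'Year']).reset_index(drop=True)
for col in ['Score 1', 'Score 2']:
    rolled = (return_df.groupby('Group')[col]
              .rolling(2, 1).mean()
              .reset_index(level=0, drop=True))   # drop the Group level, keep the row index
    # shift within each group so the first row of a group doesn't inherit
    # the last rolling mean of the previous group
    return_df[col + '_avg'] = rolled.groupby(return_df['Group']).shift()

The question's `.bfill()` can be chained onto the shifted series if the leading NaN of each group should be filled rather than left as-is.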