2

I'm currently working with panel data in Python and I'm trying to compute the rolling average for each time series observation within a given group (ID).

Given the size of my data set (thousands of groups with multiple time periods), the .groupby and .apply() functions are taking way too long to compute (has been running over an hour and still nothing -- entire data set only contains around 300k observations).

I'm ultimately wanting to iterate over multiple columns, doing the following:

  1. Compute a rolling average for each time step in a given column, per group ID
  2. Create a new column containing the difference between the original value and the moving average [x_t - (x_t-1 + x_t)/2]
  3. Store column in a new DataFrame, which would be identical to original data set, except that it has the residual from #2 instead of the original value.
  4. Repeat and append new residuals to df_resid (as seen below)

    df_resid
    date        id   rev_resid   exp_resid
    2005-09-01   1         NaN         NaN
    2005-12-01   1      -10000       -5500
    2006-03-01   1     -352584   -262058.5
    2006-06-01   1      240000    190049.5 
    2006-09-01   1    82648.75    37724.25
    2005-09-01   2         NaN         NaN
    2005-12-01   2      4206.5       24353
    2006-03-01   2     -302574     -331951
    2006-06-01   2      103179    117405.5
    2006-09-01   2      -52650    -72296.5
    

Here's small sample of the original data.

df
date        id        rev        exp
2005-09-01   1   745168.0   545168.0    
2005-12-01   1   725168.0   534168.0    
2006-03-01   1    20000.0    10051.0
2006-06-01   1   500000.0   390150.0
2006-09-01   1   665297.5   465598.5
2005-09-01   2   956884.0   736987.0
2005-12-01   2   965297.0   785693.0
2006-03-01   2   360149.0   121791.0
2006-06-01   2   566507.0   356602.0
2006-09-01   2   461207.0   212009.0

And the (very slow) code:

df['rev_resid'] = df.groupby('id')['rev'].apply(lambda x:x.rolling(center=False,window=2).mean()) 

I'm hoping there is a much more computationally efficient way to do this (primarily with respect to #1), and could be extended to multiple columns.

Any help would be truly appreciated.

bosbraves
  • 65
  • 4
  • Hope the link can help https://stackoverflow.com/questions/13996302/python-rolling-functions-for-groupby-object – BENY Aug 03 '17 at 20:16

1 Answers1

3

To quicken up the calculation, if dataframe is already sorted on 'id' then you don't have to do rolling within a groupby (if it isn't sorted... do so). Then since your window is only length 2 then we mask the result by checking where id == id.shift This works because it's sorted.

d1 = df[['rev', 'exp']]
df.join(
    d1.rolling(2).mean().rsub(d1).add_suffix('_resid')[df.id.eq(df.id.shift())]
)

         date  id       rev       exp  rev_resid  exp_resid
0  2005-09-01   1  745168.0  545168.0        NaN        NaN
1  2005-12-01   1  725168.0  534168.0  -10000.00   -5500.00
2  2006-03-01   1   20000.0   10051.0 -352584.00 -262058.50
3  2006-06-01   1  500000.0  390150.0  240000.00  190049.50
4  2006-09-01   1  665297.5  465598.5   82648.75   37724.25
5  2005-09-01   2  956884.0  736987.0        NaN        NaN
6  2005-12-01   2  965297.0  785693.0    4206.50   24353.00
7  2006-03-01   2  360149.0  121791.0 -302574.00 -331951.00
8  2006-06-01   2  566507.0  356602.0  103179.00  117405.50
9  2006-09-01   2  461207.0  212009.0  -52650.00  -72296.50
piRSquared
  • 285,575
  • 57
  • 475
  • 624