I'm currently working with panel data in Python and I'm trying to compute the rolling average for each time series observation within a given group (ID).
Given the size of my data set (thousands of groups, each with multiple time periods), the .groupby() and .apply() combination is taking far too long to compute: it has been running for over an hour with nothing to show, even though the entire data set only contains around 300k observations.
Ultimately, I want to iterate over multiple columns and do the following:
1. Compute a rolling average for each time step in a given column, per group ID.
2. Create a new column containing the difference between the original value and the moving average, x_t - (x_{t-1} + x_t)/2 (a quick numeric check follows the expected output below).
3. Store the column in a new DataFrame, which would be identical to the original data set except that it holds the residual from #2 instead of the original value.
4. Repeat for each column and append the new residuals to df_resid (as seen below).
df_resid
date id rev_resid exp_resid
2005-09-01 1 NaN NaN
2005-12-01 1 -10000 -5500
2006-03-01 1 -352584 -262058.5
2006-06-01 1 240000 190049.5
2006-09-01 1 82648.75 37724.25
2005-09-01 2 NaN NaN
2005-12-01 2 4206.5 24353
2006-03-01 2 -302574 -331951
2006-06-01 2 103179 117405.5
2006-09-01 2 -52650 -72296.5
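As a quick numeric check of the formula in #2, using id 1's rev values for 2005-12-01 and 2006-03-01 from the original data below:

prev, curr = 725168.0, 20000.0    # rev at 2005-12-01 and 2006-03-01 for id 1
resid = curr - (prev + curr) / 2  # x_t - (x_{t-1} + x_t)/2
print(resid)                      # -352584.0, matching rev_resid above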
Here's a small sample of the original data.
df
date id rev exp
2005-09-01 1 745168.0 545168.0
2005-12-01 1 725168.0 534168.0
2006-03-01 1 20000.0 10051.0
2006-06-01 1 500000.0 390150.0
2006-09-01 1 665297.5 465598.5
2005-09-01 2 956884.0 736987.0
2005-12-01 2 965297.0 785693.0
2006-03-01 2 360149.0 121791.0
2006-06-01 2 566507.0 356602.0
2006-09-01 2 461207.0 212009.0
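For reference, the sample above can be rebuilt like this (I'm treating date and id as regular columns rather than an index, which is how the tables are shown):

import pandas as pd

# Rebuild the sample shown above: two ids, five quarterly dates each
df = pd.DataFrame({
    'date': pd.to_datetime(['2005-09-01', '2005-12-01', '2006-03-01',
                            '2006-06-01', '2006-09-01'] * 2),
    'id': [1] * 5 + [2] * 5,
    'rev': [745168.0, 725168.0, 20000.0, 500000.0, 665297.5,
            956884.0, 965297.0, 360149.0, 566507.0, 461207.0],
    'exp': [545168.0, 534168.0, 10051.0, 390150.0, 465598.5,
            736987.0, 785693.0, 121791.0, 356602.0, 212009.0],
})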
And the (very slow) code, which so far only covers #1 (the rolling mean itself):
df['rev_resid'] = df.groupby('id')['rev'].apply(lambda x: x.rolling(center=False, window=2).mean())
I'm hoping there is a much more computationally efficient way to do this (primarily with respect to #1) that could also be extended to multiple columns.
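For concreteness, here is a sketch of the full computation I'm after, just looping the slow code above over both columns (the positional subtraction assumes df is already sorted by id and then date, as in the sample):

# Same slow approach, extended to both columns (sketch only)
df_resid = df[['date', 'id']].copy()
for col in ['rev', 'exp']:
    roll = df.groupby('id')[col].apply(lambda x: x.rolling(center=False, window=2).mean())
    # Subtract positionally; groupby concatenates groups in id order,
    # which matches the row order when df is sorted by id and date
    df_resid[col + '_resid'] = df[col].values - roll.values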
Any help would be truly appreciated.