I am looking at annualized baseball statistics and would like to calculate a rolling mean looking back at the previous 3 years' worth of performance in regard to number of Hits. However, I want to account for the fact that while my dataset reaches back more than 3 years, one single player may have only been in the league for 1-2 years and will not have 3 years' worth of observations off of which I can calculate the rolling mean. For example:

In[6]: df = pd.DataFrame({'PLAYER_ID': ['A', 'A', 'A', 'B', 'B'], 
                          'HITS': [45, 55, 50, 20, 24]})
In[9]: df
Out[9]: 
  PLAYER_ID  HITS
0         A    45
1         A    55
2         A    50
3         B    20
4         B    24

How would I use a groupby and aggregation/transform (or some other process) to calculate the rolling mean for each player over at most the previous 3 years' totals, falling back to however many historic observations are available for a player with fewer than 3 years of performance data?

Pretty sure my answer lies within the Pandas package, but I'd be interested in any solution.

Thanks!

KidMcC
  • Are you sure a rolling mean is actually what you want? With only one to three observations per player, this data is too sparse for a rolling mean to be particularly meaningful. – Aleksey Bilogur Jan 28 '17 at 18:39
  • It was annualized for simplicity of explanation. I'll be doing this on game level data along with some other things. – KidMcC Jan 28 '17 at 18:43
  • What's your [sampling frequency](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#resampling)? – Aleksey Bilogur Jan 28 '17 at 18:45
  • Does sampling frequency impact how I ensure I only look back so far as I have available data for every Player ID? My example was for simplicity of explanation. – KidMcC Jan 28 '17 at 19:01
  • Another discussion you may find interesting is here: https://stackoverflow.com/q/15771472 – blong Oct 12 '20 at 17:51

1 Answer

pd.DataFrame.rolling handles this problem for you automatically. Using your example data, df.groupby('PLAYER_ID').rolling(1).mean() will give you:

             HITS PLAYER_ID
PLAYER_ID
A         0  45.0         A
          1  55.0         A
          2  50.0         A
B         3  20.0         B
          4  24.0         B

For your example case I'm using a window size of just 1, which means that we're treating each individual observation as its own mean. This isn't particularly interesting. With more data you can use a larger window size: for example, if your data is weekly, rolling(5) would give you an approximately monthly window size (or rolling(31) if your data is daily, and so on).
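To match the question's actual requirement (a 3-year lookback that falls back to whatever history exists for newer players), a minimal sketch using `rolling`'s `min_periods` parameter; the `ROLLING_HITS` column name is just for illustration:

```python
import pandas as pd

df = pd.DataFrame({'PLAYER_ID': ['A', 'A', 'A', 'B', 'B'],
                   'HITS': [45, 55, 50, 20, 24]})

# Window of 3, but min_periods=1 lets partial windows (players with
# fewer than 3 seasons of data) use however many observations exist.
rolling_avg = (df.groupby('PLAYER_ID')['HITS']
                 .rolling(3, min_periods=1)
                 .mean()
                 .reset_index(level=0, drop=True))  # drop the group level
df['ROLLING_HITS'] = rolling_avg
```

Note that the window includes the current row; if you want strictly *prior* years, apply `.shift(1)` within each group before (or after) the rolling mean.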

Two issues to be aware of when using this methodology:

  1. If your data isn't sampled on a regular basis (e.g. if it skips a week or a month at a time), your rolling average won't be aligned in time. For this reason if your data isn't already regularly sampled you'll usually want to resample it.
  2. If your data contains NaN values, those will be propagated: every window containing that NaN will also be NaN. You'll have to impute those values somehow to keep that from happening.
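A quick illustration of point 2 on a toy series (the variable names are just for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 30.0, 40.0])

# Every 2-observation window that touches the NaN is itself NaN...
with_nan = s.rolling(2).mean()

# ...unless you impute first; linear interpolation is one simple option.
imputed = s.interpolate().rolling(2).mean()
```

Here `with_nan` is NaN everywhere except the last window, while the interpolated version yields a mean for every full window.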
Aleksey Bilogur
  • Can't say this works for me. When I try this I get an AttributeError: "Cannot access callable attribute 'rolling' of 'DataFrameGroupBy' object". I can, however, do this: `df_test.sort_values(by=['PLAYER_ID', 'HITS'], ascending=True, inplace=True)` followed by `df_test = df_test.groupby(['PLAYER_ID'], as_index=False).agg({'HITS': {'Rolling Avg HITS': lambda x: x[-3:].mean()}})`, which gives one row per player (A → 50, B → 22). So long as it's sorted beforehand, won't this do the same? – KidMcC Jan 29 '17 at 00:07
  • What version of `pandas` are you using? The `rolling` object was introduced relatively recently; before that you would use `rolling_mean` instead, so try that. As for the workaround: yes, that might work in this specific instance, but I'm not sure how well it would generalize. – Aleksey Bilogur Jan 29 '17 at 00:20
  • Pandas 0.18.0, so that could be it. And thanks again for the help. – KidMcC Jan 29 '17 at 00:21