1

I have a time series of 10 years of days, with five columns of data. Here is code to randomly generate:

import numpy as np
import pandas as pd
np.random.seed(1)

rows,cols = 4950,5
data = np.random.rand(rows,cols) 
tidx = pd.date_range('2010-01-01', periods=rows, freq='D') 
df = pd.DataFrame(data, columns=['a','b','c','d','e'], index=tidx)
print (df)

For each day in the time series I would like to calculate the mean for the same day in the PREVIOUS five years. For example for the date 1/1/2023: take the average of 1/1/2022, 1/1/2021, 1/1/2020, 1/1/2019 and 1/1/2018. I've tried many variations on:

df.groupby([df.index.month, df.index.day]).mean()

But I can't get it use just the five years prior to the index.year in the mean calculation for each row. I would like to add five columns with the "rolling" average, and name each column for example "A_Avg"

thesimplevoodoo
  • 153
  • 3
  • 11
  • Are you aware of [`rolling`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html)? I think you want `.groupby(...).rolling(5).mean()`, but I'm relatively new to Pandas myself. – wjandrea Aug 04 '23 at 00:46
  • It'd help if you posted a [mre] so we could check the answers. For specifics, see [How to make good reproducible pandas examples](/q/20109391/4518341). For minimization, I think you could go as low as 1 column, 3 years of data, and a 2-year rolling window. – wjandrea Aug 04 '23 at 01:13

3 Answers3

1

From your DataFrame, we can start by convert the Date column as Datetime :

df['Date'] = pd.to_datetime(df['Date'], format="%d/%m/%Y")

Then we separate the year, month and year in columns :

df['year'] = df['Date'].dt.year
df['month'] = df['Date'].dt.month
df['day'] = df['Date'].dt.day

Next step, we reorder the row depending on day, month and year and reindex the DataFrame accordingly like so :

df = df.sort_values(by=['day', 'month', 'year']).reset_index(drop=True)

To finish, we can use the rolling method setup on 5 to get the expected result on each columns :

df['A_Avg'] = df.groupby(['day', 'month'], as_index=False)['A'].rolling(5).mean()['A']
df['B_Avg'] = df.groupby(['day', 'month'], as_index=False)['B'].rolling(5).mean()['B']
df['C_Avg'] = df.groupby(['day', 'month'], as_index=False)['C'].rolling(5).mean()['C']
df['D_Avg'] = df.groupby(['day', 'month'], as_index=False)['D'].rolling(5).mean()['D']
df['E_Avg'] = df.groupby(['day', 'month'], as_index=False)['E'].rolling(5).mean()['E']
tlentali
  • 3,407
  • 2
  • 14
  • 21
  • Thanks. It appears I was having issues w my code due to the index in my code having a name. If i run: ```df.index.name = None``` Then resulting df retains the date column. Not clear to me what the reason for this is – thesimplevoodoo Aug 07 '23 at 22:00
1

It sounds like you want rolling.

mean_5y_prev = df.groupby([df.index.month, df.index.day]).rolling(5).mean()

Then you can join back to df:

df.join(mean_5y_prev.droplevel([0, 1]), rsuffix='_Avg')
wjandrea
  • 28,235
  • 9
  • 60
  • 81
  • This first line produces the right averages, but there is a 2d multiindex. So you can't drop both levels in the join. The data returned by first line is not sorted in the same order as the original time series (chronologically), so you can't append. Can you suggest something to sort line 1 or to join with the multi-index data? – thesimplevoodoo Aug 04 '23 at 03:00
  • @thesimplevoodoo I'm not following. Why can't you drop levels 0 and 1? Those are the month and day added by the `groupby`, then you're left with the date added by the `rolling`, and you can simply join on the index of the df, which is the `Date`. If you're getting an error or something, please make an MRE like I suggested in a comment on the question. It works for me with a df I made up, but now I'm not sure I understand your data structure. – wjandrea Aug 04 '23 at 04:50
  • Whoops, the `rolling` doesn't add an index level. That's just the original index. – wjandrea Aug 04 '23 at 05:13
  • I tried to include this data but it exceeds the character limite by 2x – thesimplevoodoo Aug 07 '23 at 21:28
  • ok, its not clear to me what happened. in my data, the index column had a name: "Date". if i use this first: ```df.index.name = None``` Then your code works perfectly. But otherwise it removes the dates in the original "dates" index column. Not sure why – thesimplevoodoo Aug 07 '23 at 21:58
0

Create dataframe.

df = pd.DataFrame({
    'Date': ['1/1/2023', '1/2/2023', '1/3/2023', '1/1/2022', '1/2/2022', '1/3/2022', '1/1/2015', '1/2/2015', '1/3/2015'],
    'A': [100, 105, 110, 100, 105, 110, 100, 105, 110],
    'B': [200, 205, 210, 200, 205, 210, 200, 205, 210],
    'C': [300, 305, 310, 300, 305, 310, 300, 305, 310],
    'D': [400, 405, 410, 400, 405, 410, 400, 405, 410],
    'E': [500, 505, 510, 500, 505, 510, 500, 505, 510],
})
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
df['year'] = df['Date'].dt.year
df['month'] = df['Date'].dt.month
df['day'] = df['Date'].dt.day

Get a list of years which can be used to cut the dataframe.

bins = [y for y in range(df['year'].max(), df['year'].min() - 5, -5)]
bins.reverse()
# [2013, 2018, 2023]

Calculate the mean of groups.

df['year'] = pd.cut(df['year'], bins)
df = df.groupby(['year', 'month', 'day'], as_index=False).mean(numeric_only=True)

output:

           year  month  day      A      B      C      D      E
0  (2013, 2018]      1    1  100.0  200.0  300.0  400.0  500.0
1  (2013, 2018]      1    2  105.0  205.0  305.0  405.0  505.0
2  (2013, 2018]      1    3  110.0  210.0  310.0  410.0  510.0
3  (2018, 2023]      1    1  100.0  200.0  300.0  400.0  500.0
4  (2018, 2023]      1    2  105.0  205.0  305.0  405.0  505.0
5  (2018, 2023]      1    3  110.0  210.0  310.0  410.0  510.0
Shuo
  • 1,512
  • 1
  • 3
  • 13