I am trying to compute the rolling mean over the last n days (with n = 30) on a large dataset. In Pandas, I'd use the following code:
import pandas as pd

# One row per id_code per calendar day
temp = chunk.groupby('id_code').apply(
    lambda x: x.set_index('entry_time_flat').resample('1D').first())
# Rolling mean of 'duration' over the previous n_days, excluding the current day
dd = temp.groupby(level=0)['duration'].apply(
    lambda x: x.shift().rolling(window=n_days, min_periods=1).mean()
).reset_index(name="avg_delay_" + str(n_days) + "_days")
# Merge the averages back and drop rows where no average could be computed
chunk = pd.merge(chunk, dd, on=['entry_time_flat', 'id_code'], how='left'
).dropna(subset=["avg_delay_" + str(n_days) + "_days"])
Basically, the code groups by "id_code" and, for each group, computes the mean of the "duration" feature over the last n days of "entry_time_flat" (a datetime column).
However, to keep the code efficient, it would be great to reproduce this logic on a Dask dataframe directly, without converting it into a Pandas DF.
If I run the aforementioned code on a Dask DF, it raises the following error:
TypeError: __init__() got an unexpected keyword argument 'level'
Ultimately, how can I compute the mean of the "duration" column over the last n days on a Dask dataframe?