2

I am trying to compute the rolling mean over the last n days(with n = 30) on a large dataset. In Pandas, I'd use the following command:

 temp = chunk.groupby('id_code').apply(lambda x: x.set_index('entry_time_flat').resample('1D').first())
    dd = temp.groupby(level=0)['duration'
                                ].apply(lambda x: x.shift().rolling(min_periods = 1,window = n_days).mean()
                                        ).reset_index(name = "avg_delay_"+ str(n_days) + "_days")

    chunk = pd.merge(chunk, dd, on=['entry_time_flat', 'id_code'], how='left'
                     ).dropna(subset = ["avg_delay_"+ str(n_days) + "_days"])

Basically, the function groups by "id code" and, for the last n-days over "entry_time_flat" (a datetime object), computes the mean value of feature "duration".

However, in order to keep the code efficient, it would be great to reproduce this function on a Dask dataframe, without transforming it into a Pandas DF.

If I run the aforementioned code on a Dask DF, it raises the following error:

TypeError: __init__() got an unexpected keyword argument 'level'

Ultimately, how could I compute the mean of the "duration" column, over the last n-days on a Dask dataframe?

Alessandro Ceccarelli
  • 1,775
  • 5
  • 21
  • 41

1 Answers1

0

Ultimately, how could I compute the mean of the "duration" column, over the last n-days on a Dask dataframe?

The rolling API should give you this functionality

https://docs.dask.org/en/latest/dataframe-api.html#rolling

MRocklin
  • 55,641
  • 23
  • 163
  • 235
  • `ddf.groupby('g')['v'].apply(lambda x: x.rolling('3D').mean(), meta=('avg3d', 'f8'))` this worked but I can't get it back into the original frame, any pointers? see full example https://stackoverflow.com/questions/70867704/dask-calculate-groupby-rolling-mean-over-the-last-n-days-and-assign-to-original – citynorman Jan 26 '22 at 17:27