I have a dataframe, and I want to groupby
some attributes and calculate the rolling
mean of a numerical column in Dask. I know there is no implementation in Dask for groupby rolling
but I read an SO question which shows it was possible.
Dask rolling function by group syntax
When I am using the same syntax from the post, I get an error :
UnpicklingError: invalid load key, '�'.
I do not understand why I am getting an unpickling error.
df.groupby(by=path)[metric].apply(lambda df_g: df_g[metric].rolling(5).mean(), meta=(metric, 'f8')).compute()
where path
is a list of attribute columns, and metric
is the numeric column.
I also tried the following:
def moving_avg(partition):
return partition.rolling(5).mean()
df.groupby(by=path)[metric].apply(moving_avg, meta='f8').compute()
I use the rolling average function in Pyspark where I define the partitions by groupby and then roll a window over it.
Sample data :
CATEGORY_NAME MKT ... Growth Sales
Date ...
2017-01-07 TP SIMS ... 0.0000 17280
2017-01-07 TP TOPRITE ... -0.4566 1825
2017-01-07 TP GIANT HYPER ... 0.0874 18417
2017-01-07 TP GIANT HYPER ... -0.1359 10914
2017-01-07 TP GIANT HYPER ... 0.2245 4422
2017-01-07 TP TOPRITE ... 0.1084 1444
2017-01-07 TP GIANT HYPER ... 0.0542 18412
2017-01-07 TP FENCER ... 0.2766 25184
2017-01-07 TP GIANT HYPER ... -0.0572 19466
2017-01-07 TP TOPRITE ... 0.1795 1503
2017-01-07 TP GIANT HYPER ... 0.0770 13615
Say I want to groupby ["CATEGORY_NAME", "MKT"]
and take a rolling average of Sales
.