Groupby - rolling alternative in Dask

Question

I am trying to implement a rolling average which resets whenever a '1' is encountered in a column labeled 'A'.

For example, the following functionality works in Pandas.

import pandas as pd

x = pd.DataFrame([[0,2,3], [0,5,6], [0,8,9], [1,8,9],[0,8,9],[0,8,9], [0,3,5], [1,8,9],[0,8,9],[0,8,9], [0,3,5]])
x.columns = ['A', 'B', 'C']

x['avg'] = x.groupby(x['A'].cumsum())['B'].rolling(2).mean().values

If I try an analogous code in Dask, I get the following:

import pandas as pd
import dask

x = pd.DataFrame([[0,2,3], [0,5,6], [0,8,9], [1,8,9],[0,8,9],[0,8,9], [0,3,5], [1,8,9],[0,8,9],[0,8,9], [0,3,5]])
x.columns = ['A', 'B', 'C']

x = dask.dataframe.from_pandas(x, npartitions=3)

x['avg'] = x.groupby(x['A'].cumsum())['B'].rolling(2).mean().values
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-189-b6cd808da8b1> in <module>()
      7 x = dask.dataframe.from_pandas(x, npartitions=3)
      8 
----> 9 x['avg'] = x.groupby(x['A'].cumsum())['B'].rolling(2).mean().values
     10 x

AttributeError: 'SeriesGroupBy' object has no attribute 'rolling'

After searching through the Dask API documentation I have not been able to find an implementation of what I am looking for.

Can anyone suggest an implementation of this algorithm in a Dask compatible way?

Thank you :)

Since then I found the following code snippet:

df1 = ddf.groupby('cumsum')['x'].apply(lambda x: x.rolling(2).mean(), meta=('x', 'f8')).compute()

at Dask rolling function by group syntax.

Here is an adapted toy example:

import pandas as pd
import dask.dataframe as dd

x = pd.DataFrame([[1,2,3], [2,3,4], [4,5,6], [2,3,4], [4,5,6],  [4,5,6], [2,3,4]])
x['bool'] = [0,0,0,1,0,1,0]
x.columns = ['a', 'b', 'x', 'bool']

ddf = dd.from_pandas(x, npartitions=4)
ddf['cumsum'] = ddf['bool'].cumsum()

df1 = ddf.groupby('cumsum')['x'].apply(lambda x: x.rolling(2).mean(), meta=('x', 'f8')).compute()
df1

This has the correct functionality, but the order of the indices is now incorrect. Alternatively, if one knows how to preserve the order of the index, that would be a suitable solution.

score 0 · Answer 1 · answered Mar 27 '19 at 05:00

0

You might want to construct your own rolling operation using the map_overlap or the _cum_agg methods (cum_agg is unfortunately not well documented).

answered Mar 27 '19 at 05:00

MRocklin

55,641
23
163
235

Groupby - rolling alternative in Dask

1 Answers1