4

I'd like to apply rolling functions to a dataframe grouped by two columns with repeated date entries. Specifically, with both "freq" and "window" as datetime values, not simply ints.

In principle, I'm try to combine the methods from How to apply rolling functions in a group by object in pandas and pandas rolling sum of last five minutes.

Input

Here is a sample of the data, with one id=33 although we expect several id's.

X = [{'date': '2017-02-05', 'id': 33, 'item': 'A', 'points': 20},
 {'date': '2017-02-05', 'id': 33, 'item': 'B', 'points': 10},
 {'date': '2017-02-06', 'id': 33, 'item': 'B', 'points': 10},
 {'date': '2017-02-11', 'id': 33, 'item': 'A', 'points': 1},
 {'date': '2017-02-11', 'id': 33, 'item': 'A', 'points': 1},
 {'date': '2017-02-11', 'id': 33, 'item': 'A', 'points': 1},
 {'date': '2017-02-13', 'id': 33, 'item': 'A', 'points': 4}]

# df = pd.DataFrame(X) and reindex df to pd.to_datetime(df['date'])

df

            id item  points
date                       
2017-02-05  33    A      20
2017-02-05  33    B      10
2017-02-06  33    B      10
2017-02-11  33    A       1
2017-02-11  33    A       1
2017-02-11  33    A       1
2017-02-13  33    A       4

Goal

Sample each 'id' every 2 days (freq='2d') and return the sum of total points for each item over the previous three days (window='3D'), end-date inclusive

Desired Output

            id    A    B
date                       
2017-02-05  33    20   10
2017-02-07  33    20   30    
2017-02-09  33    0    10
2017-02-11  33    3    0
2017-02-13  33    7    0

E.g. on the right-inclusive end-date 2017-02-13, we sample the 3-day period 2017-02-11 to 2017-02-13. In this period, id=33 had a sum of A points equal to 1+1+1+4 = 7

Attempts

An attempt of groupby with a pd.rolling_sum as follows didn't work, due to repeated dates

df.groupby(['id', 'item'])['points'].apply(pd.rolling_sum, freq='4D', window=3)
ValueError: cannot reindex from a duplicate axis

Also note that from the documentation http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.rolling_apply.html 'window' is an int representing the size sample period, not the number of days to sample.

We can also try resampling and using last, however the desired look-back of 3 days doesn't seem to be used

df.groupby(['id', 'item'])['points'].resample('2D', label='right', closed='right').\
apply(lambda x: x.last('3D').sum())

id  item  date      
33  A     2017-02-05    20
          2017-02-07     0
          2017-02-09     0
          2017-02-11     3
          2017-02-13     4
    B     2017-02-05    10
          2017-02-07    10

Of course,setting up a loop over unique id's ID, selecting df_id = df[df['id']==ID], and summing over the periods does work but is computationally-intensive and doesn't exploit groupby's nice vectorization.

Thanks to @jezrael for good suggestions so far

Notes

Pandas version = 0.20.1

I'm a little confused as to why the documentation on rolling() here:https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html suggests that the "window" parameter can be in an int or offset but on attempting df.rolling(window='3D',...) I getraise ValueError("window must be an integer") It appears that the above documentation is not consistent with the latest code for rolling's window from ./core/window.py : https://github.com/pandas-dev/pandas/blob/master/pandas/core/window.py

elif not is_integer(self.window):
            raise ValueError("window must be an integer")
Quetzalcoatl
  • 2,016
  • 4
  • 26
  • 36

2 Answers2

3
  • It's easiest to handle resample and rolling with date frequencies when we have a single level datetime index.
  • However, I can't pivot/unstack appropriately without dealing with duplicate A/Bs so I groupby and sum
  • I unstack one level date so I can fill_value=0. Currently, I can't fill_value=0 when I unstack more than one level at a time. I make up for it with a transpose T
  • Now that I've got a single level in the index, I reindex with a date range from the min to max values in the index
  • Finally, I do a rolling 3 day sum and resample that result every 2 days with resample
  • I clean this up with a bit of renaming indices and one more pivot.

s = df.set_index(['id', 'item'], append=True).points
s = s.groupby(level=['date', 'id', 'item']).sum()

d = s.unstack('date', fill_value=0).T
tidx = pd.date_range(d.index.min(), d.index.max())
d = d.reindex(tidx, fill_value=0)

d1 = d.rolling('3D').sum().resample('2D').first().astype(d.dtypes).stack(0)
d1 = d1.rename_axis(['date', 'id']).rename_axis(None, 1)
print(d1)

                A   B
date       id        
2017-02-05 33  20  10
2017-02-07 33  20  20
2017-02-09 33   0   0
2017-02-11 33   3   0
2017-02-13 33   7   0
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • thanks for the explanations. any tips on working with accessing datetime values in rolling methods (e.g. time of most recent nonzero entry in a given window)? at the moment, it seems only arrays of floats X can be accessed in calls like d.rolling('3D').apply(lambda X: func(X)).resample('2D') – Quetzalcoatl Jun 02 '17 at 06:56
  • That sounds like it should be a separate question. The tip would be to use `pd.Series.first_valid_index` or `pd.Series.last_valid_index` after having masked the zeros as `np.nan`. You can do some other tricks in `numpy` as well. – piRSquared Jun 02 '17 at 07:02
2
df = pd.DataFrame(X) 

# group sum by day
df = df.groupby(['date', 'id', 'item'])['points'].sum().reset_index().sort_values(['date', 'id', 'item'])

# convert index to datetime index
df = df.set_index('date')
df.index = DatetimeIndex(df.index)

# rolloing sum by 3D
df['pointsum'] = df.groupby(['id', 'item']).transform(lambda x: x.rolling(window='3D').sum())

# reshape dataframe
df = df.reset_index().set_index(['date', 'id', 'item'])['pointsum'].unstack().reset_index().set_index('date').fillna(0)

df
xmduhan
  • 965
  • 12
  • 14