11

I want to apply a function to a rolling window. All the answers I saw here are focused on applying to a single row / column, but I would like to apply my function to the entire window. Here is a simplified example:

import pandas as pd
data = [ [1,2], [3,4], [3,4], [6,6], [9,1], [11,2] ]
df = pd.DataFrame(columns=list('AB'), data=data)

This is df:

    A   B
0   1   2
1   3   4
2   3   4
3   6   6
4   9   1
5   11  2

Take some function to apply to the entire window:

df.rolling(3).apply(lambda x: x.shape)

In this example, I would like to get something like:

    some_name   
0   NA  
1   NA  
2   (3,2)   
3   (3,2)   
4   (3,2)   
5   (3,2)   

Of course, the shape is used as an example showing f treats the entire window as the object of calculation, not just a row / column. I tried playing with the axis keyword for rolling, as well as with the raw keyword for apply but with no success. Other methods (agg, transform) do not seem to deliver either.

Sure, I can do this with a list comprehension. Just thought there is an easier / cleaner way of doing this.

Yair Daon
  • Maybe something like this would help: https://stackoverflow.com/questions/20180324/bin-pandas-dataframe-by-every-x-rows – Shaido May 05 '19 at 10:13
  • Does the answer below answer your question? With pandas I don't think there's a cleaner way of doing this. – Ouyang Ze Jun 28 '19 at 06:59

3 Answers

13

Not with pd.DataFrame.rolling. Its apply calls the function separately for each column, passing each window as a series of floats/NaN and expecting a single number back. I think you'll have better luck following your intuition and using a list comprehension:

def rolling_pipe(dataframe, window, fctn):
    return pd.Series([dataframe.iloc[i-window: i].pipe(fctn) 
                      if i >= window else None 
                      for i in range(1, len(dataframe)+1)],
                     index = dataframe.index) 

df.pipe(rolling_pipe, 3, lambda x: x.shape)
Ouyang Ze
  • could you maybe briefly explain what this is doing? Thanks! – user6400946 Mar 03 '21 at 09:32
  • 4
    Sure-- `pd.DataFrame.pipe` is an incredibly useful method. It takes a function as its argument. The function's first input is a pd.DataFrame. To get the most power from `pipe`, you usually want it returning a `Series` or `DataFrame` object so that you can chain these pipes together... but that's a separate topic. – Ouyang Ze Mar 05 '21 at 04:12
  • 5
    In this case, we know that we want to "rolling apply" a function to subsets of the dataframe: start with a first "cut" of the dataframe whose size is set by the `window` param, get the value returned by `fctn` on that cut (with `.iloc[..].pipe(fctn)`), and then keep rolling down the dataframe this way (with the list comprehension). The obvious object to return is a pd.Series with the same index (`index=dataframe.index`) as the input dataframe. – Ouyang Ze Mar 05 '21 at 04:26
  • 4
    Also two notes: 1, `fctn` here is a function that expects a `pd.DataFrame` as input and is assumed to return a non-iterable output like a number or string. There is a version of this function that could return dataframes instead of series, just not as it's written above. 2, since this post I've come across a similar-looking function called `pd.rolling_apply`, but the documentation for it is lacking, so you'd have to test it yourself to see if it's doing the same thing as `rolling_pipe`. – Ouyang Ze Mar 05 '21 at 04:32
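A variant of the same idea, as a minimal sketch: recent pandas versions (1.1 and later, as far as I know) let you iterate over the Rolling object directly, and each item is the window as a DataFrame. The helper name `rolling_windows_apply` is made up for this illustration. Iteration also yields the shorter windows at the start, so the sketch checks the window length itself.

```python
import pandas as pd

def rolling_windows_apply(frame, window, func):
    """Apply func to every full window of `window` rows, each passed as a DataFrame."""
    results = [
        # The first window - 1 items are shorter than `window`; mark them as missing.
        func(win) if len(win) == window else None
        for win in frame.rolling(window)
    ]
    return pd.Series(results, index=frame.index)

data = [[1, 2], [3, 4], [3, 4], [6, 6], [9, 1], [11, 2]]
df = pd.DataFrame(data, columns=list("AB"))

shapes = rolling_windows_apply(df, 3, lambda w: w.shape)
print(shapes)
```

Like `rolling_pipe`, this collects results in a plain list, so `func` can return any Python object, not only a number.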
1

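With the default raw=False, the argument supplied to your apply function is a Series whose index holds the labels of the rows in the current window. For a DataFrame with the default index this is a RangeIndex with start, stop and step properties, for example: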

RangeIndex(start=0, stop=2, step=1)

You can use this to query your data frame.

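import pandas as pd

df = pd.DataFrame([('Sean', i) for i in range(1, 11)], columns=['name', 'value'])

def func(series):
    # series.index holds the labels of the rows in the current window;
    # with the default RangeIndex these equal the row positions, so iloc works
    view = df.iloc[series.index]
    # use view (all columns of the window) to do something...
    count = len(view[view.value.isin([1, 2, 8])])
    return count

df['count'] = df.value.rolling(2).apply(func)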

There may be a more efficient way to do this but I'm not sure how.
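Applied to the DataFrame from the question, the same trick might look like the sketch below. The function name `window_total` is invented for this example, and `df.loc` is used instead of `iloc` because the window's index contains labels. One limitation to keep in mind: Rolling.apply expects a single number back, so returning something like `x.shape` is not possible this way.

```python
import pandas as pd

data = [[1, 2], [3, 4], [3, 4], [6, 6], [9, 1], [11, 2]]
df = pd.DataFrame(data, columns=list("AB"))

def window_total(series):
    # Look up all columns for the rows in the current window by label.
    window = df.loc[series.index]
    # Reduce the whole 3x2 window to one number, as Rolling.apply requires.
    return float(window.to_numpy().sum())

# Roll over a single column only to obtain the window's row labels.
totals = df["A"].rolling(3).apply(window_total, raw=False)
print(totals)
```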

Dharman
seanbehan
1

If you need rolling application over a datetime-like index, the other answers are not sufficient.

You have to resort to manually iterating over the Rolling object, and reconstructing the result into a Series or DataFrame as needed:

from datetime import (
    datetime as DateTime,
    timedelta as TimeDelta,
    timezone as TimeZone,
)
import pandas as pd

now = DateTime.now(tz=TimeZone.utc)

df = pd.DataFrame([
    {'t': now + TimeDelta(days=1), 'x': 11, 'y': 21},
    {'t': now + TimeDelta(days=2), 'x': 12, 'y': 22},
    {'t': now + TimeDelta(days=3), 'x': 13, 'y': 23},
    {'t': now + TimeDelta(days=4), 'x': 14, 'y': 24},
]).set_index('t')

results = {}
for group in df.rolling('2D'):
    # Perform a silly calculation, in this case an aggregation
    result = group['y'].min() * group['x'].max()
    # Choose a value to use as the resulting index
    index = group.index.min()
    results[index] = result
results = pd.Series(results)
print(results)

This prints:

2022-07-15 01:41:05.121823+00:00    252
2022-07-16 01:41:05.121823+00:00    286
2022-07-17 01:41:05.121823+00:00    322
dtype: int64

This works analogously to iterating over a GroupBy object. Unlike with GroupBy, however, iterating does not yield the actual bounds used for each rolling window, and I am not aware of a way to obtain them manually.
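One detail worth noting about the loop above: keying by `group.index.min()` makes the first two windows share a key, which is why only three rows are printed for four input rows. A minimal sketch of an alternative, assuming the default `closed='right'` (each window then ends at the row it belongs to), is to key by the last label instead. Fixed timestamps are used here purely to keep the example reproducible.

```python
import pandas as pd

# Fixed, timezone-aware daily timestamps (chosen arbitrarily for this sketch).
index = pd.date_range(pd.Timestamp("2022-07-15", tz="UTC"), periods=4, freq="D")
df = pd.DataFrame({"x": [11, 12, 13, 14], "y": [21, 22, 23, 24]}, index=index)

results = {}
for group in df.rolling("2D"):
    # With closed='right', the last label in the window is the current row,
    # so every window gets its own key and no result is overwritten.
    results[group.index[-1]] = group["y"].min() * group["x"].max()

results = pd.Series(results)
print(results)
```

Under the same assumption, the window for a row labelled t covers the interval (t - 2 days, t], so the upper bound is `group.index[-1]`; `group.index[0]` is only the earliest row that happens to fall inside the window, not the lower bound itself.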

I expected that this should also be possible with the new method= kwarg in DataFrame.rolling, but I wasn't able to get it to work properly. I will post a separate answer if I figure it out!

shadowtalker