7

I am working on code that applies a function over a rolling window, where the function returns several values that should end up as separate columns.

Input: Pandas Series
Expected output: 3-column DataFrame

def fun1(series):
    # Some calculations producing numbers a, b and c
    return {"a": a, "b": b, "c": c}

res.rolling('21 D').apply(fun1)

Contents of res:

time
2019-09-26 16:00:00    0.674969
2019-09-26 16:15:00    0.249569
2019-09-26 16:30:00   -0.529949
2019-09-26 16:45:00   -0.247077
2019-09-26 17:00:00    0.390827
                         ...   
2019-10-17 22:45:00    0.232998
2019-10-17 23:00:00    0.590827
2019-10-17 23:15:00    0.768991
2019-10-17 23:30:00    0.142661
2019-10-17 23:45:00   -0.555284
Length: 1830, dtype: float64

Error:

TypeError: must be real number, not dict

What I've tried:

  • Setting raw=True in apply
  • Using a lambda function in apply
  • Returning the result of fun1 as a list/NumPy array/DataFrame/Series.

I have also gone through many related posts on SO, to name a few:

But none of the solutions given there solves this problem.

Is there a straightforward solution to this?

Ébe Isaac

3 Answers

7

Here is a hacky answer using rolling, producing a DataFrame:

import pandas as pd
import numpy as np

dr = pd.date_range('09-26-2019', '10-17-2019', freq='15T')
data = np.random.rand(len(dr))

s = pd.Series(data, index=dr)

output = pd.DataFrame(columns=['a','b','c'])

row = 0

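def compute(window, df):
    global row
    a = window.max()
    b = window.min()
    c = a - b
    # Side effect: store this window's results as a new row of the output frame
    df.loc[row, ['a', 'b', 'c']] = [a, b, c]
    row += 1
    # rolling.apply needs a numeric return value; this one is discarded
    return 1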
    
s.rolling('1D').apply(compute,kwargs={'df':output})

output.index = s.index

It seems like the rolling apply function is always expecting a number to be returned, in order to immediately generate a new Series based on the calculations.
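To make the point above concrete, here is a minimal sketch on a made-up four-element Series (the data and the names `ranges` and `dict_accepted` are assumptions for illustration). A scalar return value goes through, while a dict return value is rejected with a TypeError like the one in the question:

```python
import pandas as pd

# Tiny illustrative Series; the values are arbitrary
s = pd.Series([1.0, 2.0, 3.0, 4.0])

# A scalar return value is accepted: one float per window
ranges = s.rolling(2).apply(lambda w: w.max() - w.min(), raw=True)

# A dict return value is rejected, because each result must fit in a float slot
try:
    s.rolling(2).apply(lambda w: {"a": w.max(), "b": w.min()}, raw=True)
    dict_accepted = True
except TypeError:
    dict_accepted = False
```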

I get around this by creating a new output DataFrame (with the desired output columns) and writing to it from within the function. I'm not sure whether there is a way to get the index within a rolling object, so I instead use a global counter to keep track of the row being written. Because of the point above, the function still needs to return some number. So while the actual rolling operation returns a Series of 1s, output is modified:

In[0]:
s

Out[0]:
2019-09-26 00:00:00    0.106208
2019-09-26 00:15:00    0.979709
2019-09-26 00:30:00    0.748573
2019-09-26 00:45:00    0.702593
2019-09-26 01:00:00    0.617028
  
2019-10-16 23:00:00    0.742230
2019-10-16 23:15:00    0.729797
2019-10-16 23:30:00    0.094662
2019-10-16 23:45:00    0.967469
2019-10-17 00:00:00    0.455361
Freq: 15T, Length: 2017, dtype: float64

In[1]:
output

Out[1]:
                           a         b         c
2019-09-26 00:00:00  0.106208  0.106208  0.000000
2019-09-26 00:15:00  0.979709  0.106208  0.873501
2019-09-26 00:30:00  0.979709  0.106208  0.873501
2019-09-26 00:45:00  0.979709  0.106208  0.873501
2019-09-26 01:00:00  0.979709  0.106208  0.873501
                      ...       ...       ...
2019-10-16 23:00:00  0.980544  0.022601  0.957943
2019-10-16 23:15:00  0.980544  0.022601  0.957943
2019-10-16 23:30:00  0.980544  0.022601  0.957943
2019-10-16 23:45:00  0.980544  0.022601  0.957943
2019-10-17 00:00:00  0.980544  0.022601  0.957943

[2017 rows x 3 columns]

This feels like more of an exploit of rolling than an intended use, so I would be interested to see a more elegant answer.

UPDATE: Thanks to @JuanPi, you can get the rolling window's index using this answer. So a non-global answer could look like this:

def compute(window, df):
    a = window.max()
    b = window.min()
    c = a - b
    # Label the output row with the last (most recent) timestamp in the window
    df.loc[window.index.max(), ['a', 'b', 'c']] = [a, b, c]
    return 1
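
For reference, a self-contained sketch of the same side-effect idea that collects the rows in a plain list and builds the DataFrame once at the end, rather than writing to it row by row. The synthetic data and the names `records` and `compute` are assumptions for illustration. It relies on raw=False (the default), so that each window arrives as a Series carrying its index:

```python
import numpy as np
import pandas as pd

# Synthetic data, assumed for illustration only
dr = pd.date_range("2019-09-26", "2019-09-28", freq="15min")
s = pd.Series(np.random.default_rng(0).random(len(dr)), index=dr)

records = []  # one dict per window, appended in the order the windows are visited


def compute(window):
    # With raw=False the window is a Series, so its index is available
    a = window.max()
    b = window.min()
    records.append({"time": window.index[-1], "a": a, "b": b, "c": a - b})
    return 0.0  # rolling.apply still needs a numeric return value


s.rolling("1D").apply(compute, raw=False)

# Build the 3-column result in one go
output = pd.DataFrame(records).set_index("time")
```

Building the frame once avoids the repeated enlargement that df.loc assignment performs for every new row.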
Tom
  • You can get the indexes of the current window using the trick in this answer https://stackoverflow.com/a/60918101 – JuanPi Jul 05 '20 at 13:28
  • @JuanPi Thanks for sharing, I was going to ask about this! I updated my answer to include this – Tom Jul 05 '20 at 16:17
  • Not so hacky, you're basically leveraging pandas rolling functionality as a window generator. What you don't get are the leading NaNs with usual rolling window, but they could be prepended if required. – GratefulGuest Mar 04 '21 at 02:51
4

This hack seems to work for me, although the additional features of rolling are not available with this solution. However, it runs significantly faster thanks to multiprocessing.

from multiprocessing import Pool
import functools

import pandas as pd


def apply_fn(indices, fn, df):
    # Evaluate fn on the rows of df selected by one window's index labels
    return fn(df.loc[indices])


def rolling_apply(df, fn, window_size, start=None, end=None):
    """
    The rolling application of a function fn on a DataFrame df given the window_size
    """
    x = df.index
    if start is not None:
        x = x[x >= start]
    if end is not None:
        x = x[x <= end]
    if isinstance(window_size, str):
        # Time-based window: labels in the half-open interval (i - delta, i]
        delta = pd.Timedelta(window_size)
        index_sets = [x[(x > (i - delta)) & (x <= i)] for i in x]
    else:
        assert isinstance(window_size, int), "Window size should be str (representing Timedelta) or int"
        # Integer window: compares index labels, so this assumes a numeric index
        delta = window_size
        index_sets = [x[(x > (i - delta)) & (x <= i)] for i in x]

    # One task per window; fn and df are pickled and sent to the worker processes
    with Pool() as pool:
        result = pool.map(functools.partial(apply_fn, fn=fn, df=df), index_sets)
    result = pd.DataFrame(data=result, index=x)

    return result

With the above functions in place, pass the function to be rolled to the custom rolling_apply function.

result = rolling_apply(res, fun1, "21 D")

Contents of result:

                            a         b         c
time
2019-09-26 16:00:00       NaN       NaN       NaN
2019-09-26 16:15:00  0.500000  0.106350  0.196394
2019-09-26 16:30:00  0.500000  0.389759 -0.724829
2019-09-26 16:45:00  2.000000  0.141436 -0.529949
2019-09-26 17:00:00  6.010184  0.141436 -0.459231
...                       ...       ...       ...
2019-10-17 22:45:00  4.864015  0.204483 -0.761609
2019-10-17 23:00:00  6.607717  0.204647 -0.761421
2019-10-17 23:15:00  7.466364  0.204932 -0.761108
2019-10-17 23:30:00  4.412779  0.204644 -0.760386
2019-10-17 23:45:00  0.998308  0.203039 -0.757979

[1830 rows x 3 columns]

Note:

  • This implementation works for both Series and DataFrame input
  • This implementation works for both time and integer windows
  • The result returned by fun1 can even be a list, numpy array, series or a dictionary
  • window_size is only the maximum window size, so for indices closer to the start than window_size, the window includes all elements from the start up to and including the current one.
  • apply_fn must not be nested inside rolling_apply, and fn must not be a lambda: pool.map cannot accept local or lambda functions because the multiprocessing library cannot pickle them
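
Where the pickling constraint or the process start-up cost is unwelcome, the same windowing logic can also run in a single process. The following is only a sketch under assumptions: `rolling_apply_serial` and `summarize` are hypothetical names, the data is synthetic, and the index is assumed to be a sorted DatetimeIndex. It uses the same (i - delta, i] windows as above, but finds each window's start with a binary search instead of a boolean mask per row:

```python
import numpy as np
import pandas as pd


def rolling_apply_serial(obj, fn, window):
    """Apply fn to each trailing time window of obj; one output row per index label."""
    idx = obj.index
    delta = pd.Timedelta(window)
    # For each label t, find the first position whose label is strictly after t - delta
    starts = np.searchsorted(idx.values, (idx - delta).values, side="right")
    rows = [fn(obj.iloc[lo:hi + 1]) for hi, lo in enumerate(starts)]
    return pd.DataFrame(rows, index=idx)


def summarize(window):
    # Hypothetical stand-in for fun1: returns one dict per window
    return {"a": window.max(), "b": window.min(), "c": window.max() - window.min()}


# Synthetic data, assumed for illustration only
dr = pd.date_range("2019-09-26", periods=200, freq="15min")
s = pd.Series(np.random.default_rng(0).random(len(dr)), index=dr)

result = rolling_apply_serial(s, summarize, "1D")
```

Since nothing is pickled here, fn may also be a lambda or a nested function.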
Ébe Isaac
0

You can use rolling() and apply() separately to get multiple columns. Create a Rolling object from the original DataFrame once and call .apply() on it multiple times.

For a DataFrame named 'df':

windows = df.rolling(window_size)
a_series = windows.apply(lambda x: find_a_for_single_window(x))
b_series = windows.apply(lambda x: find_b_for_single_window(x))
c_series = windows.apply(lambda x: find_c_for_single_window(x))
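
The separate results can then be combined into one DataFrame. A minimal sketch on synthetic data, where max, min and their difference stand in for the unspecified a, b and c calculations (these choices are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic data, assumed for illustration only
dr = pd.date_range("2019-09-26", periods=200, freq="15min")
s = pd.Series(np.random.default_rng(0).random(len(dr)), index=dr)

# Create the Rolling object once and reuse it for every column
windows = s.rolling("1D")

result = pd.DataFrame({
    "a": windows.max(),
    "b": windows.min(),
    # raw=True passes each window as a NumPy array
    "c": windows.apply(lambda w: w.max() - w.min(), raw=True),
})
```

The trade-off is that every column triggers its own pass over the windows, so any work shared between a, b and c is repeated.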