28

I'm looking for a way to do something like the various rolling_* functions of pandas, but I want the window of the rolling computation to be defined by a range of values (say, a range of values of a column of the DataFrame), not by the number of rows in the window.

As an example, suppose I have this data:

>>> print d
   RollBasis  ToRoll
0          1       1
1          1       4
2          1      -5
3          2       2
4          3      -4
5          5      -2
6          8       0
7         10     -13
8         12      -2
9         13      -5

If I do something like rolling_sum(d, 5), I get a rolling sum in which each window contains 5 rows. But what I want is a rolling sum in which each window contains a certain range of values of RollBasis. That is, I'd like to be able to do something like d.roll_by(sum, 'RollBasis', 5), and get a result where the first window contains all rows whose RollBasis is between 1 and 5, then the second window contains all rows whose RollBasis is between 2 and 6, then the third window contains all rows whose RollBasis is between 3 and 7, etc. The windows will not have equal numbers of rows, but the range of RollBasis values selected in each window will be the same. So the output should be like:

>>> d.roll_by(sum, 'RollBasis', 5)
    1    -4    # sum of elements with 1 <= Rollbasis <= 5
    2    -4    # sum of elements with 2 <= Rollbasis <= 6
    3    -6    # sum of elements with 3 <= Rollbasis <= 7
    4    -2    # sum of elements with 4 <= Rollbasis <= 8
    # etc.

I can't do this with groupby, because groupby always produces disjoint groups. I can't do it with the rolling functions, because their windows always roll by number of rows, not by values. So how can I do it?

BrenBarn
  • 242,874
  • 37
  • 412
  • 384

4 Answers4

21

I think this does what you want:

In [1]: df
Out[1]:
   RollBasis  ToRoll
0          1       1
1          1       4
2          1      -5
3          2       2
4          3      -4
5          5      -2
6          8       0
7         10     -13
8         12      -2
9         13      -5

In [2]: def f(x):
   ...:     ser = df.ToRoll[(df.RollBasis >= x) & (df.RollBasis < x+5)]
   ...:     return ser.sum()

The above function takes a value, in this case RollBasis, and then indexes the data frame column ToRoll based on that value. The returned series consists of ToRoll values that meet the RollBasis + 5 criterion. Finally, that series is summed and returned.

In [3]: df['Rolled'] = df.RollBasis.apply(f)

In [4]: df
Out[4]:
   RollBasis  ToRoll  Rolled
0          1       1      -4
1          1       4      -4
2          1      -5      -4
3          2       2      -4
4          3      -4      -6
5          5      -2      -2
6          8       0     -15
7         10     -13     -20
8         12      -2      -7
9         13      -5      -5

Code for the toy example DataFrame in case someone else wants to try:

In [1]: from pandas import *

In [2]: import io

In [3]: text = """\
   ...:    RollBasis  ToRoll
   ...: 0          1       1
   ...: 1          1       4
   ...: 2          1      -5
   ...: 3          2       2
   ...: 4          3      -4
   ...: 5          5      -2
   ...: 6          8       0
   ...: 7         10     -13
   ...: 8         12      -2
   ...: 9         13      -5
   ...: """

In [4]: df = read_csv(io.BytesIO(text), header=0, index_col=0, sep='\s+')
Zelazny7
  • 39,946
  • 18
  • 70
  • 84
  • Thanks, that seems to do it. I added my own answer with a more generalized version of this, but I'm accepting your answer. – BrenBarn Jan 13 '13 at 20:46
16

Based on Zelazny7's answer, I created this more general solution:

def rollBy(what, basis, window, func):
    def applyToWindow(val):
        chunk = what[(val<=basis) & (basis<val+window)]
        return func(chunk)
    return basis.apply(applyToWindow)

>>> rollBy(d.ToRoll, d.RollBasis, 5, sum)
0    -4
1    -4
2    -4
3    -4
4    -6
5    -2
6   -15
7   -20
8    -7
9    -5
Name: RollBasis

It's still not ideal as it is very slow compared to rolling_apply, but perhaps this is inevitable.

BrenBarn
  • 242,874
  • 37
  • 412
  • 384
  • 1
    This is much quicker if instead of filtering on the value of a second column, you filter on the value of an index. Pandas indexes currently support non-unique entries, so you solution could be sped up by setting the basis column as the index and then filtering on that. – Ian Sudbery Jan 16 '14 at 13:36
16

Based on BrenBarns's answer, but speeded up by using label based indexing rather than boolean based indexing:

def rollBy(what,basis,window,func,*args,**kwargs):
    #note that basis must be sorted in order for this to work properly     
    indexed_what = pd.Series(what.values,index=basis.values)
    def applyToWindow(val):
        # using slice_indexer rather that what.loc [val:val+window] allows
        # window limits that are not specifically in the index
        indexer = indexed_what.index.slice_indexer(val,val+window,1)
        chunk = indexed_what[indexer]
        return func(chunk,*args,**kwargs)
    rolled = basis.apply(applyToWindow)
    return rolled

This is much faster than not using an indexed column:

In [46]: df = pd.DataFrame({"RollBasis":np.random.uniform(0,1000000,100000), "ToRoll": np.random.uniform(0,10,100000)})

In [47]: df = df.sort("RollBasis")

In [48]: timeit("rollBy_Ian(df.ToRoll,df.RollBasis,10,sum)",setup="from __main__ import rollBy_Ian,df", number =3)
Out[48]: 67.6615059375763

In [49]: timeit("rollBy_Bren(df.ToRoll,df.RollBasis,10,sum)",setup="from __main__ import rollBy_Bren,df", number =3)
Out[49]: 515.0221037864685

Its worth noting that the index based solution is O(n), while the logical slicing version is O(n^2) in the average case (I think).

I find it more useful to do this over evenly spaced windows from the min value of Basis to the max value of Basis, rather than at every value of basis. This means altering the function thus:

def rollBy(what,basis,window,func,*args,**kwargs):
    #note that basis must be sorted in order for this to work properly
    windows_min = basis.min()
    windows_max = basis.max()
    window_starts = np.arange(windows_min, windows_max, window)
    window_starts = pd.Series(window_starts, index = window_starts)
    indexed_what = pd.Series(what.values,index=basis.values)
    def applyToWindow(val):
        # using slice_indexer rather that what.loc [val:val+window] allows
        # window limits that are not specifically in the index
        indexer = indexed_what.index.slice_indexer(val,val+window,1)
        chunk = indexed_what[indexer]
        return func(chunk,*args,**kwargs)
    rolled = window_starts.apply(applyToWindow)
    return rolled
Ian Sudbery
  • 1,708
  • 13
  • 15
  • For this to work correctly (at least in pandas 0.14), I think you need to replace chunk = indexed_what[indexer] by chunk = indexed_what.iloc[indexer]. Otherwise the slice is treated as an index range but it is a positional range. – feilchenfeldt Nov 04 '14 at 10:01
  • This does not return the same results as Bren's answer because of how they treat open/closed intervals (see result for index 5 in the example). It also doesn't allow what to be a DF instead of just a series. – Luis Feb 06 '18 at 11:11
  • Used this with datetime/timestamp and it still works. Requires a bit of massaging though. As for Luis comment about not the same results, just reduce the window int by one, as slice_indexer is inclusive on both start and end. (>=,<= vs >=,<). – Ranald Fong Mar 19 '21 at 08:34
  • Another solution to get things to work properly, is to replace `chunk = indexed_what[indexer]` with `chunk = indexed_what.index[indexer]`. I haven't done testing whether this is faster than the iloc solution by @Luis. – PrinsEdje80 Nov 09 '21 at 07:01
1

To extend the answer of @Ian Sudbury, I've extended it in such a way that one can use it directly on a dataframe by binding the method to the DataFrame class (I expect that there definitely might be some improvements on my code in speed, because I do not know how to access all internals of the class).

I've also added functionality for backward facing windows and centered windows. They only function perfectly when you're away from the edges.

import pandas as pd
import numpy as np

def roll_by(self, basis, window, func, forward=True, *args, **kwargs):
    the_indexed = pd.Index(self[basis])
    def apply_to_window(val):
        if forward == True:
            indexer = the_indexed.slice_indexer(val, val+window)
        elif forward == False:
            indexer = the_indexed.slice_indexer(val-window, val)
        elif forward == 'both':
            indexer = the_indexed.slice_indexer(val-window/2, val+window/2)
        else:
            raise RuntimeError('Invalid option for "forward". Can only be True, False, or "both".')
        chunck = self.iloc[indexer]
        return func(chunck, *args, **kwargs)
    rolled = self[basis].apply(apply_to_window)
    return rolled

pd.DataFrame.roll_by = roll_by

For the other tests, I've used the following definitions:

def rollBy_Ian_iloc(what,basis,window,func,*args,**kwargs):
    #note that basis must be sorted in order for this to work properly   
    indexed_what = pd.Series(what.values,index=basis.values)
    def applyToWindow(val):
        # using slice_indexer rather that what.loc [val:val+window] allows
        # window limits that are not specifically in the index
        indexer = indexed_what.index.slice_indexer(val,val+window,1)
        chunk = indexed_what.iloc[indexer]
        return func(chunk,*args,**kwargs)
    rolled = basis.apply(applyToWindow)
    return rolled

def rollBy_Ian_index(what,basis,window,func,*args,**kwargs):
    #note that basis must be sorted in order for this to work properly   
    indexed_what = pd.Series(what.values,index=basis.values)
    def applyToWindow(val):
        # using slice_indexer rather that what.loc [val:val+window] allows
        # window limits that are not specifically in the index
        indexer = indexed_what.index.slice_indexer(val,val+window,1)
        chunk = indexed_what[indexed_what.index[indexer]]
        return func(chunk,*args,**kwargs)
    rolled = basis.apply(applyToWindow)
    return rolled

def rollBy_Bren(what, basis, window, func):
    def applyToWindow(val):
        chunk = what[(val<=basis) & (basis<val+window)]
        return func(chunk)
    return basis.apply(applyToWindow)

Timings and tests:

df = pd.DataFrame({"RollBasis":np.random.uniform(0,100000,10000), "ToRoll": np.random.uniform(0,10,10000)}).sort_values("RollBasis")
In [14]: %timeit rollBy_Ian_iloc(df.ToRoll,df.RollBasis,10,sum)
    ...: %timeit rollBy_Ian_index(df.ToRoll,df.RollBasis,10,sum)
    ...: %timeit rollBy_Bren(df.ToRoll,df.RollBasis,10,sum)
    ...: %timeit df.roll_by('RollBasis', 10, lambda x: x['ToRoll'].sum())
    ...: 
484 ms ± 28.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.58 s ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.12 s ± 22.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.48 s ± 45.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Conclusion: the bound method is not as fast as the method by @Ian Sudbury, but not as slow as that of @BrenBarn, but it does allow for more flexibility regarding the functions one can call on them.

PrinsEdje80
  • 494
  • 4
  • 8