Suppose you've got some unevenly spaced time series data:

import pandas as pd
import random as randy

ts = pd.Series(range(1000),
               index=randy.sample(list(pd.date_range('2013-02-01 09:00:00.000000',
                                                     periods=10**6, freq='U')),
                                  1000)).sort_index()
print(ts.head())


2013-02-01 09:00:00.002895    995
2013-02-01 09:00:00.003765    499
2013-02-01 09:00:00.003838    797
2013-02-01 09:00:00.004727    295
2013-02-01 09:00:00.006287    253

Let's say I wanted to do the rolling sum over a 1ms window to get this:

2013-02-01 09:00:00.002895    995
2013-02-01 09:00:00.003765    499 + 995
2013-02-01 09:00:00.003838    797 + 499 + 995
2013-02-01 09:00:00.004727    295 + 797 + 499
2013-02-01 09:00:00.006287    253

Currently, I cast everything back to longs and do this in Cython, but is this possible in pure pandas? I'm aware that you can do something like .asfreq('U'), fill, and then use the traditional rolling functions, but that doesn't scale once you have more than a toy number of rows; a sketch of that approach is below.
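
As a rough sketch, the upsample-and-roll idea looks something like this (filling the gaps with 0 and using the newer rolling API purely for illustration). On a microsecond grid it materializes on the order of a million rows per second of data, which is where it stops scaling:

dense = ts.asfreq('U', fill_value=0)                         # one row per microsecond between the first and last stamp
dense_sum = dense.rolling(window=1000, min_periods=1).sum()  # 1000 one-microsecond rows ~ a 1 ms window
rolled = dense_sum.reindex(ts.index)                         # keep only the original, unevenly spaced stamps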

As a point of reference, here's a hackish, not fast Cython version:

%%cython
import numpy as np
cimport cython
cimport numpy as np

ctypedef np.double_t DTYPE_t

def rolling_sum_cython(np.ndarray[long, ndim=1] times, np.ndarray[double, ndim=1] to_add, long window_size):
    cdef long t_len = times.shape[0], s_len = to_add.shape[0], i = 0, win_size = window_size, j, window_start
    cdef np.ndarray[DTYPE_t, ndim=1] res = np.zeros(t_len, dtype=np.double)
    assert t_len == s_len
    for i in range(t_len):
        # walk backwards from i, accumulating everything still inside the window
        window_start = times[i] - win_size
        j = i
        # check j >= 0 first so a wide-open window never wraps around to the end of the array
        while j >= 0 and times[j] >= window_start:
            res[i] += to_add[j]
            j -= 1
    return res

Demonstrating this on a slightly larger series:

ts = pd.Series(range(100000),
               index=randy.sample(list(pd.date_range('2013-02-01 09:00:00.000000',
                                                     periods=10**8, freq='U')),
                                  100000)).sort_index()

%%timeit
res2 = rolling_sum_cython(ts.index.astype(np.int64), ts.values.astype(np.float64), int(1e6))
1000 loops, best of 3: 1.56 ms per loop
radikalus
  • Add a column with the time rounded to the millisecond, group by it, and apply cumsum within each group – lowtech Jan 17 '14 at 23:06
  • `ts_df['millis_rounded'] = ts.index.astype(int64) / 1000000 * 1000000` `res3 = ts_df.groupby("millis_rounded")[0].agg(np.sum)` Is this what you're talking about? That just sums the entries in a particular millisecond bucket, it's not a sliding window. It's also not terribly quick. (About 10x as slow as the above cython I think) – radikalus Feb 02 '14 at 04:50

4 Answers


You can solve most problems of this sort with cumsum and binary search.

import numpy as np
from datetime import timedelta

def msum(s, lag_in_ms):
    # position of the first observation inside each row's look-back window
    lag = s.index - timedelta(milliseconds=lag_in_ms)
    inds = np.searchsorted(s.index.astype(np.int64), lag.astype(np.int64))
    cs = s.cumsum()
    # windowed sum = cumsum at t, minus the cumsum just before the window opens
    return pd.Series(cs.values - cs.values[inds] + s.values[inds], index=s.index)

res = msum(ts, 100)
print(pd.DataFrame({'a': ts, 'a_msum_100': res}))


                            a  a_msum_100
2013-02-01 09:00:00.073479  5           5
2013-02-01 09:00:00.083717  8          13
2013-02-01 09:00:00.162707  1          14
2013-02-01 09:00:00.171809  6          20
2013-02-01 09:00:00.240111  7          14
2013-02-01 09:00:00.258455  0          14
2013-02-01 09:00:00.336564  2           9
2013-02-01 09:00:00.536416  3           3
2013-02-01 09:00:00.632439  4           7
2013-02-01 09:00:00.789746  9           9

[10 rows x 2 columns]

You need a way of handling NaNs, and depending on your application, you may need the prevailing value as of the lagged time or not (i.e. the difference between using kdb+ bin and np.searchsorted).
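
For the NaN case, one rough sketch is to fill missing values with 0 before taking the cumulative sum, so a single NaN doesn't propagate through everything after it (the function name and the fill-with-zero policy are illustrative; whether 0 is the right fill depends on your application):

import numpy as np
import pandas as pd
from datetime import timedelta

def msum_nan_safe(s, lag_in_ms):
    # illustrative variant of msum above: fill NaNs with 0 so the cumsum stays finite
    filled = s.fillna(0.0)
    lag = filled.index - timedelta(milliseconds=lag_in_ms)
    inds = np.searchsorted(filled.index.astype(np.int64), lag.astype(np.int64))
    cs = filled.cumsum()
    return pd.Series(cs.values - cs.values[inds] + filled.values[inds], index=s.index)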

Hope this helps.

signalseeker

This is an old question, but for those who stumble upon it from Google: as of pandas 0.19, time-aware rolling is built in:

http://pandas.pydata.org/pandas-docs/stable/computation.html#time-aware-rolling

So to get 1 ms windows it looks like you get a Rolling object by doing

dft.rolling('1ms')

and the sum would be

dft.rolling('1ms').sum()
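
Applied to the series from the question (assuming pandas >= 0.19 and a sorted DatetimeIndex), that would be something like:

res = ts.rolling('1ms').sum()

# By default an offset window is closed only on the right, i.e. (t - 1ms, t].
# A slightly later pandas release added the closed argument; closed='both' also keeps
# a point sitting exactly 1 ms back, like the inclusive window in the question.
res_inclusive = ts.rolling('1ms', closed='both').sum()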
Kevin Wang

Perhaps it makes more sense to use rolling_sum:

pd.rolling_sum(ts, window=1, freq='1ms')
Andy Hayden
  • This does not actually work -- this resamples the data to 1 ms frequency and then just sums each bucket, yielding: 2013-02-01 09:00:00.002000 995 | 2013-02-01 09:00:00.003000 648 | 2013-02-01 09:00:00.004000 295 | 2013-02-01 09:00:00.005000 NaN | 2013-02-01 09:00:00.006000 617 – radikalus Jan 31 '13 at 22:53

How about something like this:

Create an offset for 1 ms:

In [1]: ms = pd.tseries.offsets.Milli()

Create a series of index positions the same length as your time series:

In [2]: s = pd.Series(range(len(ts)))

Apply a lambda function that looks up the timestamp at each position in ts and returns the sum of all ts entries between that timestamp minus 1 ms and the timestamp itself:

In [3]: s.apply(lambda x: ts.between_time(start_time=ts.index[x]-ms, end_time=ts.index[x]).sum())

In [4]: ts.head()
Out[4]:
2013-02-01 09:00:00.000558    348
2013-02-01 09:00:00.000647    361
2013-02-01 09:00:00.000726    312
2013-02-01 09:00:00.001012    550
2013-02-01 09:00:00.002208    758

Results of the above function:

0     348
1     709
2    1021
3    1571
4     758
Zelazny7