Suppose you've got some unevenly spaced time series data:

import pandas as pd
import random as randy

ts = pd.Series(range(1000),
               index=randy.sample(list(pd.date_range('2013-02-01 09:00:00.000000',
                                                     periods=10**6, freq='U')),
                                  1000)).sort_index()
print(ts.head())


2013-02-01 09:00:00.002895    995
2013-02-01 09:00:00.003765    499
2013-02-01 09:00:00.003838    797
2013-02-01 09:00:00.004727    295
2013-02-01 09:00:00.006287    253

Let's say I wanted to do the rolling sum over a 1ms window to get this:

2013-02-01 09:00:00.002895    995
2013-02-01 09:00:00.003765    499 + 995
2013-02-01 09:00:00.003838    797 + 499 + 995
2013-02-01 09:00:00.004727    295 + 797 + 499
2013-02-01 09:00:00.006287    253

Currently, I cast everything back to longs and do this in Cython, but is this possible in pure pandas? I'm aware that you can do something like .asfreq('U'), fill, and then use the traditional rolling functions, but that doesn't scale once you have more than a toy number of rows; a sketch of that approach is below.
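
As a rough sketch, the upsample-and-roll idea looks something like this (filling the gaps with 0 and using the newer rolling API purely for illustration). On a microsecond grid it materializes on the order of a million rows per second of data, which is where it stops scaling:

dense = ts.asfreq('U', fill_value=0)                         # one row per microsecond between the first and last stamp
dense_sum = dense.rolling(window=1000, min_periods=1).sum()  # 1000 one-microsecond rows ~ a 1 ms window
rolled = dense_sum.reindex(ts.index)                         # keep only the original, unevenly spaced stamps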

As a point of reference, here's a hackish, not fast Cython version:

%%cython
import numpy as np
cimport cython
cimport numpy as np

ctypedef np.double_t DTYPE_t

def rolling_sum_cython(np.ndarray[long, ndim=1] times, np.ndarray[double, ndim=1] to_add, long window_size):
    cdef long t_len = times.shape[0], s_len = to_add.shape[0], i = 0, win_size = window_size, j, window_start
    cdef np.ndarray[DTYPE_t, ndim=1] res = np.zeros(t_len, dtype=np.double)
    assert t_len == s_len
    for i in range(t_len):
        # walk backwards from i, accumulating everything still inside the window
        window_start = times[i] - win_size
        j = i
        # check j >= 0 first so a wide-open window never wraps around to the end of the array
        while j >= 0 and times[j] >= window_start:
            res[i] += to_add[j]
            j -= 1
    return res

Demonstrating this on a slightly larger series:

ts = pd.Series(range(100000),
               index=randy.sample(list(pd.date_range('2013-02-01 09:00:00.000000',
                                                     periods=10**8, freq='U')),
                                  100000)).sort_index()

%%timeit
res2 = rolling_sum_cython(ts.index.astype(np.int64), ts.values.astype(np.float64), int(1e6))
1000 loops, best of 3: 1.56 ms per loop
radikalus
  • Add a column with the time rounded to the millisecond, group by it, and apply cumsum within each group – lowtech Jan 17 '14 at 23:06
  • `ts_df['millis_rounded'] = ts.index.astype(int64) / 1000000 * 1000000` `res3 = ts_df.groupby("millis_rounded")[0].agg(np.sum)` Is this what you're talking about? That just sums the entries in a particular millisecond bucket, it's not a sliding window. It's also not terribly quick. (About 10x as slow as the above cython I think) – radikalus Feb 02 '14 at 04:50

4 Answers


You can solve most problems of this sort with cumsum and binary search.

import numpy as np
from datetime import timedelta

def msum(s, lag_in_ms):
    # position of the first observation inside each row's look-back window
    lag = s.index - timedelta(milliseconds=lag_in_ms)
    inds = np.searchsorted(s.index.astype(np.int64), lag.astype(np.int64))
    cs = s.cumsum()
    # windowed sum = cumsum at t, minus the cumsum just before the window opens
    return pd.Series(cs.values - cs.values[inds] + s.values[inds], index=s.index)

res = msum(ts, 100)
print(pd.DataFrame({'a': ts, 'a_msum_100': res}))


                            a  a_msum_100
2013-02-01 09:00:00.073479  5           5
2013-02-01 09:00:00.083717  8          13
2013-02-01 09:00:00.162707  1          14
2013-02-01 09:00:00.171809  6          20
2013-02-01 09:00:00.240111  7          14
2013-02-01 09:00:00.258455  0          14
2013-02-01 09:00:00.336564  2           9
2013-02-01 09:00:00.536416  3           3
2013-02-01 09:00:00.632439  4           7
2013-02-01 09:00:00.789746  9           9

[10 rows x 2 columns]

You need a way of handling NaNs, and depending on your application, you may need the prevailing value as of the lagged time or not (i.e. the difference between using kdb+ bin and np.searchsorted).
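
For the NaN case, one rough sketch is to fill missing values with 0 before taking the cumulative sum, so a single NaN doesn't propagate through everything after it (the function name and the fill-with-zero policy are illustrative; whether 0 is the right fill depends on your application):

import numpy as np
import pandas as pd
from datetime import timedelta

def msum_nan_safe(s, lag_in_ms):
    # illustrative variant of msum above: fill NaNs with 0 so the cumsum stays finite
    filled = s.fillna(0.0)
    lag = filled.index - timedelta(milliseconds=lag_in_ms)
    inds = np.searchsorted(filled.index.astype(np.int64), lag.astype(np.int64))
    cs = filled.cumsum()
    return pd.Series(cs.values - cs.values[inds] + filled.values[inds], index=s.index)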

Hope this helps.

signalseeker

This is an old question, but for those who stumble upon it from Google: as of pandas 0.19, time-aware rolling is built in:

http://pandas.pydata.org/pandas-docs/stable/computation.html#time-aware-rolling

So to get 1 ms windows it looks like you get a Rolling object by doing

dft.rolling('1ms')

and the sum would be

dft.rolling('1ms').sum()
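
Applied to the series from the question (assuming pandas >= 0.19 and a sorted DatetimeIndex), that would be something like:

res = ts.rolling('1ms').sum()

# By default an offset window is closed only on the right, i.e. (t - 1ms, t].
# A slightly later pandas release added the closed argument; closed='both' also keeps
# a point sitting exactly 1 ms back, like the inclusive window in the question.
res_inclusive = ts.rolling('1ms', closed='both').sum()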
Kevin Wang

Perhaps it makes more sense to use rolling_sum:

pd.rolling_sum(ts, window=1, freq='1ms')
Andy Hayden
  • This does not actually work -- this resamples the data to 1 ms frequency and then just sums each bucket, yielding: 2013-02-01 09:00:00.002000 995 | 2013-02-01 09:00:00.003000 648 | 2013-02-01 09:00:00.004000 295 | 2013-02-01 09:00:00.005000 NaN | 2013-02-01 09:00:00.006000 617 – radikalus Jan 31 '13 at 22:53

How about something like this:

Create an offset for 1 ms:

In [1]: ms = pd.tseries.offsets.Milli()

Create a series of index positions the same length as your time series:

In [2]: s = pd.Series(range(len(ts)))

Apply a lambda function that looks up the timestamp at each position in ts and returns the sum of all ts entries between that timestamp minus 1 ms and the timestamp itself:

In [3]: s.apply(lambda x: ts.between_time(start_time=ts.index[x]-ms, end_time=ts.index[x]).sum())

In [4]: ts.head()
Out[4]:
2013-02-01 09:00:00.000558    348
2013-02-01 09:00:00.000647    361
2013-02-01 09:00:00.000726    312
2013-02-01 09:00:00.001012    550
2013-02-01 09:00:00.002208    758

Results of the above function:

0     348
1     709
2    1021
3    1571
4     758
Zelazny7