3

Given the following high-frequency but sparse time series:

import pandas as pd
from datetime import datetime

# Sparse time series: two 10-point millisecond bursts, 10 seconds apart
dti1 = pd.date_range(start=datetime(2015, 8, 1, 9, 0, 0), periods=10, freq='ms')
dti2 = pd.date_range(start=datetime(2015, 8, 1, 9, 0, 10), periods=10, freq='ms')
dti = dti1.append(dti2)  # concatenate the two indexes

ts = pd.Series(index=dti, data=range(20))

I can compute an exponentially weighted moving average with a halflife of 5ms using a pandas function as follows:

ema = pd.ewma(ts, halflife=5, freq='ms')

However, under the hood, the function resamples my time series at a 1 ms interval (the 'freq' that I supplied). This causes thousands of additional datapoints to be included in the output.

In [118]: len(ts)
Out[118]: 20
In [119]: len(ema)
Out[119]: 10010

This is not scalable, as my real time series contains hundreds of thousands of high-frequency observations that are minutes or hours apart.

Is there a pandas/NumPy way of computing an EMA for a sparse time series without resampling? Something similar to this: http://oroboro.com/irregular-ema/

Or, do I have to write my own? Thanks!

nickos556
  • According to the link, the formula is just a few lines. I'd just code it up if I were you -- probably as a numba function, since it doesn't look easy to vectorize. Though maybe if you can write the formula with cumsum/cumprod it would be reasonably fast? I dunno, it should be straightforward to do in numba, or I guess cython would be a good option too. – JohnE Aug 09 '15 at 21:45
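Following the comment's suggestion, here is a minimal sketch of an irregular-interval EMA in the spirit of the linked article (the simple previous-value-hold variant, where the decay between consecutive points depends on the actual time gap). The function name `irregular_ema` and its `halflife_ms` parameter are illustrative, not pandas API:

```python
import numpy as np
import pandas as pd

def irregular_ema(series, halflife_ms):
    """EMA for irregularly spaced samples.

    The decay applied between consecutive points depends on the actual
    time gap, so the series is never resampled and the output has
    exactly the same length as the input.
    """
    tau = halflife_ms / np.log(2)            # decay constant from half-life
    # gaps between consecutive timestamps, in milliseconds (asi8 is ns)
    dt = np.diff(series.index.asi8) / 1e6
    values = series.to_numpy(dtype=float)
    out = np.empty_like(values)
    out[0] = values[0]
    for i in range(1, len(values)):
        alpha = np.exp(-dt[i - 1] / tau)     # weight of the old average
        out[i] = alpha * out[i - 1] + (1 - alpha) * values[i]
    return pd.Series(out, index=series.index)
```

The Python loop could be compiled with numba as the comment suggests, but even uncompiled it touches each observation once, so the output stays at the original length instead of growing with the span of the gaps.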

1 Answer

0

You can use reindex to align the ewma result with your original series.

pd.ewma(ts, halflife=5, freq='ms').reindex(ts.index)

2015-08-01 09:00:00.000     0.0000
2015-08-01 09:00:00.001     0.5346
2015-08-01 09:00:00.002     1.0921
2015-08-01 09:00:00.003     1.6724
2015-08-01 09:00:00.004     2.2750
2015-08-01 09:00:00.005     2.8996
2015-08-01 09:00:00.006     3.5458
2015-08-01 09:00:00.007     4.2131
2015-08-01 09:00:00.008     4.9008
2015-08-01 09:00:00.009     5.6083
2015-08-01 09:00:10.000    10.0000
2015-08-01 09:00:10.001    10.5346
2015-08-01 09:00:10.002    11.0921
2015-08-01 09:00:10.003    11.6724
2015-08-01 09:00:10.004    12.2750
2015-08-01 09:00:10.005    12.8996
2015-08-01 09:00:10.006    13.5458
2015-08-01 09:00:10.007    14.2131
2015-08-01 09:00:10.008    14.9008
2015-08-01 09:00:10.009    15.6083
dtype: float64
Jianxun Li
  • Thanks for your suggestion. Yep, I know that this is possible; however, it does not fix the scalability issue, since pd.ewma() is still resampling under the hood. E.g. imagine doing this with 1 GB of input data: when it's resampled it could grow to hundreds of GB or more. – nickos556 Aug 03 '15 at 07:01
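For readers finding this later: pd.ewma has since been removed, and newer pandas versions (1.1+) address exactly this case. Series.ewm accepts a `times` argument, and with the half-life given as a timedelta the decay is computed from the actual timestamp gaps, so no resampling takes place and the output keeps the original points:

```python
import pandas as pd
from datetime import datetime

# same sparse series as in the question: two ms-bursts, 10 seconds apart
dti = pd.date_range(datetime(2015, 8, 1, 9, 0, 0), periods=10, freq='ms').append(
    pd.date_range(datetime(2015, 8, 1, 9, 0, 10), periods=10, freq='ms'))
ts = pd.Series(range(20), index=dti)

# half-life as a Timedelta plus explicit times: the weights follow the
# real gaps, and the result has the same 20 points as the input
ema = ts.ewm(halflife=pd.Timedelta('5ms'), times=ts.index).mean()
```

This avoids both the resampling blow-up and the need for a hand-rolled loop.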