
I am working with two pandas Series with Timestamps as indices. One Series is a coarse model with a fixed frequency; the other holds data with no fixed frequency. I would like to subtract the model from the data, (linearly or spline-)interpolating the model's values at the data's timestamps.

Here is an example:

import numpy as np
import pandas as pd


# generate model with fixed freq
model = pd.Series(range(5),
                  index=pd.date_range('2017-06-19T12:05:00',
                                      '2017-06-19T12:25:00', freq='5min'))

# generate data and add more_data to make the frequency irregular
data = pd.Series(np.arange(10) + 0.3,
                 index=pd.date_range('2017-06-19T12:06:00',
                                     '2017-06-19T12:24:00', freq='2min'))
more_data = pd.Series([-10, -20], index=[pd.Timestamp('2017-06-19T12:07:35'),
                                         pd.Timestamp('2017-06-19T12:09:10')])
# Series.append was removed in pandas 2.0; pd.concat does the same job
data = pd.concat([data, more_data]).sort_index()

I tried

data - model.interpolate()[data.index]

but that only gives me non-NaN values where the timestamps of the model and the data overlap. (In recent pandas versions, indexing with labels that are missing from the index raises a KeyError instead.)

I understand that I could resample the data to fit the frequency of the model, but I want the data minus the model at the original timestamps of the data.
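To make the failure mode concrete, here is a small self-contained sketch. It uses `reindex` (since plain label indexing with missing keys raises a KeyError in recent pandas) and `pd.concat` in place of the removed `Series.append`:

```python
import numpy as np
import pandas as pd

# rebuild the example from above
model = pd.Series(range(5),
                  index=pd.date_range('2017-06-19T12:05:00',
                                      '2017-06-19T12:25:00', freq='5min'))
data = pd.Series(np.arange(10) + 0.3,
                 index=pd.date_range('2017-06-19T12:06:00',
                                     '2017-06-19T12:24:00', freq='2min'))
more_data = pd.Series([-10.0, -20.0],
                      index=[pd.Timestamp('2017-06-19T12:07:35'),
                             pd.Timestamp('2017-06-19T12:09:10')])
data = pd.concat([data, more_data]).sort_index()

# model has no NaNs, so interpolate() changes nothing; reindexing onto
# data's timestamps then yields NaN wherever the two indices differ
attempt = data - model.interpolate().reindex(data.index)
```

Only the two timestamps shared by both indices (12:10 and 12:20) survive; the other ten entries are NaN.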

frankundfrei

2 Answers


So with the help of this answer I figured out the solution to my problem, interpolating only at the points actually needed:

First, generate a Series of NaNs with the timestamps of data:

na = pd.Series(np.nan, index=data.index)  # use np.nan rather than None so the Series has float dtype

and combine this with the model:

combi = model.combine_first(na)

This Series can now be interpolated and subtracted from the data:

(data - combi.interpolate(method='time'))[data.index]

or as a one-liner

(data - model.combine_first(pd.Series(np.nan, index=data.index)).interpolate(method='time'))[data.index]
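As a self-contained check of this approach (same example data as in the question, with `pd.concat` standing in for the removed `Series.append` and `np.nan` instead of `None`):

```python
import numpy as np
import pandas as pd

model = pd.Series(range(5),
                  index=pd.date_range('2017-06-19T12:05:00',
                                      '2017-06-19T12:25:00', freq='5min'))
data = pd.Series(np.arange(10) + 0.3,
                 index=pd.date_range('2017-06-19T12:06:00',
                                     '2017-06-19T12:24:00', freq='2min'))
more_data = pd.Series([-10.0, -20.0],
                      index=[pd.Timestamp('2017-06-19T12:07:35'),
                             pd.Timestamp('2017-06-19T12:09:10')])
data = pd.concat([data, more_data]).sort_index()

# union the indices: model values where the model is defined,
# NaN placeholders at every timestamp of data
combi = model.combine_first(pd.Series(np.nan, index=data.index))

# time-weighted interpolation fills the placeholders, then the
# subtraction is evaluated only at data's original timestamps
result = (data - combi.interpolate(method='time'))[data.index]
```

At 12:07:35, for instance, the model interpolates to 155 s / 300 s ≈ 0.5167 between the 12:05 and 12:10 model points, so the difference is ≈ −10.5167.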
frankundfrei
    I like this a lot. Only thing I'd add, to both mine, which I might edit, and here, is adding `method='time'` as an argument to `interpolate`, so that the interpolation actually uses these datetime indices we've been so careful to preserve. – EFT Jun 20 '17 at 13:11
    Looked into it further, and really as long as any option but the default `method='linear'` is used, it behaves properly for both approaches. – EFT Jun 20 '17 at 13:39
  • With combine_first(na), I get "ValueError: Must specify axis=0 or 1" – S. Jessen Jun 30 '21 at 11:54

Idea:

You could find the gcd of the values in the index of data in nanoseconds, then resample the model to fit the frequency of the data.

Method:

Construct a gcd function for numpy arrays using a method found here, and feed it data.index.astype(np.int64):

import math

divisor = np.ufunc.reduce(np.frompyfunc(math.gcd, 2, 1),
                          data.index.astype(np.int64))
divisor
Out[91]: 5000000000

Then resample model and proceed as before:

data - model.resample(str(divisor)+'ns').interpolate(method='time')[data.index]
Out[61]: 
2017-06-19 12:06:00     0.100000
2017-06-19 12:07:35   -10.516667
2017-06-19 12:08:00     0.700000
2017-06-19 12:09:10   -20.833333
2017-06-19 12:10:00     1.300000
2017-06-19 12:12:00     1.900000
2017-06-19 12:14:00     2.500000
2017-06-19 12:16:00     3.100000
2017-06-19 12:18:00     3.700000
2017-06-19 12:20:00     4.300000
2017-06-19 12:22:00     4.900000
2017-06-19 12:24:00     5.500000
dtype: float64
EFT
  • This works for my current dataset, though the resampling is quite slow. I also ran into a MemoryError for larger datasets. Iterating over chunks of the Series wouldn't be elegant but should work. – frankundfrei Jun 20 '17 at 09:28