I have a time series in pandas that looks like this:

                     Values
1992-08-27 07:46:48    28.0  
1992-08-27 08:00:48    28.2  
1992-08-27 08:33:48    28.4  
1992-08-27 08:43:48    28.8  
1992-08-27 08:48:48    29.0  
1992-08-27 08:51:48    29.2  
1992-08-27 08:53:48    29.6  
1992-08-27 08:56:48    29.8  
1992-08-27 09:03:48    30.0

I would like to resample it to a regular time series with 15-minute time steps where the values are linearly interpolated. Basically, I would like to get:

                     Values
1992-08-27 08:00:00    28.2  
1992-08-27 08:15:00    28.3  
1992-08-27 08:30:00    28.4  
1992-08-27 08:45:00    28.8  
1992-08-27 09:00:00    29.9

However, using the resample method (df.resample('15Min')) from pandas, I get:

                     Values
1992-08-27 08:00:00   28.20  
1992-08-27 08:15:00     NaN  
1992-08-27 08:30:00   28.60  
1992-08-27 08:45:00   29.40  
1992-08-27 09:00:00   30.00  

I have tried the resample method with different how and fill_method parameters but never got exactly the results I wanted. Am I using the wrong method?
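
For completeness, a minimal sketch to rebuild the example data and reproduce the NaN behaviour (recent pandas needs an explicit aggregation such as .mean() after .resample()):

import pandas as pd

# rebuild the example series from the question
df = pd.DataFrame(
    {'Values': [28.0, 28.2, 28.4, 28.8, 29.0, 29.2, 29.6, 29.8, 30.0]},
    index=pd.to_datetime([
        '1992-08-27 07:46:48', '1992-08-27 08:00:48', '1992-08-27 08:33:48',
        '1992-08-27 08:43:48', '1992-08-27 08:48:48', '1992-08-27 08:51:48',
        '1992-08-27 08:53:48', '1992-08-27 08:56:48', '1992-08-27 09:03:48',
    ]),
)

# 15-minute bins that contain no observations come out as NaN, as shown above
print(df.resample('15min').mean())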


4 Answers


You can do this with traces. First, create a TimeSeries with your irregular measurements like you would a dictionary:

from datetime import datetime, timedelta
import traces

ts = traces.TimeSeries([
    (datetime(1992, 8, 27, 7, 46, 48), 28.0),
    (datetime(1992, 8, 27, 8, 0, 48), 28.2),
    ...
    (datetime(1992, 8, 27, 9, 3, 48), 30.0),
])

Then regularize using the sample method:

ts.sample(
    sampling_period=timedelta(minutes=15),
    start=datetime(1992, 8, 27, 8),
    end=datetime(1992, 8, 27, 9),
    interpolate='linear',
)

This results in the regularized version shown in the plot below, where the gray dots are the original data and the orange dots are the regularized series with linear interpolation.

time series with linear interpolation

The interpolated values are:

1992-08-27 08:00:00    28.189 
1992-08-27 08:15:00    28.286  
1992-08-27 08:30:00    28.377
1992-08-27 08:45:00    28.848
1992-08-27 09:00:00    29.891
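
If you need the result back in pandas, here is a small hedged sketch: per the traces documentation, sample() returns a list of (timestamp, value) pairs, so they can be fed straight into a Series (the variable names are illustrative):

import pandas as pd

sampled = ts.sample(
    sampling_period=timedelta(minutes=15),
    start=datetime(1992, 8, 27, 8),
    end=datetime(1992, 8, 27, 9),
    interpolate='linear',
)
regular = pd.Series(dict(sampled))  # Series indexed by the 15-minute timestamps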
mstringer
  • thanks, I made it with [xmgrace](http://plasma-gate.weizmann.ac.il/Grace/) — old school :) – mstringer Sep 27 '16 at 17:26
  • @mstringer This method is a god send! Thank you for sharing this with us! – 3kstc Nov 06 '16 at 23:23
  • @mstringer Thank you for traces! This method is really helpful for unevenly spaced time series. Can you please tell me how did you make the above graph? Did you use xmgrace? Are you aware of any libraries which will help me recreate the above in Python? – Shefali Agarwal Jul 07 '20 at 20:25

The same result that @mstringer gets can be achieved purely in pandas. The trick is to first resample by second, using interpolation to fill in the intermediate values (.resample('s').interpolate()), and then downsample to 15-minute periods (.resample('15T').asfreq()).

import io
import pandas as pd

data = io.StringIO('''\
Values
1992-08-27 07:46:48,28.0  
1992-08-27 08:00:48,28.2  
1992-08-27 08:33:48,28.4  
1992-08-27 08:43:48,28.8  
1992-08-27 08:48:48,29.0  
1992-08-27 08:51:48,29.2  
1992-08-27 08:53:48,29.6  
1992-08-27 08:56:48,29.8  
1992-08-27 09:03:48,30.0
''')
s = pd.read_csv(data).squeeze('columns')
s.index = pd.to_datetime(s.index)

res = s.resample('s').interpolate().resample('15T').asfreq().dropna()
print(res)

Output:

1992-08-27 08:00:00    28.188571
1992-08-27 08:15:00    28.286061
1992-08-27 08:30:00    28.376970
1992-08-27 08:45:00    28.848000
1992-08-27 09:00:00    29.891429
Freq: 15T, Name: Values, dtype: float64
Alicia Garcia-Raboso
  • Inefficient but still clever and useful. – Denziloe Jul 10 '17 at 10:46
  • can i ask what to do in the case that i have a "string" column and I would like to duplicate the value for 8:00:00 for the time period 8:00:00 to 8:45:00 to the in-between values 8:15:00 and 8:30:00? – experiment unit 1998X Dec 24 '21 at 10:43
  • You're using seconds because that's the finest resolution of the input data, right? I'm just thinking, for a dataset where the finest resolution is hours, resampling to seconds would be unnecessary. – wjandrea Oct 14 '22 at 19:52
  • @experimentunit1998X If I understand correctly, you're looking for [`.ffill()`](https://pandas.pydata.org/docs/reference/api/pandas.core.resample.Resampler.ffill.html), like `s.resample('s').ffill().resample('15T')...` – wjandrea Oct 14 '22 at 19:55
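
For the string-column case raised above, a hedged sketch of that forward-fill variant (assuming a string-valued series named labels with the same kind of DatetimeIndex as s):

# carry each label forward to the second, then pick out the 15-minute marks
labels_15min = labels.resample('s').ffill().resample('15T').asfreq().dropna()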

It takes a bit of work, but try this out. The basic idea is to find the two timestamps closest to each resample point and interpolate between them; np.searchsorted is used to find the dates closest to each resample point.

import numpy as np
import pandas as pd

# empty frame with desired index
rs = pd.DataFrame(index=df.resample('15min').asfreq().iloc[1:].index)

# array of indexes corresponding with closest timestamp after resample
idx_after = np.searchsorted(df.index.values, rs.index.values)

# values and timestamp before/after resample
rs['after'] = df.loc[df.index[idx_after], 'Values'].values
rs['before'] = df.loc[df.index[idx_after - 1], 'Values'].values
rs['after_time'] = df.index[idx_after]
rs['before_time'] = df.index[idx_after - 1]

#calculate new weighted value
rs['span'] = (rs['after_time'] - rs['before_time'])
rs['after_weight'] = (rs['after_time'] - rs.index) / rs['span']
# I got errors here unless I turn the index to a series
rs['before_weight'] = (pd.Series(data=rs.index, index=rs.index) - rs['before_time']) / rs['span']

rs['Values'] = rs.eval('before * before_weight + after * after_weight')

After all that, hopefully the right answer:

In [161]: rs['Values']
Out[161]: 
1992-08-27 08:00:00    28.011429
1992-08-27 08:15:00    28.313939
1992-08-27 08:30:00    28.223030
1992-08-27 08:45:00    28.952000
1992-08-27 09:00:00    29.908571
Freq: 15T, Name: Values, dtype: float64
chrisb
  • Great! I've just changed the last line to: rs['Values'] = rs.eval('after * before_weight + before * after_weight') and it is now doing the linear interpolation like I wanted. Thank you. – Diane Aug 12 '14 at 03:19
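
In other words, with the swap from that comment applied, the final line of the snippet would read:

rs['Values'] = rs.eval('after * before_weight + before * after_weight')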

I recently had to resample acceleration data that was non-uniformly sampled. It was generally sampled at the correct frequency, but intermittently had delays that accumulated.

I found this question and combined mstringer's and Alberto Garcia-Rabosco's answers using pure pandas and numpy. This method creates a new index at the desired frequency and then interpolates, without the intermediate step of interpolating at a higher frequency.

# from Alberto Garcia-Rabosco above
import io
import pandas as pd

data = io.StringIO('''\
Values
1992-08-27 07:46:48,28.0  
1992-08-27 08:00:48,28.2  
1992-08-27 08:33:48,28.4  
1992-08-27 08:43:48,28.8  
1992-08-27 08:48:48,29.0  
1992-08-27 08:51:48,29.2  
1992-08-27 08:53:48,29.6  
1992-08-27 08:56:48,29.8  
1992-08-27 09:03:48,30.0
''')
s = pd.read_csv(data).squeeze('columns')
s.index = pd.to_datetime(s.index)

Code to do the interpolation:

import numpy as np
# create the new index and a new series full of NaNs
new_index = pd.DatetimeIndex(start='1992-08-27 08:00:00', 
    freq='15 min', periods=5, yearfirst=True)
new_series = pd.Series(np.nan, index=new_index)

# concat the old and new series and remove duplicates (if any) 
comb_series = pd.concat([s, new_series])
comb_series = comb_series[~comb_series.index.duplicated(keep='first')]

# interpolate to fill the NaNs
comb_series.interpolate(method='time', inplace=True)

Output:

>>> print(comb_series[new_index])
1992-08-27 08:00:00    28.188571
1992-08-27 08:15:00    28.286061
1992-08-27 08:30:00    28.376970
1992-08-27 08:45:00    28.848000
1992-08-27 09:00:00    29.891429
Freq: 15T, dtype: float64

As before, you can use whatever interpolation method scipy supports, and this technique works with DataFrames as well (that is what I originally used it for). Finally, note that interpolate defaults to the 'linear' method, which ignores the time information in the index and will not work correctly with non-uniformly spaced data.
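
A hedged sketch of the same technique applied to a DataFrame, using the s and new_index objects built above (the second column is made up purely for illustration):

# two-column frame derived from the example series
df2 = pd.DataFrame({'Values': s, 'Values_plus_1': s + 1.0})

# append the target timestamps as empty rows, drop duplicates, sort, and interpolate by time
comb = pd.concat([df2, pd.DataFrame(index=new_index)])
comb = comb[~comb.index.duplicated(keep='first')].sort_index()
print(comb.interpolate(method='time').loc[new_index])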

BE-Bob
  • I got `E TypeError: __new__() got an unexpected keyword argument 'start'` with the `pd.DatetimeIndex(start='1992-08-27 08:00:00'` line. I found that using `pd.date_range()` instead worked - Python 3.8 and pandas 1.3.5 – Clare Macrae Jan 21 '22 at 18:56
  • Just tried this. After concatenation the index was not monotone but I was pleasantly surprised that the interpolation did indeed solve the problem correctly. – Tunneller Sep 28 '22 at 17:28