21

I am using Python Pandas for the first time. I have 5-min lag traffic data in csv format:

...
2015-01-04 08:29:05,271238
2015-01-04 08:34:05,329285
2015-01-04 08:39:05,-1
2015-01-04 08:44:05,260260
2015-01-04 08:49:05,263711
...

There are several issues:

  • for some timestamps there's missing data (-1)
  • missing entries (also 2/3 consecutive hours)
  • the frequency of the observations is not exactly 5 minutes, but actually loses some seconds once in a while

I would like to obtain a regular time series, so with entries every (exactly) 5 minutes (and no missing valus). I have successfully interpolated the time series with the following code to approximate the -1 values with this code:

ts = pd.TimeSeries(values, index=timestamps)
ts.interpolate(method='cubic', downcast='infer')

How can I both interpolate and regularize the frequency of the observations? Thank you all for the help.

riccamini
  • 1,161
  • 1
  • 13
  • 29

1 Answers1

29

Change the -1s to NaNs:

ts[ts==-1] = np.nan

Then resample the data to have a 5 minute frequency.

ts = ts.resample('5T')

Note that, by default, if two measurements fall within the same 5 minute period, resample averages the values together.

Finally, you could linearly interpolate the time series according to the time:

ts = ts.interpolate(method='time')

Since it looks like your data already has roughly a 5-minute frequency, you might need to resample at a shorter frequency so cubic or spline interpolation can smooth out the curve:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

values = [271238, 329285, -1, 260260, 263711]
timestamps = pd.to_datetime(['2015-01-04 08:29:05',
                             '2015-01-04 08:34:05',
                             '2015-01-04 08:39:05',
                             '2015-01-04 08:44:05',
                             '2015-01-04 08:49:05'])

ts = pd.Series(values, index=timestamps)
ts[ts==-1] = np.nan
ts = ts.resample('T').mean()

ts.interpolate(method='spline', order=3).plot()
ts.interpolate(method='time').plot()
lines, labels = plt.gca().get_legend_handles_labels()
labels = ['spline', 'time']
plt.legend(lines, labels, loc='best')
plt.show()

enter image description here

unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • Thank you it works perfectly! There's a way I can instead first add the regular 5 minutes timestamps to the series with nan as values, and then interpolate them with an order 3 spline? – riccamini May 29 '15 at 15:59
  • I don't understand what you mean by "add the regular 5 minutes timestamps to the series with nan as values", but I've added an example showing interpolation by `time` and with order-3 splines. – unutbu May 29 '15 at 16:52
  • I mean, there would be any difference if instead linearly interpolate the time series with **resample()**, we build first a time series with the plain entries in the data, then we add entries of the type 2015-01-01-08:00, nan 2015-01-01-08:05, nan and so on, and finally interpolate them with an order 3 spline? – riccamini May 29 '15 at 17:22
  • 1
    The resampling is done *before* and independent of the interpolation. You don't have to interpolate linearly. Just remove the line `ts.interpolate(method='time').plot()` and the associated `time` from `labels`. Then the above code interpolates the data with an order-3 spline alone. – unutbu May 29 '15 at 19:51
  • this answer is super helpful, more than I can express with a +1 – Tommy Mar 23 '16 at 21:08
  • For my monthly data, I had use "ts = pd.Series(values, index=timestamps, dtype=float)" for it to work. – user14241 Nov 28 '16 at 20:01
  • OK, so the 'T' in `ts.resample(rule='T',...)` comes from the aliases in http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases. The `how='mean'` speaks to the new Series datetime index is calculated from the source data's time index. And the actual interpolation of the data at the times is done in the `.interpolate(...)` that is passed to the `.plot()` method and then thrown away. -- I find this procedure darn had to discover from the docs. I guess the general method is to induce NANs by the resampling the time index, and then fill in the NANs with the interpolate()? – Dave X Mar 23 '18 at 13:59
  • @DaveX: Yes! You've got it. – unutbu Mar 23 '18 at 15:15
  • 1
    `how='mean'` tells `resample` how to *aggregate* values if multiple rows fall within the same period (in this case `T` indicates that the period frequency is 1 minute long.) – unutbu Mar 23 '18 at 15:18
  • `Series.resample` also does missing-value fill-ins for upsampling with `fill_method={'pad'|'fill'|'bfill'}` but that parameter won't take the interpolation methods, and would interfere with a later interpolation since the np.nans are filled in. – Dave X Mar 23 '18 at 18:58
  • Is there a way to interpolate a time series onto a different, arbitrary set of potentially irregular times with this? – Dave X Mar 23 '18 at 19:07
  • 2
    @DaveX: Certainly. Instead of `resample`-ing the Series, use [`reindex`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.reindex.html) to add new rows with `NaN` values. Then calling `Series.interpolate(method='time')` will fill in the missing values with interpolated values. – unutbu Mar 23 '18 at 19:17
  • 2
    Or you could use `combine_first` as [HYRY shows here](https://stackoverflow.com/a/23772571/190597). (As I think about it,`combine_first` may be the better solution since you wouldn't have to union the old index with a new index as you would if you use `reindex`...) – unutbu Mar 23 '18 at 19:21