2

I am using Pandas to interpolate datapoints in time, however when resampling and interpolating, I get different results for the same interpolated time when using different resampling rates.

Here is a test example:

import pandas as pd
import datetime

data = pd.DataFrame({'time': list(map(lambda a: datetime.datetime.strptime(a, '%Y-%m-%d %H:%M:%S'),
                                                ['2021-03-28 12:00:00', '2021-03-28 12:01:40',
                                                 '2021-03-28 12:03:20', '2021-03-28 12:05:00',
                                                 '2021-03-28 12:06:40', '2021-03-28 12:08:20',
                                                 '2021-03-28 12:10:00', '2021-03-28 12:11:40',
                                                 '2021-03-28 12:13:20', '2021-03-28 12:15:00'])),
                     'latitude': [44.0, 44.00463175663968, 44.00919766508212,
                                  44.01357245844425, 44.0176360866699, 44.02127701531401,
                                  44.02439529286458, 44.02690530159084, 44.02873811544965,
                                  44.02984339933479],
                     'longitude': [-62.75, -62.74998054893869, -62.748902164559304,
                                   -62.74679419470262, -62.7437142666763, -62.739746727555016,
                                   -62.735000345048086, -62.72960533041183, -62.72370976436673,
                                   -62.717475524320704]})

data.set_index('time', inplace=True)

a = data.resample('20s').interpolate(method='time')
b = data.resample('60s').interpolate(method='time')

print(a.iloc[:18:3])
print(b.iloc[:6])

# --- OUTPUT --- #

                      latitude  longitude
time                                     
2021-03-28 12:00:00  44.000000 -62.750000
2021-03-28 12:01:00  44.002779 -62.749988  # <-- Different Values
2021-03-28 12:02:00  44.005545 -62.749765  # <-- Different Values
2021-03-28 12:03:00  44.008284 -62.749118  # <-- Different Values
2021-03-28 12:04:00  44.010948 -62.748059  # <-- Different Values
2021-03-28 12:05:00  44.013572 -62.746794
                      latitude  longitude
time                                     
2021-03-28 12:00:00  44.000000 -62.750000
2021-03-28 12:01:00  44.002714 -62.749359  # <-- Different Values
2021-03-28 12:02:00  44.005429 -62.748718  # <-- Different Values
2021-03-28 12:03:00  44.008143 -62.748077  # <-- Different Values
2021-03-28 12:04:00  44.010858 -62.747435  # <-- Different Values
2021-03-28 12:05:00  44.013572 -62.746794 

The a dataframe and b dataframe should predict the same value on the minute, however in most cases they differ at this time.

Does anyone know what could be causing this? When plotting the full results, it looks like resampling on the minute causes pandas to ignore data in timestamps that are not on the minute (12:01:40 and 12:03:20 for example).

  • 1
    compare `data.resample('60s').asfreq()` to `data.resample('20s').asfreq()` to see which points are used for interpolation. All your input sample fits into the 20 s grid but only few points fit in the 60s grid. – FObersteiner Apr 06 '21 at 11:57
  • Perhaps I'm just using the tools incorrectly. My hope it to use *all* of the data to upsample to 20s and 60s respectively. Is there a way to tell Pandas to resample at a particular time interval, but *also* interpolate the data at that time based on the original dataframe? Linear interpolation is 'good enough' for me right now. – mystery_lektro Apr 06 '21 at 12:05
  • 1
    I wouldn't blame you using the tools wrongly - it isn't obvious what's happening if you ask me! I think [option #1 here](https://stackoverflow.com/a/47148740/10197418) does the trick - reindex and interpolate instead of resample. `numpy.interp` also works as [shown here](https://stackoverflow.com/a/54276378/10197418). – FObersteiner Apr 06 '21 at 12:12
  • That did the trick! Thank you so much for your help :) If you post this as an answer, I will accept it for others to quickly see. – mystery_lektro Apr 06 '21 at 12:27

1 Answers1

3

A wrap-up of my comments and some explanation:

You can observe what's happening if you compare data.resample('60s').asfreq() to data.resample('20s').asfreq(). While all of the sample data fits into the 20s grid, only few values remain in the 60s grid. pandas resample interpolate is producing NaNs basically describes the problem.

The point is, pandas resamples and then interpolates. If resampling leads to loss of data, those data is not available for interpolation. If you want to make use of all the data you have initially, you'll want to interpolate and then reset the index. You can do so like

# let's create new indices, the desired index...
new_index_20s = pd.date_range(data.index.min(), data.index.max(), freq='20s')
# and a helper for interpolation; the combination of existing and desired index
tmp_index_20s = data.index.union(new_index_20s)

new_index_60s = pd.date_range(data.index.min(), data.index.max(), freq='60s')
tmp_index_60s = data.index.union(new_index_20s)

# re-index to the helper index,
# interpolate,
# and re-index to desired index 
a1 = data.reindex(tmp_index_20s).interpolate('index').reindex(new_index_20s)
b1 = data.reindex(tmp_index_60s).interpolate('index').reindex(new_index_60s)

Now you have agreement in the resulting time series:

print(a1.iloc[:18:3])
print(b1.iloc[:6])
                      latitude  longitude
2021-03-28 12:00:00  44.000000 -62.750000
2021-03-28 12:01:00  44.002779 -62.749988
2021-03-28 12:02:00  44.005545 -62.749765
2021-03-28 12:03:00  44.008284 -62.749118
2021-03-28 12:04:00  44.010948 -62.748059
2021-03-28 12:05:00  44.013572 -62.746794
                      latitude  longitude
2021-03-28 12:00:00  44.000000 -62.750000
2021-03-28 12:01:00  44.002779 -62.749988
2021-03-28 12:02:00  44.005545 -62.749765
2021-03-28 12:03:00  44.008284 -62.749118
2021-03-28 12:04:00  44.010948 -62.748059
2021-03-28 12:05:00  44.013572 -62.746794
FObersteiner
  • 22,500
  • 8
  • 42
  • 72