11

pandas.DataFrame.interpolate allows to fill missing data by interpolating neighboring values. Among the arguments it accepts, two of them seem relevant for this question: method and limit.

  • method: among other possible values, accepts "linear" and "time". The difference between them is that "linear" assumes equidistant rows and ignores the index, while "time" interpolates taking into account the time intervals defined by the index
>>> df = pd.DataFrame({'vals': [11, np.nan, np.nan, 12, np.nan, 22]},
...                   index=pd.to_datetime(['2020-02-02 11:00', '2020-02-02 11:06', '2020-02-02 11:30',
...                                         '2020-02-02 12:00', '2020-02-02 16:00', '2020-02-02 22:00']))
>>> df.assign(interp_linear=df.interpolate(method='linear'),
...           interp_time=df.interpolate(method='time'))

                     vals  interp_linear  interp_time
2020-02-02 11:00:00  11.0      11.000000         11.0
2020-02-02 11:06:00   NaN      11.333333         11.1
2020-02-02 11:30:00   NaN      11.666667         11.5
2020-02-02 12:00:00  12.0      12.000000         12.0
2020-02-02 16:00:00   NaN      17.000000         16.0
2020-02-02 22:00:00  22.0      22.000000         22.0
  • limit: allows you to limit the length of the gaps to be filled with interpolated values in terms of the number of missing values. It expects integer values and is defined as the maximum number of consecutive NaNs to fill. While this behavior is absolutely reasonable for the method="linear" case, it seems limited for the method="time" case.

Question

When filling missing values with interpolation, is there any way to limit the length of the gap to be filled by specifying the maximum temporal gap rather than the number of missing values?

Specifically, I would like a behaviour equivalent to interpolate with method='time' but filling only missing values that are not further away than a given timedelta from the last and the next non-missing values. An alternative criteria could also be filling all missing values surrounded by non-missing values that are not further away than a given timedelta. Subtly different options, but both of them good.

Something equivalent to this:

>>> df.interpolate(method='time', limit='2h')

                     vals
2020-02-02 11:00:00  11.0
2020-02-02 11:06:00  11.1
2020-02-02 11:30:00  11.5
2020-02-02 12:00:00  12.0
2020-02-02 16:00:00   NaN
2020-02-02 22:00:00  22.0
mgab
  • 3,147
  • 4
  • 19
  • 30
  • If that last gap had more intermediate times, like 12:00:00, 13:00:00. 16:00:00, 22:00:00 (12 and 22 have values, 13 and 16 are null) would you still want the 13:00:00 to be filled because it's within '2h' of 12:00:00, which has a value, or should it stay null because the entire span 12:00:00 - 22:00:00 is longer than '2h'? – ALollz Jun 26 '20 at 15:42
  • The first behavior you describe would be equivalent to what happens with `limit` in terms of number of missing values. However, for my use case I'd prefer that if the span is larger than the limit all missing values within it stay null. – mgab Jun 30 '20 at 11:07
  • Related: https://stackoverflow.com/q/30533021/3372061 – Dev-iL Apr 25 '21 at 17:20

1 Answers1

1

One can create and use a mask to pick off the values required:

df = pd.DataFrame({'vals': [11, np.nan, np.nan, 12, np.nan, 22]},
                   index=pd.to_datetime(['2020-02-02 11:00', '2020-02-02 11:06', '2020-02-02 11:30',
                                         '2020-02-02 12:00', '2020-02-02 16:00', '2020-02-02 22:00']))
df['ts'] = df.index
mask = df['ts'].diff() < datetime.timedelta(hours=2)
df["masked_interp"] = df["vals"].interpolate(method='time')[mask]
df["vals"] = df["vals"].combine_first(df["masked_interp"])

Giving:

                    vals
2020-02-02 11:00:00 11.0
2020-02-02 11:06:00 11.1
2020-02-02 11:30:00 11.5
2020-02-02 12:00:00 12.0
2020-02-02 16:00:00 NaN
2020-02-02 22:00:00 22.0
Chris Seeling
  • 606
  • 4
  • 11