1

I am looking fopr a way to interpolate only over short gaps in a Pandas DataFrame that has a DateTimeIndex. Long gaps should be kept as they are.

df = pd.DataFrame(
    { "value": [ 1, np.nan, 3, np.nan, np.nan, 5, np.nan, 11, np.nan, 21, np.nan, 41 ] },
    index=pd.to_datetime( [ 
        "2021-01-01 00:00", "2021-01-01 00:05", "2021-01-01 00:10",
        "2021-01-01 00:11", "2021-01-01 00:13", "2021-01-01 00:14",
        "2021-01-01 00:15", "2021-01-01 01:30", "2021-01-01 03:00",
        "2021-01-01 04:00", "2021-01-01 05:45", "2021-01-01 06:45",
    ] )
)
                     value
2021-01-01 00:00:00    1.0
2021-01-01 00:05:00    NaN
2021-01-01 00:10:00    3.0
2021-01-01 00:11:00    NaN
2021-01-01 00:13:00    NaN
2021-01-01 00:14:00    5.0
2021-01-01 00:15:00    NaN
2021-01-01 01:30:00   11.0
2021-01-01 03:00:00    NaN
2021-01-01 04:00:00   21.0
2021-01-01 05:45:00    NaN
2021-01-01 06:45:00   41.0

The idea is to keep gaps that are longer than a certain time (>5 minutes in this case), but interpolate all missing values within shorter gaps.

interpolate() has a limit argument that limits the number of missing values to be interpolated, but this does not respect the time delta between the rows, only the number of rows.

I would like the result to be like this:

                         value
2021-01-01 00:00:00   1.000000
2021-01-01 00:05:00   2.000000
2021-01-01 00:10:00   3.000000
2021-01-01 00:11:00   3.500000
2021-01-01 00:13:00   4.500000
2021-01-01 00:14:00   5.000000
2021-01-01 00:15:00        NaN
2021-01-01 01:30:00  11.000000
2021-01-01 03:00:00        NaN
2021-01-01 04:00:00  21.000000
2021-01-01 05:45:00        NaN
2021-01-01 06:45:00  41.000000
leviathan
  • 363
  • 3
  • 10
  • Shouldn't the gap between 00:10:00 and 01:30:00 (80 minutes - which includes minute 11 and 13) be bypassed? Though from those two entries it appears to me that the interpolation needs to happens with respect to the amount of time that has passed and the length of the time gap - not just by number of entries. True? – jch Jun 29 '21 at 22:58
  • You are correct. I edited the example. – leviathan Jun 30 '21 at 05:44

1 Answers1

1

This solution fills value gaps that are in time spans that are less than a specified value. The filled values are set proportionally to the entry's position within the value gap's time span (time-interpolated values). Julian dates are used for easier computation.

Set max time span gap to fill with time-interpolated values. 5 minutes.

jd_max_gap_fill = 5/(60*24)

Calculate the value gap:

df['ffill'] = df['value'].ffill()
df['value_gap'] = df['value'].bfill() - df['value'].ffill()

Get the Julian date for the entry:

df['jd'] = df.index.to_julian_date()

Calculate the time gap:

df['jd_nan'] = np.where(~df['value'].isna(), df['jd'], np.nan)
df['jd_gap'] = df['jd_nan'].bfill() - df['jd_nan'].ffill()

Time-wise, calculate how far into the value gap we are:

df['jd_start'] = df['jd_nan'].ffill() 
df['jd_prp'] = np.where(df['jd_gap'] != 0, (df['jd'] - df['jd_start'])/df['jd_gap'], 0)

Calculate time-interpolated values:

df['filled_value'] = np.where(df['jd_gap'] <= jd_max_gap_fill, df['ffill'] + df['value_gap'] * df['jd_prp'], np.nan) 

df['filled_value']

2021-01-01 00:00:00     1.0
2021-01-01 00:05:00     NaN
2021-01-01 00:10:00     3.0
2021-01-01 00:11:00     3.5
2021-01-01 00:13:00     4.5
2021-01-01 00:14:00     5.0
2021-01-01 00:15:00     NaN
2021-01-01 01:30:00    11.0
2021-01-01 03:00:00     NaN
2021-01-01 04:00:00    21.0
2021-01-01 05:45:00     NaN
2021-01-01 06:45:00    41.0

Note that my output is different than your expected output because the first NaN is in a 10 minute gap.

jch
  • 3,600
  • 1
  • 15
  • 17