Creating a custom interpolation function for pandas

Question

I am currently trying to clean up and fill in some missing time-series data using pandas. The interpolate function works quite well, however it doesn't have a few (less widely used) interpolation functions that I require for my data set. A couple examples would be a simple "last" valid data point which would create something akin to a step function, or something like a logarithmic or geometric interpolation.

Browsing through the docs, it didn't appear there is a way to pass a custom interpolation function. Does such functionality exist directly within pandas? And if not, has anyone done any pandas-fu to efficiently apply custom interpolations through other means?

For the specific case of reusing the last valid value, `ffill` is what you would use. Generally, you could use `apply` for such purposes or just do some magic with individual series and reassign them to your data frame. What else exactly are you missing? — languitar, Jan 27 '17 at 14:12
The specific issue- my data set isn't exactly "clean" in its missing data. There might be 1 or 2 values missing here or there, then a thousand good values, then a chunk of 20 missing values. Identifying those boundaries and applying a function that takes as inputs the non-missing value before and the non-missing value after is what is hanging me up. — MarkD, Jan 27 '17 at 14:21
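
As a minimal sketch of the `ffill` suggestion in the first comment (the sample data is made up for illustration), forward filling repeats the last valid value across each gap, which gives the step-function behaviour asked about in the question:

```python
import numpy as np
import pandas as pd

# Made-up sample series with two gaps
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0])

# Each NaN takes the last valid value seen before it
step = s.ffill()
print(step.tolist())  # [1.0, 1.0, 1.0, 4.0, 4.0, 6.0]
```

A NaN at the very start of the series has no earlier value to copy, so `ffill` leaves it as NaN.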

score 4 · Accepted Answer · answered Jan 27 '17 at 14:52

The interpolation methods offered by Pandas are those offered by scipy.interpolate.interp1d - which, unfortunately, do not seem to be extendable in any way. I had to do something like that to apply SLERP quaternion interpolation (using numpy-quaternion), and I managed to do it quite efficiently. I'll copy the code here in the hope that you can adapt it for your purposes:

import numpy as np
import quaternion  # provided by the numpy-quaternion package

def interpolate_slerp(data):
    if data.shape[1] != 4:
        raise ValueError('Need exactly 4 values for SLERP')
    vals = data.values.copy()
    # quaternions has shape (N,): each row of 4 floats becomes one quaternion element
    quaternions = quaternion.as_quat_array(vals)
    # This is a mask of the elements that are NaN
    empty = np.any(np.isnan(vals), axis=1)
    # These are the positions of the valid values
    valid_loc = np.argwhere(~empty).squeeze(axis=-1)
    # These are the indices (e.g. time) of the valid values
    valid_index = data.index[valid_loc].values
    # These are the valid values
    valid_quaternions = quaternions[valid_loc]
    # Positions of the missing values
    empty_loc = np.argwhere(empty).squeeze(axis=-1)
    # Missing values before first or after last valid are discarded
    empty_loc = empty_loc[(empty_loc > valid_loc.min()) & (empty_loc < valid_loc.max())]
    # Index value for missing values
    empty_index = data.index[empty_loc].values
    # Important bit! This tells you which valid values must be used as interpolation ends for each missing value
    interp_loc_end = np.searchsorted(valid_loc, empty_loc)
    interp_loc_start = interp_loc_end - 1
    # These are the actual values of the interpolation ends
    interp_q_start = valid_quaternions[interp_loc_start]
    interp_q_end = valid_quaternions[interp_loc_end]
    # And these are the indices (e.g. time) of the interpolation ends
    interp_t_start = valid_index[interp_loc_start]
    interp_t_end = valid_index[interp_loc_end]
    # This performs the actual interpolation
    # For each missing value, you have:
    #   * Initial interpolation value
    #   * Final interpolation value
    #   * Initial interpolation index
    #   * Final interpolation index
    #   * Missing value index
    interpolated = quaternion.slerp(interp_q_start, interp_q_end, interp_t_start, interp_t_end, empty_index)
    # This puts the interpolated values into place
    data = data.copy()
    data.iloc[empty_loc] = quaternion.as_float_array(interpolated)
    return data

The trick is in np.searchsorted, which very quickly finds the right interpolation ends for each missing value. This method has two limitations:

Your interpolation function must work somewhat like quaternion.slerp (which should not be strange since it has regular ufunc broadcasting behaviour).
It only works for interpolation methods that require only one value on each end, so if you want e.g. something like a cubic interpolation (which you don't because that one is already provided) this wouldn't work.
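
The same `np.searchsorted` idea can be adapted to a plain numeric Series. The following is only a sketch under stated assumptions: `interpolate_custom` and `geometric` are hypothetical names introduced here, not part of pandas, the index is assumed to be numeric and sorted, and the custom function is assumed to be vectorised over NumPy arrays.

```python
import numpy as np
import pandas as pd


def interpolate_custom(series, func):
    """Fill interior NaNs with func(v0, v1, t0, t1, t).

    func is a hypothetical user-supplied callable that receives arrays of
    the bracketing valid values (v0, v1), their index values (t0, t1) and
    the index values of the missing points (t).
    """
    vals = series.to_numpy(dtype=float)
    index = series.index.to_numpy(dtype=float)  # assumes a numeric index
    empty = np.isnan(vals)
    valid_loc = np.flatnonzero(~empty)
    empty_loc = np.flatnonzero(empty)
    if valid_loc.size < 2:
        return series.copy()
    # Leading and trailing gaps have no bracketing pair, so they stay NaN
    empty_loc = empty_loc[(empty_loc > valid_loc[0]) & (empty_loc < valid_loc[-1])]
    # For each missing position, find the valid value after and before it
    end = np.searchsorted(valid_loc, empty_loc)
    start = end - 1
    out = vals.copy()
    out[empty_loc] = func(
        vals[valid_loc[start]], vals[valid_loc[end]],
        index[valid_loc[start]], index[valid_loc[end]],
        index[empty_loc],
    )
    return pd.Series(out, index=series.index, name=series.name)


def geometric(v0, v1, t0, t1, t):
    # Linear interpolation in log space; assumes strictly positive values
    w = (t - t0) / (t1 - t0)
    return v0 * (v1 / v0) ** w


s = pd.Series([1.0, np.nan, 4.0, np.nan, np.nan, 32.0])
filled = interpolate_custom(s, geometric)
print(filled.round(6).tolist())  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
```

Any other two-endpoint rule (for example a logarithmic one) can be passed in place of `geometric` as long as it accepts the same five arrays.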

It is oversimplified to say one wouldn't want to do a cubic spline -- I'm here specifically because I have a specific local monotonicity preserving spline in mind. — Eli S, Oct 25 '19 at 04:55
@EliS So you want a cubic interpolation different to what is already provided by SciPy? Maybe you can open a new question about what you exactly need (and if you want point out why this answer doesn't work for you). — jdehesa, Oct 25 '19 at 09:18

score 3 · Answer 2 · answered Jan 27 '17 at 14:31

In order to find the blocks of missing data inside a Series you can do something along the lines of the question "Finding consecutive segments in a pandas data frame":

import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, np.nan, 5, 6, np.nan, np.nan, np.nan, 10])
x = s.isnull().reset_index(name='null')
# computes unique numbers for each block of consecutive nan/non-nan values
x['block'] = (x['null'].shift(1) != x['null']).astype(int).cumsum()
# select those blocks that relate to null values
x[x['null']].groupby('block')['index'].apply(np.array)

This will result in the following series where the values are arrays of all index-entries containing nan values for each block:

block
2       [2, 3]
4    [6, 7, 8]
Name: index, dtype: object

You can iterate over these and apply custom fixing logic. Getting values before and after should be easy then.
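
As a rough sketch of that last step, the loop below walks over each block of missing values, looks up the valid neighbours on both sides and fills the gap. The fill rule (evenly spaced steps between the neighbours) is only a placeholder for whatever custom logic is needed, and the code assumes the index has no duplicate labels.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, np.nan, 5, 6, np.nan, np.nan, np.nan, 10])

is_null = s.isnull()
# Same block numbering as above: the counter increases at every NaN/non-NaN switch
block = (is_null.shift(1) != is_null).astype(int).cumsum()

filled = s.copy()
for _, gap in s[is_null].groupby(block[is_null]):
    # Positions of the first and last missing value in this block
    first = s.index.get_loc(gap.index[0])
    last = s.index.get_loc(gap.index[-1])
    # Skip gaps that touch either end of the series (one neighbour is missing)
    if first == 0 or last == len(s) - 1:
        continue
    before, after = s.iloc[first - 1], s.iloc[last + 1]
    # Placeholder custom rule: evenly spaced steps between the two neighbours
    filled.iloc[first:last + 1] = np.linspace(before, after, len(gap) + 2)[1:-1]

print(filled.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
```

A Python-level loop like this is simple to adapt but runs once per gap, so it suits data with a moderate number of gaps rather than millions of them.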
