3

I am currently trying to clean up and fill in some missing time-series data using pandas. The interpolate function works quite well, but it lacks a few less widely used interpolation methods that I need for my data set. Examples would be a simple "last valid data point" fill, which would produce something akin to a step function, or a logarithmic or geometric interpolation.

Browsing through the docs, I could not find a way to pass a custom interpolation function. Does such functionality exist directly within pandas? And if not, has anyone done any pandas-fu to efficiently apply custom interpolations through other means?

MarkD
  • For the specific case of reusing the last valid value, `ffill` is what you would use. Generally, you could use `apply` for such purposes or just do some magic with individual series and reassign them to your data frame. What else exactly are you missing? – languitar Jan 27 '17 at 14:12
  • The specific issue: my data set isn't exactly "clean" in its missing data. There might be 1 or 2 values missing here or there, then a thousand good values, then a chunk of 20 missing values. Identifying those boundaries and applying a function that takes as inputs the non-missing value before and the non-missing value after is what is hanging me up. – MarkD Jan 27 '17 at 14:21

2 Answers

4

Most of the interpolation methods offered by pandas are delegated to SciPy (for example scipy.interpolate.interp1d), which, unfortunately, does not seem to be extensible in any way. I had to do something like that to apply SLERP quaternion interpolation (using numpy-quaternion), and I managed to do it quite efficiently. I'll copy the code here in the hope that you can adapt it for your purposes:

import numpy as np
import quaternion  # provided by the numpy-quaternion package

def interpolate_slerp(data):
    if data.shape[1] != 4:
        raise ValueError('Need exactly 4 values for SLERP')
    vals = data.values.copy()
    # quaternions has shape (N,): each quaternion is a single array element
    quaternions = quaternion.as_quat_array(vals)
    # This is a mask of the elements that are NaN
    empty = np.any(np.isnan(vals), axis=1)
    # These are the positions of the valid values
    valid_loc = np.argwhere(~empty).squeeze(axis=-1)
    # These are the indices (e.g. time) of the valid values
    valid_index = data.index[valid_loc].values
    # These are the valid values
    valid_quaternions = quaternions[valid_loc]
    # Positions of the missing values
    empty_loc = np.argwhere(empty).squeeze(axis=-1)
    # Missing values before first or after last valid are discarded
    empty_loc = empty_loc[(empty_loc > valid_loc.min()) & (empty_loc < valid_loc.max())]
    # Index value for missing values
    empty_index = data.index[empty_loc].values
    # Important bit! This tells you which valid values must be used as interpolation ends for each missing value
    interp_loc_end = np.searchsorted(valid_loc, empty_loc)
    interp_loc_start = interp_loc_end - 1
    # These are the actual values of the interpolation ends
    interp_q_start = valid_quaternions[interp_loc_start]
    interp_q_end = valid_quaternions[interp_loc_end]
    # And these are the indices (e.g. time) of the interpolation ends
    interp_t_start = valid_index[interp_loc_start]
    interp_t_end = valid_index[interp_loc_end]
    # This performs the actual interpolation
    # For each missing value, you have:
    #   * Initial interpolation value
    #   * Final interpolation value
    #   * Initial interpolation index
    #   * Final interpolation index
    #   * Missing value index
    interpolated = quaternion.slerp(interp_q_start, interp_q_end, interp_t_start, interp_t_end, empty_index)
    # This puts the interpolated values into place
    data = data.copy()
    data.iloc[empty_loc] = quaternion.as_float_array(interpolated)
    return data

The trick is in np.searchsorted, which very quickly finds the right interpolation ends for each missing value. The limitations of this method are:

  • Your interpolation function must work somewhat like quaternion.slerp (which should not be strange since it has regular ufunc broadcasting behaviour).
  • It only works for interpolation methods that require only one value on each end, so if you want e.g. something like a cubic interpolation (which you don't because that one is already provided) this wouldn't work.
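The same np.searchsorted idea can be sketched for an ordinary numeric Series. This is only an illustration, not part of pandas: `interpolate_custom` and `geometric` are hypothetical names, and the sketch assumes a sorted numeric (or datetime) index and a vectorized function taking the two end values, their index values, and the index values of the gaps.

```python
import numpy as np
import pandas as pd

def interpolate_custom(series, func):
    """Fill interior NaN gaps using func(v0, v1, t0, t1, t) (hypothetical helper)."""
    vals = series.to_numpy(dtype=float)
    index = series.index.to_numpy()
    empty = np.isnan(vals)
    valid_loc = np.flatnonzero(~empty)
    if valid_loc.size < 2:
        # Nothing to interpolate between
        return series.copy()
    empty_loc = np.flatnonzero(empty)
    # Leading and trailing gaps have only one neighbour, so they are left as NaN
    empty_loc = empty_loc[(empty_loc > valid_loc[0]) & (empty_loc < valid_loc[-1])]
    # For each gap position, find the valid positions just after and just before it
    end = np.searchsorted(valid_loc, empty_loc)
    start = end - 1
    out = vals.copy()
    out[empty_loc] = func(
        vals[valid_loc[start]], vals[valid_loc[end]],
        index[valid_loc[start]], index[valid_loc[end]],
        index[empty_loc],
    )
    return pd.Series(out, index=series.index, name=series.name)

def geometric(v0, v1, t0, t1, t):
    # Geometric interpolation; assumes v0 and v1 are non-zero with the same sign
    frac = (t - t0) / (t1 - t0)
    return v0 * (v1 / v0) ** frac

s = pd.Series([1.0, np.nan, 4.0, 8.0, np.nan, np.nan, 64.0])
filled = interpolate_custom(s, geometric)
# A "last valid value" step fill is just the start value of each gap
stepped = interpolate_custom(s, lambda v0, v1, t0, t1, t: v0)
```

With this input, the geometric fill should give approximately 2 for the first gap and 16 and 32 for the second, while the step fill repeats 1 and 8.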
jdehesa
  • It is oversimplified to say one wouldn't want to do a cubic spline -- I'm here specifically because I have a specific local monotonicity preserving spline in mind. – Eli S Oct 25 '19 at 04:55
  • @EliS So you want a cubic interpolation different to what is already provided by SciPy? Maybe you can open a new question about what you exactly need (and if you want point out why this answer doesn't work for you). – jdehesa Oct 25 '19 at 09:18
3

To find the blocks of missing data inside a Series, you can do something along the lines of the question "Finding consecutive segments in a pandas data frame":

import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, np.nan, 5, 6, np.nan, np.nan, np.nan, 10])
x = s.isnull().reset_index(name='null')
# computes unique numbers for each block of consecutive nan/non-nan values
x['block'] = (x['null'].shift(1) != x['null']).astype(int).cumsum()
# select those blocks that relate to null values
x[x['null']].groupby('block')['index'].apply(np.array)

This will result in the following series where the values are arrays of all index-entries containing nan values for each block:

block
2       [2, 3]
4    [6, 7, 8]
Name: index, dtype: object

You can iterate over these and apply custom fixing logic. Getting values before and after should be easy then.
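As a rough sketch of that iteration, the loop below looks up the value before and after each block and passes them to a fixing function. `custom_fill` is a hypothetical placeholder (here it fills the gap with the mean of its two neighbours), and the sketch assumes a unique index; blocks touching either end of the Series are skipped because they lack a neighbour on one side.

```python
import numpy as np
import pandas as pd

def custom_fill(before, after, n):
    # Hypothetical rule: fill the whole gap with the mean of its two neighbours
    return np.full(n, (before + after) / 2)

s = pd.Series([1, 2, np.nan, np.nan, 5, 6, np.nan, np.nan, np.nan, 10])
is_null = s.isnull()
# Same block numbering as above: the counter increases at every NaN/non-NaN switch
block = (is_null != is_null.shift()).cumsum()

filled = s.copy()
for _, gap in s[is_null].groupby(block[is_null]):
    # Positions of the first and last missing entry in this block
    first = s.index.get_loc(gap.index[0])
    last = s.index.get_loc(gap.index[-1])
    if first == 0 or last == len(s) - 1:
        continue  # no valid value on one side
    before, after = s.iloc[first - 1], s.iloc[last + 1]
    filled.iloc[first:last + 1] = custom_fill(before, after, len(gap))
```

For the example Series, the first block sits between 2 and 5 and the second between 6 and 10, so this particular rule fills them with 3.5 and 8 respectively.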

languitar