16

I just want to interpolate, in the simplest possible terms, a 3D dataset. Linear interpolation or nearest neighbour would suffice (this is to start off some algorithm, so no accurate estimate is required).

In newer scipy versions, something like griddata would be useful, but currently I only have scipy 0.8. So I have a "cube" array (data[:,:,:], shape Ni x Nj x Nk) and an array of flags (flags[:,:,:], True or False) of the same shape. I want to interpolate my data for the elements of data where the corresponding element of flags is False, using e.g. the nearest valid datapoint in data, or some linear combination of "close by" points.

There can be large gaps in the dataset in at least two dimensions. Short of coding a full-blown nearest-neighbour algorithm using kd-trees or similar, I can't really find a generic, N-dimensional nearest-neighbour interpolator.
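For concreteness, the setup looks something like this (illustrative sizes and random values, not my real data):

import numpy as np

Ni, Nj, Nk = 30, 40, 50                        # illustrative sizes only
data = np.random.random((Ni, Nj, Nk))          # the "cube"
flags = np.random.random((Ni, Nj, Nk)) > 0.3   # True where data is valid
# goal: overwrite data[~flags] using nearby values from data[flags]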

denis (21,378)
Jose (2,089)
  • This is a very relevant post: http://stackoverflow.com/questions/3104781/inverse-distance-weighted-idw-interpolation-with-python – Jose Apr 05 '11 at 12:29

5 Answers

38

Using scipy.ndimage, your problem can be solved with nearest-neighbor interpolation in two lines:

from scipy import ndimage as nd

# indices of the nearest valid cell, for every cell in the array
ind = nd.distance_transform_edt(invalid_cell_mask, return_distances=False, return_indices=True)
data = data[tuple(ind)]
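In the question's terms, invalid_cell_mask is simply the negation of the valid-data flags; assuming flags is True where data is valid, the mapping would be:

invalid_cell_mask = ~flags   # True where data needs to be filled in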

Now, in the form of a function:

import numpy as np
from scipy import ndimage as nd

def fill(data, invalid=None):
    """
    Replace the value of invalid 'data' cells (indicated by 'invalid') 
    by the value of the nearest valid data cell

    Input:
        data:    numpy array of any dimension
        invalid: a binary array of the same shape as 'data'.
                 Data values are replaced where invalid is True.
                 If None (default), use: invalid = np.isnan(data)

    Output: 
        Return a filled array. 
    """    
    if invalid is None: invalid = np.isnan(data)

    ind = nd.distance_transform_edt(invalid, 
                                    return_distances=False, 
                                    return_indices=True)
    return data[tuple(ind)]

Example of use:

def test_fill(s, d):
    # s is the size of one dimension, d is the number of dimensions
    data = np.arange(s**d).reshape((s,)*d)
    seed = np.zeros(data.shape, dtype=bool)
    seed.flat[np.random.randint(0, seed.size, int(data.size/20**d))] = True

    return fill(data, ~seed), seed

import matplotlib.pyplot as plt

data, seed = test_fill(500, 2)
data[nd.binary_dilation(seed, iterations=2)] = 0   # draw (dilated) seeds in black
plt.imshow(np.mod(data, 42))                       # show the nearest-neighbour clusters
plt.show()

Result: (image: the nearest-neighbour clusters grown from the seeds, with the dilated seeds drawn in black)

Juh_ (14,628)
16

You can set up a crystal-growth-style algorithm that shifts a view alternately along each axis, replacing only data that is flagged False but has a True neighbor. This gives a "nearest-neighbor"-like result (though not in Euclidean or Manhattan distance; I think it is nearest-neighbor if you count pixels and treat all connecting pixels with common corners as neighbours). This should be fairly efficient with NumPy, as it iterates only over the axes and the convergence iterations, not over small slices of the data.

Crude, fast and stable. I think that's what you were after:

import numpy as np
# -- setup --
shape = (10,10,10)
dim = len(shape)
data = np.random.random(shape)
flag = np.zeros(shape, dtype=bool)
t_ct = int(data.size/5)
flag.flat[np.random.randint(0, flag.size, t_ct)] = True
# True flags the data
# -- end setup --

slcs = [slice(None)]*dim

while np.any(~flag): # as long as there are any False values in flag
    for i in range(dim): # do each axis
        # make slices to shift view one element along the axis
        slcs1 = slcs[:]
        slcs2 = slcs[:]
        slcs1[i] = slice(0, -1)
        slcs2[i] = slice(1, None)
        # index with tuples (indexing with a list of slices is no longer allowed)
        slcs1 = tuple(slcs1)
        slcs2 = tuple(slcs2)

        # replace from the right
        repmask = np.logical_and(~flag[slcs1], flag[slcs2])
        data[slcs1][repmask] = data[slcs2][repmask]
        flag[slcs1][repmask] = True

        # replace from the left
        repmask = np.logical_and(~flag[slcs2], flag[slcs1])
        data[slcs2][repmask] = data[slcs1][repmask]
        flag[slcs2][repmask] = True
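If you keep copies of the arrays before the loop (say data0 = data.copy() and flag0 = flag.copy(), my naming, not part of the snippet above), a quick sanity check afterwards is:

# requires data0 = data.copy() and flag0 = flag.copy() taken before the while loop
assert flag.all()                           # no False entries remain
assert np.isin(data, data0[flag0]).all()    # every value was copied from a seeded cell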

For good measure, here's a visualization (2D) of the zones seeded by the data originally flagged True.

(image: 2D visualization of the zones grown from the True-flagged seeds)

Paul (42,322)
  • Nice one! A question (or two): it's not quite so obvious (at least for me) how to generalize this solution to arbitrary dimensions (1D, 2D, ...). But anyway, you seem to be able to attack (quite nicely) the boundary conditions from the inward side; are you aware of any side effects related to that approach? Care to elaborate more on these issues? Thanks – eat Apr 06 '11 at 18:39
  • 1
    @eat: This *is* generalized for ND. Just change `shape` to `(10,10,10,10,10)` and see. I don't understand your second question. Could you maybe re-state it a different way? – Paul Apr 06 '11 at 19:30
  • Yeah, after a while of tinkering, I (kind of) figured out how it could generalize to ND. Anyway, about the boundaries, I just meant that (with your solution) it seems you don't need to be explicit about them at all (correct?). Now, since we seem to possess some substance related to the OP's question, would it be nice to compare how they perform on some selected cases? Perhaps in another thread, in order to be able to fully focus on the topic. Thanks – eat Apr 06 '11 at 20:33
  • @eat: No, the boundaries between filled values are not explicitly defined. Sure, I'd be happy to discuss this further. I don't know any good forums over which to collaborate. SO is probably as good a place as any (as long as you post in the form of a question!) I posted the code on pastie.org http://pastie.org/pastes/1766087 I'd be curious what "tweaks" were needed to make it ND general. I thought it already was. – Paul Apr 07 '11 at 03:03
2

Some time ago I wrote this script for my PhD: https://github.com/Technariumas/Inpainting

An example: http://blog.technariumas.lt/post/117630308826/healing-holes-in-python-arrays

Slow, but it does the job. A Gaussian kernel is the best choice; just check the size/sigma values.
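The linked script is the reference; just to sketch the Gaussian-kernel idea in a few lines, here is a rough illustration of a normalized-convolution style fill (my sketch, not the script itself; the function name and signature are made up):

import numpy as np
from scipy import ndimage as nd

def gaussian_fill(data, valid, sigma=2.0):
    # illustrative helper, not from the linked script:
    # smooth the masked data and the validity mask with the same Gaussian,
    # then divide: each hole gets a distance-weighted mean of nearby valid cells
    num = nd.gaussian_filter(np.where(valid, data, 0.0), sigma)
    den = nd.gaussian_filter(valid.astype(float), sigma)
    filled = data.copy()
    holes = ~valid & (den > 0)            # holes with some valid support within reach
    filled[holes] = num[holes] / den[holes]
    return filled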

opit (41)
1

You may try to tackle your problem like this:

# main ideas described in very high-level pseudocode
choose suitable base kernel shape and type (gaussian?)
while true
    loop over your array (moving average manner)
        adapt your base kernel to current sparsity pattern
        set current value based on adapted kernel
    break if converged

This can actually be implemented in quite a straightforward manner (especially if performance is not a top concern).

Obviously this is just a heuristic, and you need to experiment with your actual data to find a proper adaptation scheme. If you see kernel adaptation as kernel reweighting, you may like to do it based on how the values have been propagated. For example, the weights for the original supports are 1 and they decay depending on the iteration in which they emerged.

Determining when this process has actually converged may also be tricky. Depending on the application, it may eventually be reasonable to leave some 'gap regions' 'unfilled'.

Update: Here is a very simple implementation along the lines *) described above:

import warnings
from numpy import any, asarray as asa, isnan, nan, nanmean, ones
from numpy.lib.stride_tricks import as_strided as ast

def _a2t(a):
    """Array to tuple."""
    return tuple(a.tolist())

def _view(D, shape, strides):
    """View of the flattened neighbourhood of every cell of D."""
    V = ast(D, shape=shape, strides=strides)
    return V.reshape(V.shape[:len(D.shape)] + (-1,))

def filler(A, n_shape, n_iter=49):
    """Fill in NaNs from the mean calculated over each cell's neighbourhood."""
    # pad with NaNs to handle the boundary conditions
    D = nan * ones(_a2t(asa(A.shape) + asa(n_shape) - 1), dtype=A.dtype)
    slc = tuple(slice(n // 2, -(n // 2)) for n in n_shape)
    D[slc] = A

    # sliding-window view of each cell's neighbourhood
    shape = _a2t(asa(D.shape) - asa(n_shape) + 1) + n_shape
    strides = D.strides * 2

    # iterate until no NaNs remain, but not more than n_iter iterations
    with warnings.catch_warnings():
        warnings.simplefilter('ignore', RuntimeWarning)  # all-NaN neighbourhoods
        for k in range(n_iter):
            M = isnan(D[slc])
            if not any(M):
                break
            D[slc][M] = nanmean(_view(D, shape, strides), -1)[M]
    A[:] = D[slc]

And a simple demonstration of filler(.) in action would be something like:

In []: x= ones((3, 6, 99))
In []: x.sum(-1)
Out[]:
array([[ 99.,  99.,  99.,  99.,  99.,  99.],
       [ 99.,  99.,  99.,  99.,  99.,  99.],
       [ 99.,  99.,  99.,  99.,  99.,  99.]])
In []: x= nan* x
In []: x[1, 2, 3]= 1
In []: x.sum(-1)
Out[]:
array([[ nan,  nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan,  nan]])
In []: filler(x, (3, 3, 5))
In []: x.sum(-1)
Out[]:
array([[ 99.,  99.,  99.,  99.,  99.,  99.],
       [ 99.,  99.,  99.,  99.,  99.,  99.],
       [ 99.,  99.,  99.,  99.,  99.,  99.]])

*) Here the nanmean(.) is just used to demonstrate the idea of the adaptation process. Based on this demonstration, it should be quite straightforward to implement a more complex adaptation and decaying weighting scheme. Also note that no attention is paid to actual execution performance, but it should still be reasonable (with sensible input shapes).
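One possible way to add such a decaying weighting scheme (my sketch, not part of the implementation above; it borrows scipy.ndimage for the smoothing, and the function name and parameters are made up): keep a weight array where the original supports have weight 1, and give cells filled at iteration k the weight decay**k.

import numpy as np
from scipy import ndimage as nd

def weighted_fill(A, sigma=1.0, decay=0.5, n_iter=20):
    """Illustrative sketch: fill NaNs with a weighted local mean; later-filled cells get lower weight."""
    filled = np.where(np.isnan(A), 0.0, A)
    W = (~np.isnan(A)).astype(float)        # weight 1 on the original supports
    for k in range(1, n_iter + 1):
        missing = (W == 0)
        if not missing.any():
            break
        num = nd.gaussian_filter(filled * W, sigma)
        den = nd.gaussian_filter(W, sigma)
        newly = missing & (den > 0)         # holes reachable at this iteration
        filled[newly] = num[newly] / den[newly]
        W[newly] = decay ** k               # values emerging later weigh less
    return filled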

eat (7,440)
0

Maybe what you are looking for is a machine learning algorithm, like a neural network or a support vector machine.

You may check this page, which has some links to SVM packages for Python: http://web.media.mit.edu/~stefie10/technical/pythonml.html
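To make the idea concrete, a minimal sketch (my illustration, assuming scikit-learn is available): fit a regressor on the grid coordinates of the valid cells, then predict the values at the invalid ones.

import numpy as np
from sklearn.svm import SVR

# illustrative only; SVR on a full 3D grid can be very slow
coords = np.argwhere(flags).astype(float)    # (n_valid, 3) coordinates of valid cells
values = data[flags]
model = SVR().fit(coords, values)

missing = np.argwhere(~flags).astype(float)  # coordinates of the gaps
data[~flags] = model.predict(missing)        # fill the gaps in place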

Simon Bergot (10,378)
  • 2
    This is overkill for my requirements. I just need to plug some gaps in the data with a simple interpolation mechanism that is "stable". – Jose Apr 05 '11 at 12:02