Fill in missing values with nearest neighbour in Python numpy masked arrays?

Question

I am working with a 2D Numpy masked_array in Python. I need to change the data values in the masked area such that they equal the nearest unmasked value.

NB. If there is more than one nearest unmasked value, it can take any of them (whichever turns out to be easiest to code…)

e.g.

import numpy
import numpy.ma as ma

a = numpy.arange(100).reshape(10,10)
fill_value=-99
a[2:4,3:8] = fill_value
a[8,8] = fill_value
a = ma.masked_array(a,a==fill_value)

>>> a
[[0 1 2 3 4 5 6 7 8 9]
 [10 11 12 13 14 15 16 17 18 19]
 [20 21 22 -- -- -- -- -- 28 29]
 [30 31 32 -- -- -- -- -- 38 39]
 [40 41 42 43 44 45 46 47 48 49]
 [50 51 52 53 54 55 56 57 58 59]
 [60 61 62 63 64 65 66 67 68 69]
 [70 71 72 73 74 75 76 77 78 79]
 [80 81 82 83 84 85 86 87 -- 89]
 [90 91 92 93 94 95 96 97 98 99]]

I need it to look like this:

>>> a.data
[[0 1 2 3 4 5 6 7 8 9]
 [10 11 12 13 14 15 16 17 18 19]
 [20 21 22 ? 14 15 16 ? 28 29]
 [30 31 32 ? 44 45 46 ? 38 39]
 [40 41 42 43 44 45 46 47 48 49]
 [50 51 52 53 54 55 56 57 58 59]
 [60 61 62 63 64 65 66 67 68 69]
 [70 71 72 73 74 75 76 77 78 79]
 [80 81 82 83 84 85 86 87 ? 89]
 [90 91 92 93 94 95 96 97 98 99]]

NB. "?" could take any of the adjacent unmasked values.

What is the most efficient way to do this?

Thanks for your help.

score 15 · Answer 1 · edited May 23 '17 at 12:00

I generally use a distance transform, as wisely suggested by Juh_ in this question.

This does not directly apply to masked arrays, but I do not think it would be hard to adapt. It is also quite efficient: I've had no problem applying it to large 100 MPix images.

Copying the relevant method here for reference:

import numpy as np
from scipy import ndimage as nd

def fill(data, invalid=None):
    """
    Replace the value of invalid 'data' cells (indicated by 'invalid') 
    by the value of the nearest valid data cell

    Input:
        data:    numpy array of any dimension
        invalid: a binary array of same shape as 'data'. True cells set where data
                 value should be replaced.
                 If None (default), use: invalid  = np.isnan(data)

    Output: 
        Return a filled array. 
    """
    if invalid is None:
        invalid = np.isnan(data)

    # For every cell, get the index of the nearest valid (invalid == False) cell;
    # valid cells map to themselves.
    ind = nd.distance_transform_edt(invalid, return_distances=False, return_indices=True)
    return data[tuple(ind)]
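
To adapt this to the masked array in the question, one option is to pass the mask as the invalid array. The following is only a sketch under that assumption: the name fill_masked is hypothetical (not part of NumPy or SciPy), and it returns a plain ndarray rather than a masked array.

```python
import numpy as np
import numpy.ma as ma
from scipy import ndimage as nd

def fill_masked(arr):
    """Hypothetical helper: replace masked cells by the nearest unmasked value."""
    # getmaskarray always returns a full boolean array, even if nothing is masked
    invalid = ma.getmaskarray(arr)
    # Index of the nearest unmasked cell for every cell (unmasked cells map to themselves)
    ind = nd.distance_transform_edt(invalid, return_distances=False, return_indices=True)
    return ma.getdata(arr)[tuple(ind)]

a = np.arange(100).reshape(10, 10)
a[2:4, 3:8] = -99
a[8, 8] = -99
a = ma.masked_array(a, a == -99)
filled = fill_masked(a)
print(filled)
```

Which value is chosen when several unmasked cells are equally near depends on the distance transform, so this sketch makes no promise about ties.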

unutbu · Accepted Answer · 2010-09-07T21:52:45.350

You could use np.roll to make shifted copies of a, then use boolean logic on the masks to identify the spots to be filled in:

import numpy as np
import numpy.ma as ma

a = np.arange(100).reshape(10,10)
fill_value=-99
a[2:4,3:8] = fill_value
a[8,8] = fill_value
a = ma.masked_array(a,a==fill_value)
print(a)

# [[0 1 2 3 4 5 6 7 8 9]
#  [10 11 12 13 14 15 16 17 18 19]
#  [20 21 22 -- -- -- -- -- 28 29]
#  [30 31 32 -- -- -- -- -- 38 39]
#  [40 41 42 43 44 45 46 47 48 49]
#  [50 51 52 53 54 55 56 57 58 59]
#  [60 61 62 63 64 65 66 67 68 69]
#  [70 71 72 73 74 75 76 77 78 79]
#  [80 81 82 83 84 85 86 87 -- 89]
#  [90 91 92 93 94 95 96 97 98 99]]

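for shift in (-1,1):
    for axis in (0,1):
        # Shifted copy: each cell now holds the value of one of its four neighbours
        a_shifted=np.roll(a,shift=shift,axis=axis)
        # Cells masked in a but unmasked in the shifted copy can be filled
        idx=~a_shifted.mask & a.mask
        a[idx]=a_shifted[idx]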

print(a)

# [[0 1 2 3 4 5 6 7 8 9]
#  [10 11 12 13 14 15 16 17 18 19]
#  [20 21 22 13 14 15 16 28 28 29]
#  [30 31 32 43 44 45 46 47 38 39]
#  [40 41 42 43 44 45 46 47 48 49]
#  [50 51 52 53 54 55 56 57 58 59]
#  [60 61 62 63 64 65 66 67 68 69]
#  [70 71 72 73 74 75 76 77 78 79]
#  [80 81 82 83 84 85 86 87 98 89]
#  [90 91 92 93 94 95 96 97 98 99]]

If you'd like to use a larger set of nearest neighbors, you could perhaps do something like this:

neighbors=((0,1),(0,-1),(1,0),(-1,0),(1,1),(-1,1),(1,-1),(-1,-1),
           (0,2),(0,-2),(2,0),(-2,0))

Note that the order of the elements in neighbors is important. You probably want to fill in missing values with the nearest neighbor, not just any neighbor. There's probably a smarter way to generate the neighbors sequence, but I'm not seeing it at the moment.
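
One possible way to generate such a sequence is to enumerate every offset in a square window and sort by squared Euclidean distance. This is only a sketch: neighbor_offsets and its radius parameter are hypothetical names, and the order among offsets at the same distance is arbitrary.

```python
from itertools import product

def neighbor_offsets(radius):
    """Hypothetical helper: (hor_shift, vert_shift) pairs, nearest first."""
    steps = range(-radius, radius + 1)
    # Every offset in the (2*radius+1) x (2*radius+1) window except (0, 0)
    offsets = [(dx, dy) for dx, dy in product(steps, repeat=2) if (dx, dy) != (0, 0)]
    # Sort by squared Euclidean distance so nearer neighbours come first
    return sorted(offsets, key=lambda o: o[0] ** 2 + o[1] ** 2)

neighbors = neighbor_offsets(2)
print(neighbors[:4])  # the four edge-adjacent offsets
```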

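a_copy=a.copy()
for hor_shift,vert_shift in neighbors:
    # Stop as soon as nothing is left to fill
    if not np.any(a.mask): break
    # Always shift the untouched copy, so only original values are propagated
    a_shifted=np.roll(a_copy,shift=hor_shift,axis=1)
    a_shifted=np.roll(a_shifted,shift=vert_shift,axis=0)
    idx=~a_shifted.mask & a.mask
    a[idx]=a_shifted[idx]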

Note that np.roll happily rolls the lower edge to the top, so a missing value at the top may be filled in by a value from the very bottom. If this is a problem, I'd have to think more about how to fix it. The obvious but not very clever solution would be to use if statements and feed the edges a different sequence of admissible neighbors...
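
One way around the wrap-around, sketched here rather than taken from the code above, is to shift with slicing instead of np.roll and to treat the vacated edge cells as masked. The pass is repeated until nothing is masked, so gaps wider than one cell are covered too. The names shift_no_wrap and fill_by_shifting are hypothetical. The sketch works on the plain data and mask arrays, and each pass reads from a snapshot, so the filled region grows by one step along the axes per pass.

```python
import numpy as np
import numpy.ma as ma

def shift_no_wrap(arr, shift, axis, pad):
    """Shift arr by a non-zero `shift` along `axis`; vacated cells get `pad`."""
    out = np.full_like(arr, pad)
    src = [slice(None)] * arr.ndim
    dst = [slice(None)] * arr.ndim
    if shift > 0:
        src[axis], dst[axis] = slice(None, -shift), slice(shift, None)
    else:
        src[axis], dst[axis] = slice(-shift, None), slice(None, shift)
    out[tuple(dst)] = arr[tuple(src)]
    return out

def fill_by_shifting(a):
    """Hypothetical helper: fill masked cells from axis neighbours, without wrapping."""
    data = ma.getdata(a).copy()
    mask = ma.getmaskarray(a).copy()
    if mask.all():
        raise ValueError("no unmasked values to fill from")
    while mask.any():
        # Snapshot, so each pass only uses values that existed before the pass
        src_data, src_mask = data.copy(), mask.copy()
        for shift in (-1, 1):
            for axis in range(data.ndim):
                s_data = shift_no_wrap(src_data, shift, axis, 0)
                # Cells shifted in from outside the array count as masked
                s_mask = shift_no_wrap(src_mask, shift, axis, True)
                idx = mask & ~s_mask
                data[idx] = s_data[idx]
                mask[idx] = False
    return data
```

Because only the four axis directions are used, "nearest" here means nearest in city-block distance, which can differ from the Euclidean nearest neighbour for large gaps.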

Great! That works for my purposes. One question: could it be generalized to work for larger data gaps where the nearest unmasked value is more than one point away? — Pete W, Sep 07 '10 at 21:03
@Pete - A quick way to do so is to wrap the for loops in a `while np.any(a.mask):`. @unutbu - Damn slick way of implementing nearest neighbor interpolation, by the way! — Joe Kington, Sep 07 '10 at 21:46
Thanks! I've got everything I need now. It works like a charm. — Pete W, Sep 07 '10 at 22:06
Really liked this one. I needed to handle the case where values should not roll over from the bottom to the top; here is a solution. You can copy the values in reverse order onto the beginning and end of each dimension (like `np.pad(..., mode='reflect')`; sadly, `np.pad` does not work on masked arrays). I inserted the line `a = ma.concatenate([a[:, :0:-1], a, a[:, -2::-1]], axis=-1)` before the above code and `return a.data[:, a.shape[1]//3:-(a.shape[1]//3)]` after it. I only padded along the 2nd dimension, as rolling over was actually fine for the other one (cylindrical geometry). — David S., Sep 06 '19 at 08:34

sega_sai · Answer 3 · 2012-03-16T13:59:51.393

7

For more complicated cases you could use scipy.spatial:

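import numpy as np
from scipy.spatial import KDTree

# a is the masked array from the question
x,y=np.mgrid[0:a.shape[0],0:a.shape[1]]

# (row, col) coordinates of the unmasked and masked cells
xygood = np.array((x[~a.mask],y[~a.mask])).T
xybad = np.array((x[a.mask],y[a.mask])).T

# query() returns (distances, indices); the indices select the nearest unmasked value
a[a.mask] = a[~a.mask][KDTree(xygood).query(xybad)[1]]

print(a)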
  [[0 1 2 3 4 5 6 7 8 9]
  [10 11 12 13 14 15 16 17 18 19]
  [20 21 22 13 14 15 16 17 28 29]
  [30 31 32 32 44 45 46 38 38 39]
  [40 41 42 43 44 45 46 47 48 49]
  [50 51 52 53 54 55 56 57 58 59]
  [60 61 62 63 64 65 66 67 68 69]
  [70 71 72 73 74 75 76 77 78 79]
  [80 81 82 83 84 85 86 87 78 89]
  [90 91 92 93 94 95 96 97 98 99]]
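
The same idea can be wrapped in a function that works for any number of dimensions by taking the coordinates from np.argwhere. This is a sketch only: fill_nearest_kdtree is a hypothetical name, it returns a plain ndarray, and it assumes at least one cell is unmasked.

```python
import numpy as np
import numpy.ma as ma
from scipy.spatial import KDTree

def fill_nearest_kdtree(arr):
    """Hypothetical helper: fill masked cells with the value of the nearest unmasked cell."""
    mask = ma.getmaskarray(arr)
    data = ma.getdata(arr).copy()
    if not mask.any():
        return data
    # argwhere lists coordinates in the same row-major order as boolean indexing
    good = np.argwhere(~mask)
    bad = np.argwhere(mask)
    # For each masked cell, the index (into `good`) of the nearest unmasked cell
    _, nearest = KDTree(good).query(bad)
    data[mask] = data[~mask][nearest]
    return data
```

As in the output above, which neighbour wins a tie depends on the tree query, so different equally near values may appear.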

edited Mar 16 '12 at 13:59

answered Mar 14 '12 at 18:34


Could this same approach be used for extrapolating outside of the convex hull after interpolating some irregularly spaced data using a Nearest Neighbor algorithm? It appears that it could work, but perhaps there are better alternatives. Just wondering, Thanks. – SSZero Jun 24 '12 at 23:40
