mask only where consecutive nans exceeds x

Question

I was answering a question about the pandas interpolate method. The OP wanted to interpolate only where the number of consecutive np.nan values was one. The limit=1 option for interpolate fills the first np.nan of a run and stops there. The OP wanted to detect that a run contains more than one np.nan and, in that case, not fill even the first one.

I boiled this down to running interpolate as is and masking the runs of consecutive np.nan after the fact.
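A minimal sketch of that interpolate-then-mask idea, assuming the mask is already known. The Series `s` and the hand-written array `keep` are illustrative stand-ins (they are not from the original question); `keep` is simply the mask expected for x = 2, written out by hand.

```python
import numpy as np
import pandas as pd

# Illustrative data (an assumption): same NaN layout as the array `a`
# in the question, but with distinct values so the interpolation is visible.
s = pd.Series([1, np.nan, np.nan, np.nan, 5, np.nan, 7, 8, np.nan, np.nan, 11, 12],
              dtype=float)

# Hand-written mask for x = 2: False inside runs of 2 or more NaN.
keep = np.array([1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1], dtype=bool)

# Interpolate everything, then put NaN back wherever the mask is False.
filled = s.interpolate().where(keep)
print(filled.tolist())
```

Only the isolated NaN at position 5 ends up filled; the longer runs stay NaN.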

The question is: What is a generalized solution that takes a 1-d array a and an integer x and produces a boolean mask that is False at every position belonging to a run of x or more consecutive np.nan, and True everywhere else?

Consider the 1-d array a

a = np.array([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1])

I'd expect that for x = 2 the mask would look like this

# assume 1 for True and 0 for False 
# a is [  1.  nan  nan  nan   1.  nan   1.   1.  nan  nan   1.   1.]
# mask [  1.   0.   0.   0.   1.   1.   1.   1.   0.   0.   1.   1.]
#                                  ^
#                                  |
#   Notice that this is not masked because there is only one np.nan

I'd expect that for x = 3 the mask would look like this

# assume 1 for True and 0 for False 
# a is [  1.  nan  nan  nan   1.  nan   1.   1.  nan  nan   1.   1.]
# mask [  1.   0.   0.   0.   1.   1.   1.   1.   1.   1.   1.   1.]
#                                  ^              ^    ^
#                                  |              |    |
# Notice that these are not masked because each run has fewer than 3 np.nan's

I look forward to learning from others' ideas ;-)

score 1 · Answer 1 · answered Mar 29 '17 at 00:41

I created this generalized solution

import pandas as pd
import numpy as np
from numpy.lib.stride_tricks import as_strided as strided

def mask_knans(a, x):
    a = np.asarray(a)
    k = a.shape[0]

    # I will stride n.  I want to pad with 1 less False than
    # the required number of np.nan's
    n = np.append(np.isnan(a), [False] * (x - 1))

    # prepare the mask and fill it with True
    # (np.bool8 is a deprecated alias, removed in NumPy 2.0; use np.bool_)
    m = np.ones(k, dtype=np.bool_)

    # stride n into a number of columns equal to
    # the required number of np.nan's to mask
    # this is essentially a rolling all operation on isnull
    # also reshape with `[:, None]` in preparation for broadcasting
    # np.where finds the indices where we successfully start
    # x consecutive np.nan's
    s = n.strides[0]
    i = np.where(strided(n, (k + 1 - x, x), (s, s)).all(1))[0][:, None]

    # since I prepped with `[:, None]` when I add `np.arange(x)`
    # I'm including the subsequent indices where the remaining
    # x - 1 np.nan's are
    i = i + np.arange(x)

    # I use `pd.unique` because it doesn't sort and I don't need to sort
    i = pd.unique(i[i < k])

    m[i] = False

    return m

w/o comments

import pandas as pd
import numpy as np
from numpy.lib.stride_tricks import as_strided as strided

def mask_knans(a, x):
    a = np.asarray(a)
    k = a.shape[0]
    n = np.append(np.isnan(a), [False] * (x - 1))
    m = np.ones(k, dtype=np.bool_)
    s = n.strides[0]
    i = np.where(strided(n, (k + 1 - x, x), (s, s)).all(1))[0][:, None]
    i = i + np.arange(x)
    i = pd.unique(i[i < k])
    m[i] = False
    return m

demo

mask_knans(a, 2)

[ True False False False  True  True  True  True False False  True  True]

mask_knans(a, 3)

[ True False False False  True  True  True  True  True  True  True  True]
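The "rolling all" step can also be sketched with `sliding_window_view` (available in NumPy 1.20 and later), which builds the same windows without hand-computed strides. This is a variant under stated assumptions, not the code from this answer: the name `mask_knans_windowed` is hypothetical, and the handling of x < 1 and x > len(a) is my own choice.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def mask_knans_windowed(a, x):
    # Hypothetical variant of mask_knans using sliding_window_view.
    a = np.asarray(a, dtype=float)
    if x < 1:
        raise ValueError("x must be at least 1")
    mask = np.ones(a.shape[0], dtype=bool)
    if x > a.shape[0]:
        # No run can be that long, so nothing is masked (assumed behavior).
        return mask
    # Each row holds the isnan flags of x consecutive positions.
    windows = sliding_window_view(np.isnan(a), x)
    # Indices where a window made up entirely of NaN starts.
    starts = np.flatnonzero(windows.all(axis=1))
    # Broadcast each start to the x positions its window covers.
    idx = (starts[:, None] + np.arange(x)).ravel()
    mask[idx] = False
    return mask

a = np.array([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1])
print(mask_knans_windowed(a, 2))
print(mask_knans_windowed(a, 3))
```

Duplicate indices in `idx` are harmless here because every assignment writes the same value, so no de-duplication step is needed.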

score 1 · Accepted Answer · answered Mar 29 '17 at 02:51

I really like numba for problems like this that are easy to grasp but hard to "numpyfy"! Even though the package might be too heavy a dependency for most libraries, it lets you write such "python"-like functions without losing too much speed:

import numpy as np
import numba as nb
import math

@nb.njit
def mask_nan_if_consecutive(arr, limit):  # I'm not good at function names :(
    result = np.ones_like(arr)
    cnt = 0
    for idx in range(len(arr)):
        if math.isnan(arr[idx]):
            cnt += 1
            # If we just reached the limit we need to backtrack,
            # otherwise just mask current.
            if cnt == limit:
                for subidx in range(idx-limit+1, idx+1):
                    result[subidx] = 0
            elif cnt > limit:
                result[idx] = 0
        else:
            cnt = 0

    return result

If you have worked with pure Python, this should be quite easy to understand, and it works:

>>> a = np.array([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1])
>>> mask_nan_if_consecutive(a, 1)
array([ 1.,  0.,  0.,  0.,  1.,  0.,  1.,  1.,  0.,  0.,  1.,  1.])
>>> mask_nan_if_consecutive(a, 2)
array([ 1.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,  0.,  0.,  1.,  1.])
>>> mask_nan_if_consecutive(a, 3)
array([ 1.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])
>>> mask_nan_if_consecutive(a, 4)
array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])
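Note that the outputs above are float arrays, because `np.ones_like(arr)` copies the dtype of the input. If a real boolean mask is wanted, the same counting loop can be sketched as below. This is an illustrative plain-Python variant (the name `mask_nan_runs_py` is hypothetical) that runs without numba; I have not checked how it behaves under `@nb.njit`.

```python
import math
import numpy as np

def mask_nan_runs_py(arr, limit):
    # Hypothetical plain-Python variant of the loop above, returning bool.
    arr = np.asarray(arr, dtype=float)
    result = np.ones(arr.shape[0], dtype=bool)
    cnt = 0  # length of the current run of NaN
    for idx in range(arr.shape[0]):
        if math.isnan(arr[idx]):
            cnt += 1
            if cnt == limit:
                # The run just reached the limit: mask it from its start.
                result[idx - limit + 1: idx + 1] = False
            elif cnt > limit:
                # Already past the limit: mask only the current position.
                result[idx] = False
        else:
            cnt = 0
    return result

a = np.array([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1])
print(mask_nan_runs_py(a, 2))
```

Calling `.astype(bool)` on the float result of the original function gives the same mask.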

But the really nice thing about the @nb.njit decorator is that this function will be fast:

a = np.array([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1])
i = 2

res1 = mask_nan_if_consecutive(a, i)
res2 = mask_knans(a, i)
np.testing.assert_array_equal(res1, res2)

%timeit mask_nan_if_consecutive(a, i)  # 100000 loops, best of 3: 6.03 µs per loop
%timeit mask_knans(a, i)               # 1000 loops, best of 3: 302 µs per loop

So for short arrays this is approximately 50 times faster. The gap narrows for longer arrays, but the numba version is still faster (roughly 7 times here):

a = np.array([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1]*100000)
i = 2

%timeit mask_nan_if_consecutive(a, i)  # 10 loops, best of 3: 20.9 ms per loop
%timeit mask_knans(a, i)               # 10 loops, best of 3: 154 ms per loop
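`%timeit` is an IPython magic, so the lines above only run in IPython or Jupyter. Outside of those, a rough equivalent can be sketched with the standard `timeit` module. The helper `best_time` is a hypothetical name, and `np.isnan` is used only as a stand-in function so the block runs on its own; either of the mask functions from the answers could be passed instead.

```python
import timeit
import numpy as np

def best_time(func, *args, number=100, repeat=3):
    # Hypothetical helper: best average time per call, in seconds.
    # Warm-up call so one-time costs (such as numba's JIT compilation
    # on the first call) are not included in the measurement.
    func(*args)
    totals = timeit.repeat(lambda: func(*args), number=number, repeat=repeat)
    return min(totals) / number

a = np.array([1, np.nan, np.nan, np.nan, 1, np.nan, 1, 1, np.nan, np.nan, 1, 1] * 1000)
t = best_time(np.isnan, a)  # np.isnan is only a placeholder workload
print(f"{t * 1e6:.2f} µs per call")
```

The numbers will differ from the ones quoted above, since they depend on the machine and library versions.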
