How to null N indices following a number in pandas?

Question

I have a Series containing values and NaNs, like so:

0    NaN
1    1.0
2    NaN
3    2.0
4    3.0
5    NaN
6    NaN
7    4.0
8    5.0
9    NaN
dtype: float64

Say, for example, that I'd like to null out the next four indices after each initial value, like so:

0    NaN
1    1.0
2    NaN
3    NaN
4    NaN
5    NaN
6    NaN
7    4.0
8    NaN
9    NaN
dtype: float64

(The four entries directly after the first value, 1.0, are nulled, as are the last two entries directly after the next remaining value, 4.0.)
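For anyone who wants to reproduce this, the input and the desired output can be built as below. This is only a sketch of the example data from the question; the names `s` and `expected` are mine, not part of the original post.

```python
import numpy as np
import pandas as pd

# Input Series from the question
s = pd.Series([np.nan, 1.0, np.nan, 2.0, 3.0, np.nan, np.nan, 4.0, 5.0, np.nan])

# Desired result with N = 4: only the values at positions 1 and 7 survive
expected = pd.Series([np.nan, 1.0, np.nan, np.nan, np.nan,
                      np.nan, np.nan, 4.0, np.nan, np.nan])

print(expected.dropna())
```

A candidate solution can then be checked with `pd.testing.assert_series_equal(result, expected)`.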

There might be... what values should be considered when setting nulls? — cs95, Dec 04 '17 at 21:48

Divakar · Accepted Answer · 2017-12-05T00:22:52.570

Approach #1

Here's a loop-based way that iterates only over the non-null positions -

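import numpy as np

def nullnext(s, W):
    a = s.values  # edits below write into s in place (assumes .values is a writable view)
    # Start of each candidate window: one past every non-null entry
    idx = np.flatnonzero(s.notnull().values)+1
    if len(idx) == 0:  # all-NaN input: nothing to null
        return s
    last_idx = idx[0]
    a[last_idx:last_idx+W] = np.nan
    for i in idx[1:]:
        # Skip entries that fall inside the window that was just nulled
        if i > last_idx + W:
            last_idx = i
            a[last_idx:last_idx+W] = np.nan
    return s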

Sample run -

In [336]: s
Out[336]: 
0    1.0
1    NaN
2    2.0
3    3.0
4    NaN
5    NaN
6    4.0
7    5.0
8    NaN
Name: NaN, dtype: float64

In [337]: nullnext(s, W=4)
Out[337]: 
0    1.0
1    NaN
2    NaN
3    NaN
4    NaN
5    NaN
6    4.0
7    NaN
8    NaN
Name: NaN, dtype: float64
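Note that nullnext writes into the array behind s, so the input changes along with the returned Series. If that is not wanted, a copy-based variant might look like the sketch below. It works on plain NumPy arrays, and the name `null_following` is hypothetical, not something from the answer.

```python
import numpy as np

def null_following(values, W):
    """Return a copy of `values` with the W entries after each kept value set to NaN."""
    out = np.array(values, dtype=float)       # np.array copies, so the input is untouched
    idx = np.flatnonzero(~np.isnan(out)) + 1  # one past every non-null entry
    if idx.size == 0:                         # all-NaN input: nothing to do
        return out
    last = idx[0]
    out[last:last + W] = np.nan
    for i in idx[1:]:
        if i > last + W:                      # first value beyond the last nulled window
            last = i
            out[last:last + W] = np.nan
    return out

x = np.array([np.nan, 1.0, np.nan, 2.0, 3.0, np.nan, np.nan, 4.0, 5.0, np.nan])
print(null_following(x, 4))  # x itself keeps its original values
```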

Approach #2

With a few tweaks, we can port this to numba for better performance. The implementation uses strides to view the array as overlapping windows of length W. The relevant code would look something like this -

import numpy as np
import pandas as pd
from numba import njit

# https://stackoverflow.com/a/40085052/ @Divakar
def strided_app(a, L, S ):  # Window len = L, Stride len/stepsize = S
    nrows = ((a.size-L)//S)+1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(a, shape=(nrows,L), strides=(S*n,n))

@njit
def set_mask(mask, idx, W):
    last_idx = idx[0]
    mask[0] = True
    l = len(idx)
    for i in range(1,l):
        if idx[i] > last_idx + W:
            last_idx = idx[i]
            mask[i] = True
    return mask


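def nullnext_numba(s, W):
    a = s.values
    idx = np.flatnonzero(s.notnull().values)+1
    if len(idx) == 0:  # all-NaN input: nothing to null, and set_mask assumes idx is non-empty
        return pd.Series(a.copy())

    # mask[i] is True where idx[i] starts a window that actually gets nulled
    mask = np.zeros(len(idx),dtype=bool)
    set_mask(mask, idx, W)

    # Pad with W NaNs so that every start position has a full-length window
    a_ext = np.concatenate((a, np.full(W,np.nan)))
    strided_app(a_ext, W, 1)[idx[mask]] = np.nan
    # Drop the padding; the returned Series is new and has a default index
    return pd.Series(a_ext[:-W])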
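To see what the strided view is doing, here is a small NumPy-only sketch. strided_app is copied from the code above; the array `a` and the chosen row are made up for illustration. Each row of the view is a window into the same memory, so assigning to a row writes through to the underlying array.

```python
import numpy as np

def strided_app(a, L, S):  # Window len = L, Stride len/stepsize = S
    nrows = ((a.size - L) // S) + 1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(a, shape=(nrows, L), strides=(S * n, n))

a = np.arange(6, dtype=float)      # [0, 1, 2, 3, 4, 5]
windows = strided_app(a, 3, 1)     # row i is a view of a[i:i+3]
first = windows[1].copy()          # [1, 2, 3]

# Fancy-index one row and assign: this writes into a[1:4], not into a copy
windows[[1]] = np.nan
print(a)
```

Because the rows overlap and share memory, this only behaves predictably when every write stores the same value, which is the case when nulling.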

Further improvement

We could improve memory efficiency, and hence performance, by avoiding the concatenation and making all the edits in place on the input series, like so -

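def nullnext_numba_v2(s, W):
    a = s.values  # edited in place, so s itself is modified
    idx = np.flatnonzero(s.notnull().values)+1
    if len(idx) == 0:  # all-NaN input: nothing to null
        return s

    mask = np.zeros(len(idx),dtype=bool)
    set_mask(mask, idx, W)

    valid_idx = idx[mask]
    # Start positions whose full window of length W fits inside a
    limit_mask = valid_idx < len(a) - W
    strided_app(a, W, 1)[valid_idx[limit_mask]] = np.nan

    # At most one start position remains near the end; null from there to the end
    leftover_idx = valid_idx[~limit_mask]
    if len(leftover_idx)>0:
        a[leftover_idx[0]:] = np.nan
    return s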
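The boundary handling can be traced with a little arithmetic. The sketch below uses the sizes from the sample run (length 9, W = 4) and the two window starts that survive the mask there. The variable names other than `valid_idx` and `limit_mask` are mine.

```python
import numpy as np

a = np.arange(9.0)                    # stand-in data; only the length matters here
W = 4

n_windows = a.size - W + 1            # rows in the strided view: 6 (starts 0..5)
valid_idx = np.array([1, 7])          # window starts kept by the mask in the sample run

limit_mask = valid_idx < a.size - W   # [True, False]
in_range = valid_idx[limit_mask]      # [1]: nulled through the strided view
leftover = valid_idx[~limit_mask]     # [7]: no full window fits from here

tail = a[leftover[0]:]                # 2 elements, shorter than W, handled by a plain slice
```

Start 7 would index past the last row of the view, which is why it is split off and handled by slicing to the end of the array.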

score 1 · Answer 2 · answered Dec 04 '17 at 22:03

The loop way, with numba optimisation:

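import numba
import numpy as np

@numba.njit
def Naning(arr,n=4):
    c=0  # number of upcoming entries still to be nulled
    for i in range(arr.size):
        if c>0:
            arr[i]=np.nan
            c-=1
        elif not np.isnan(arr[i]):
            c=n  # found a value to keep: null the next n entries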

Run:

In [419]: df=pd.read_clipboard(header=None,index_col=0)

In [420]: df
Out[420]: 
     1
0     
0  NaN
1  1.0
2  NaN
3  2.0
4  3.0
5  NaN
6  NaN
7  4.0
8  5.0
9  NaN

In [421]: arr=df.values.squeeze()

In [422]: %timeit Naning(arr)
662 ns ± 9.46 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Result

In [423]: df
Out[423]: 
     1
0     
0  NaN
1  1.0
2  NaN
3  NaN
4  NaN
5  NaN
6  NaN
7  4.0
8  NaN
9  NaN
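Naning returns nothing and edits arr in place, which is why df itself shows the result above. For checking the logic without numba, the same counter idea can be sketched in plain NumPy on a copy. The function name `nan_after_values` is hypothetical, and this version will be much slower than the compiled one on large arrays.

```python
import numpy as np

def nan_after_values(arr, n=4):
    """Counter-based pass: after each kept value, null the next n entries (on a copy)."""
    out = np.asarray(arr, dtype=float).copy()
    remaining = 0                      # entries still to be nulled
    for i in range(out.size):
        if remaining > 0:
            out[i] = np.nan
            remaining -= 1
        elif not np.isnan(out[i]):
            remaining = n              # keep this value, null the next n
    return out

data = [np.nan, 1.0, np.nan, 2.0, 3.0, np.nan, np.nan, 4.0, 5.0, np.nan]
print(nan_after_values(data))
```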
