
I have a pandas Series containing values and NaNs, like so:

0    NaN
1    1.0
2    NaN
3    2.0
4    3.0
5    NaN
6    NaN
7    4.0
8    5.0
9    NaN
dtype: float64

Say, for example, that I'd like to null out the next four positions after each initial value, like so:

0    NaN
1    1.0
2    NaN
3    NaN
4    NaN
5    NaN
6    NaN
7    4.0
8    NaN
9    NaN
dtype: float64

(The four values directly after the first value, 1.0, are nulled, as are the last two directly after the next surviving value, 4.0.)

Jonas Byström

2 Answers


Approach #1

Here's a loopy way that iterates only through the list of non-null positions -

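import numpy as np

def nullnext(s, W):
    a = s.values  # NumPy array backing s; writing to it modifies s in place
    # Positions right after each non-null value, i.e. where a window would start
    idx = np.flatnonzero(s.notnull().values)+1
    if len(idx) == 0:  # all-NaN input: nothing to null out
        return s
    last_idx = idx[0]
    a[last_idx:last_idx+W] = np.nan
    for i in idx[1:]:
        # Values inside the previous window were just nulled, so skip them
        if i > last_idx + W:
            last_idx = i
            a[last_idx:last_idx+W] = np.nan
    return s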

Sample run -

In [336]: s
Out[336]: 
0    1.0
1    NaN
2    2.0
3    3.0
4    NaN
5    NaN
6    4.0
7    5.0
8    NaN
Name: NaN, dtype: float64

In [337]: nullnext(s, W=4)
Out[337]: 
0    1.0
1    NaN
2    NaN
3    NaN
4    NaN
5    NaN
6    4.0
7    NaN
8    NaN
Name: NaN, dtype: float64
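
Note that nullnext writes through s.values, so the Series passed in is modified as well. If the input should stay intact, one option is a copying variant. The sketch below is an adaptation, not part of the approach above: the name nullnext_copy is made up for illustration, and it assumes a float Series.

```python
import numpy as np
import pandas as pd

def nullnext_copy(s, W):
    """Same window logic as nullnext, but applied to a copy of the data."""
    a = s.to_numpy(dtype=float, copy=True)   # copy, so s is left untouched
    idx = np.flatnonzero(~np.isnan(a)) + 1   # one past each non-null value
    if idx.size == 0:                        # all-NaN input: nothing to do
        return s.copy()
    last_idx = idx[0]
    a[last_idx:last_idx + W] = np.nan
    for i in idx[1:]:
        if i > last_idx + W:                 # value survived the previous window
            last_idx = i
            a[last_idx:last_idx + W] = np.nan
    return pd.Series(a, index=s.index, name=s.name)

s = pd.Series([1, np.nan, 2, 3, np.nan, np.nan, 4, 5, np.nan])
out = nullnext_copy(s, W=4)
print(out.tolist())   # [1.0, nan, nan, nan, nan, nan, 4.0, nan, nan]
print(s.tolist())     # unchanged input
```

The cost is one extra array of the same size as the input.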

Approach #2

With a few tweaks, we can port this to numba for better performance. The implementation uses a strided view of the array to null out all the selected windows in one assignment. The relevant code looks something like this -

import numpy as np
import pandas as pd
from numba import njit

# https://stackoverflow.com/a/40085052/ @Divakar
def strided_app(a, L, S ):  # Window len = L, Stride len/stepsize = S
    nrows = ((a.size-L)//S)+1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(a, shape=(nrows,L), strides=(S*n,n))

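@njit
def set_mask(mask, idx, W):
    # Mark the window starts that are applied; the rest fall inside an earlier window
    if len(idx) == 0:  # all-NaN input: nothing to mark
        return mask
    last_idx = idx[0]
    mask[0] = True
    l = len(idx)
    for i in range(1,l):
        if idx[i] > last_idx + W:
            last_idx = idx[i]
            mask[i] = True
    return mask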


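def nullnext_numba(s, W):
    a = s.values
    # Window start positions: one past each non-null value
    idx = np.flatnonzero(s.notnull().values)+1

    # mask[i] is True where idx[i] starts a window that is actually applied
    mask = np.zeros(len(idx),dtype=bool)
    set_mask(mask, idx, W)

    # Pad with W NaNs so that every window, even one starting at len(a), fits
    a_ext = np.concatenate((a, np.full(W,np.nan)))
    # Row k of the strided view is a_ext[k:k+W]; null all selected windows at once
    strided_app(a_ext, W, 1)[idx[mask]] = np.nan
    # Drop the padding and keep the original index and name
    return pd.Series(a_ext[:len(a)], index=s.index, name=s.name)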
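
To see why indexing the strided view by row nulls whole windows, here is a small standalone sketch. It repeats the strided_app helper so that it runs on its own; the array and the row numbers are made-up illustration values.

```python
import numpy as np

def strided_app(a, L, S):  # same helper as above, repeated for a standalone run
    nrows = ((a.size - L) // S) + 1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(a, shape=(nrows, L), strides=(S * n, n))

a = np.arange(10, dtype=float)
windows = strided_app(a, 4, 1)   # 7 overlapping rows; row k is a view of a[k:k+4]

# Row 1 covers a[1:5] and row 6 covers a[6:10]. The rows are views, so the
# assignment writes straight into a without copying any data.
windows[[1, 6]] = np.nan
print(a)   # [ 0. nan nan nan nan  5. nan nan nan nan]
```

Because the rows overlap, the same element of a can be written more than once; that is harmless here since every write stores the same value.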

Further improvement

We can improve memory efficiency, and with it performance, by skipping the concatenation and making all the edits in place on the input series, like so -

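def nullnext_numba_v2(s, W):
    a = s.values  # edited in place, so s itself is modified
    idx = np.flatnonzero(s.notnull().values)+1

    mask = np.zeros(len(idx),dtype=bool)
    set_mask(mask, idx, W)

    valid_idx = idx[mask]
    # Windows starting before len(a) - W fit entirely inside a
    limit_mask = valid_idx < len(a) - W
    strided_app(a, W, 1)[valid_idx[limit_mask]] = np.nan

    # At most one applied window can reach the end of a; null from its start onwards
    leftover_idx = valid_idx[~limit_mask]
    if len(leftover_idx)>0:
        a[leftover_idx[0]:] = np.nan
    return s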
Divakar

A plain loop, compiled with numba:

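import numba
import numpy as np

@numba.njit
def Naning(arr,n=4):
    c=0  # how many upcoming positions still have to be nulled
    for i in range(arr.size):
        if c>0:
            arr[i]=np.nan
            c-=1
        elif not np.isnan(arr[i]):
            c=n  # a surviving value starts a new window of n positions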

Run:

In [419]: df=pd.read_clipboard(header=None,index_col=0)

In [420]: df
Out[420]: 
     1
0     
0  NaN
1  1.0
2  NaN
3  2.0
4  3.0
5  NaN
6  NaN
7  4.0
8  5.0
9  NaN

In [421]: arr=df.values.squeeze()

In [422]: %timeit Naning(arr)
662 ns ± 9.46 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Result

In [423]: df
Out[423]: 
     1
0     
0  NaN
1  1.0
2  NaN
3  NaN
4  NaN
5  NaN
6  NaN
7  4.0
8  NaN
9  NaN
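
To check results without numba installed, or to work on a copy instead of in place, the same counter logic can be run as plain Python. This is only a sketch: naning_py is a name made up here, and it will be far slower than the compiled version on large arrays.

```python
import numpy as np

def naning_py(arr, n=4):
    out = np.array(arr, dtype=float)  # np.array copies, so arr stays untouched
    c = 0                             # positions still to be nulled
    for i in range(out.size):
        if c > 0:
            out[i] = np.nan
            c -= 1
        elif not np.isnan(out[i]):
            c = n                     # surviving value: null the next n positions
    return out

arr = np.array([np.nan, 1, np.nan, 2, 3, np.nan, np.nan, 4, 5, np.nan])
print(naning_py(arr))   # [nan  1. nan nan nan nan nan  4. nan nan]
```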
B. M.