
I recently converted a voluminous Excel file, which was filled with random stretches of empty cells in its columns, into a pandas DataFrame. The resulting DataFrame consequently had long stretches of NaNs.

However, after some operations on the DataFrame, I created some smaller bunches of NaNs here and there, and those NaNs I wanted to keep. So I tried to write a function that builds a dictionary of chunks of numbers separated by a sufficiently long run of NaNs (so the output is segmented only at the original Excel file's missing data).

My code:

import numpy as np

def nan_stripper(data, bound):

    newdict = {}
    chunk = 0

    i = 0
    while i < len(data):

        if ~np.isnan(data[i]):
            newdict.setdefault('chunk ' + str(chunk), []).append(data[i])
            i += 1
            continue

        elif np.isnan(data[i]):

            # Clear the buffer for the next stretch of nan's
            buffer = []
            while np.isnan(data[i]):
                buffer.append(data[i])
                i += 1

                # When the stretch ends, append the collected nan's if the run is
                # shorter than the bound; otherwise start a new chunk.
                if ~np.isnan(data[i]):
                    if len(buffer) < bound + 1:
                        newdict['chunk ' + str(chunk)].extend(buffer)

                    if len(buffer) >= bound + 1:
                        chunk += 1

    return newdict

Here is a test, using a NaN bound of 3:

a = np.array([-1,1,2,3,np.nan,np.nan,np.nan,np.nan,4,5,np.nan,np.nan,7,8,9,10])
b = nan_stripper(a,3)

print(b)
{'chunk 0': [-1.0, 1.0, 2.0, 3.0],
 'chunk 1': [4.0, 5.0, nan, nan, 7.0, 8.0, 9.0, 10.0]}

The thing is, I do not believe my code is efficient, given that I used a weird dictionary method (found here) in order to append additional values to single keys. Are there any easy optimizations that I am missing, or would some of you have gone about this in a completely different way? I feel this can't possibly be the most Pythonic way to do it.
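
For reference, the setdefault idiom creates an empty list the first time a key is seen and appends to it thereafter; collections.defaultdict(list) from the standard library expresses the same thing more directly (a standalone illustration, not part of my function):

from collections import defaultdict

d = {}
d.setdefault('chunk 0', []).append(1.0)  # creates the list, then appends
d.setdefault('chunk 0', []).append(2.0)  # list already exists, just appends

dd = defaultdict(list)                   # same behaviour, no setdefault needed
dd['chunk 0'].append(1.0)
dd['chunk 0'].append(2.0)

print(d)         # {'chunk 0': [1.0, 2.0]}
print(dict(dd))  # {'chunk 0': [1.0, 2.0]}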

Thank you in advance.

Comparison with answer: After timing both my original approach and Paul Panzer's, these are the results for those interested.

[screenshot of timing results]
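
For anyone who wants to reproduce the comparison, a minimal timeit sketch on the test array a from above (not the exact benchmark behind the screenshot):

import timeit

setup = 'from __main__ import nan_stripper, nan_split, a'
t_orig = timeit.timeit('nan_stripper(a, 3)', setup=setup, number=10000)
t_vec = timeit.timeit('nan_split(a, 3, make_dict=True)', setup=setup, number=10000)
print(f'original: {t_orig:.3f} s, vectorized: {t_vec:.3f} s')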

Coolio2654

1 Answer


Here is a vectorized version. It slides a window of size bound over the data, keeping a running count of NaNs, and marks the offsets where the count equals bound (i.e. where the window contains only NaNs).

Afterwards, it lumps together stretches of such marks, splits at the boundaries, and discards every other piece (those containing only NaNs).

import numpy as np

def nan_split(data, bound, make_dict=False):
    data = np.asanyarray(data)
    # find nans, convert boolean mask to int8 to enable basic arithmetic
    m = np.isnan(data).view(np.int8)
    # compute windowed sum (same as windowed nan count)
    m[bound:] -= m[:-bound]
    # find all all-nan window offsets
    m = m.cumsum() == bound
    # find offsets where it switches between all-nan and non all-nan
    idx, = np.where(m[1:] != m[:-1])
    # correct for window size and edge loss
    idx[::2] += 2-bound
    idx[1::2] += 1
    # split
    if make_dict:
        return {f'chunk {i}': c
                for i, c in enumerate(np.split(data, idx)[2*(idx[0]==0)::2])}
    else:
        return np.split(data, idx)[2*(idx[0]==0)::2]
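
Called on the question's test array, this returns the same two chunks as the original function, just as NumPy arrays rather than lists:

a = np.array([-1, 1, 2, 3, np.nan, np.nan, np.nan, np.nan,
              4, 5, np.nan, np.nan, 7, 8, 9, 10])

print(nan_split(a, 3, make_dict=True))
# {'chunk 0': array([-1., 1., 2., 3.]),
#  'chunk 1': array([ 4., 5., nan, nan, 7., 8., 9., 10.])}
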
Paul Panzer
  • Thanks for the commented and decently-explained answer! As seen in my edit, your code is marvelously faster than my original. Would you give some direction on learning how to optimize code as well as you did (books, etc.)? You utilized a few functions that I've never used before (for numpy it was asanyarray and view, and enumerate as well). – Coolio2654 May 18 '18 at 19:49
  • @Coolio2654. I myself did not learn from books, so I can't really recommend any. If I were to make a recommendation it would be to get a solid foundation in algorithms, complexity theory, etc., if you haven't already got it. Another thing that helps me is knowing at least one low-level language like `C`, because even when using the conveniences of a high-level language like `Python`, having an idea of how they are likely to be implemented is very helpful when optimizing. I wish I had more concrete advice. Anyway, this is how I got there: sound fundamentals and practice, practice, practice. – Paul Panzer May 18 '18 at 20:16