I recently converted a voluminous Excel file, which was filled with random column-long stretches of empty cells, into a pandas DataFrame. The resulting DataFrame consequently had long stretches of NaN's.
However, after some operations on the DataFrame, I created some smaller bunches of NaN's here and there, and those NaN's I wanted to keep. So I tried to write a function that builds a dictionary of chunks of numbers separated by a sufficiently long run of NaN's, so that the output is segmented only by the original Excel file's missing data.
My code:
import numpy as np

def nan_stripper(data, bound):
    newdict = {}
    chunk = 0
    i = 0
    while i < len(data):
        if ~np.isnan(data[i]):
            newdict.setdefault('chunk ' + str(chunk), []).append(data[i])
            i += 1
            continue
        elif np.isnan(data[i]):
            # Create clear buffer for next chunk of nan's
            buffer = []
            while np.isnan(data[i]):
                buffer.append(data[i])
                i += 1
            # When stretch ends, append processed nan's if below selected bound,
            # and prepare for next number segment.
            if ~np.isnan(data[i]):
                if len(buffer) < bound + 1:
                    newdict['chunk ' + str(chunk)].extend(buffer)
                if len(buffer) >= bound + 1:
                    chunk += 1
    return newdict
Testing it here, using a NaN bound of 3:
a = np.array([-1,1,2,3,np.nan,np.nan,np.nan,np.nan,4,5,np.nan,np.nan,7,8,9,10])
b = nan_stripper(a,3)
print(b)
{'chunk 0': [-1.0, 1.0, 2.0, 3.0],
'chunk 1': [4.0, 5.0, nan, nan, 7.0, 8.0, 9.0, 10.0]}
The thing is, I do not believe my code is efficient, given that I used a somewhat obscure dictionary method (found here) in order to append additional values to single keys. Are there any easy optimizations I am missing, or would some of you have gone about this a completely different way? I feel this can't possibly be the most Pythonic way to do it.
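(For reference, my understanding is that the setdefault call is doing the same job that collections.defaultdict(list) would do; a minimal sketch of that pattern, with purely illustrative keys and values:)

from collections import defaultdict

chunks = defaultdict(list)       # missing keys start out as empty lists
chunks['chunk 0'].append(-1.0)   # no setdefault needed
chunks['chunk 0'].append(1.0)
print(dict(chunks))              # {'chunk 0': [-1.0, 1.0]}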
Thank you in advance.
Comparison with answer: After timing both my original approach and Paul Panzer's, these are the results for those interested.
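(For anyone who wants to reproduce the comparison, a minimal timing sketch along these lines should work; paul_panzer_version is only a placeholder name for the function from the answer, and the test array is just the small example above, tiled to reduce timing noise.)

import timeit

import numpy as np

# Placeholder test data: the example array from above, repeated 1000 times.
a = np.tile(np.array([-1, 1, 2, 3, np.nan, np.nan, np.nan, np.nan,
                      4, 5, np.nan, np.nan, 7, 8, 9, 10]), 1000)

print(timeit.timeit(lambda: nan_stripper(a, 3), number=10))
# print(timeit.timeit(lambda: paul_panzer_version(a, 3), number=10))  # placeholder name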