I want to find the start and stop indices of blocks of identical values in a numpy array, or preferably in a pandas DataFrame (blocks along the columns for a 2D array, and along the most quickly varying index for an n-dimensional array). I only look for blocks along a single dimension and do not want to aggregate NaNs across different rows.
Starting from this question (Find large number of consecutive values fulfilling condition in a numpy array), I wrote the following solution that finds blocks of np.nan in a 2D array:
import numpy as np

a = np.array([
    [1,      np.nan, np.nan, 2],
    [np.nan, 1,      np.nan, 3],
    [np.nan, np.nan, np.nan, np.nan],
])

nan_mask = np.isnan(a)

# A block starts at a NaN in the first column, or wherever a non-NaN is
# immediately followed by a NaN.
start_nans_mask = np.hstack((
    np.resize(nan_mask[:, 0], (a.shape[0], 1)),
    np.logical_and(np.logical_not(nan_mask[:, :-1]), nan_mask[:, 1:]),
))

# A block stops wherever a NaN is immediately followed by a non-NaN, or
# at a NaN in the last column.
stop_nans_mask = np.hstack((
    np.logical_and(nan_mask[:, :-1], np.logical_not(nan_mask[:, 1:])),
    np.resize(nan_mask[:, -1], (a.shape[0], 1)),
))

start_row_idx, start_col_idx = np.where(start_nans_mask)
stop_row_idx, stop_col_idx = np.where(stop_nans_mask)
This lets me, for example, analyze the distribution of the lengths of the patches of missing values before applying pd.fillna:
stop_col_idx - start_col_idx + 1
array([2, 1, 1, 4], dtype=int64)
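For instance, to get the distribution itself, a minimal sketch using np.bincount on those lengths (index i holds the number of patches of length i):

patch_lengths = stop_col_idx - start_col_idx + 1
# One patch of length 2, two of length 1, one of length 4.
np.bincount(patch_lengths)
# array([0, 2, 1, 0, 1])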
One more example and the expected result:
a = np.array([
    [1,      np.nan, np.nan, 2],
    [np.nan, 1,      np.nan, np.nan],
    [np.nan, np.nan, np.nan, np.nan],
])
array([2, 1, 2, 4], dtype=int64)
and not
array([2, 1, 6], dtype=int64)
My questions are the following:
- Is there a way to optimize my solution (finding the starts and the stops in a single pass of mask/where operations)? A sketch of what I mean by a single pass follows this list.
- Is there a more optimized solution in pandas, i.e. something other than applying mask/where to the DataFrame's values? (The shift/cumsum idiom sketched after this list is the kind of thing I have in mind.)
- What happens when the underlying array or DataFrame is too big to fit in memory?
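For the first question, here is a minimal sketch of what a single pass could look like: pad the mask with False on both sides of each row, then take np.diff along the columns, so +1 marks a start and -1 marks one past a stop. This is only a sketch under those assumptions, not a measured improvement:

import numpy as np

a = np.array([
    [1,      np.nan, np.nan, 2],
    [np.nan, 1,      np.nan, np.nan],
    [np.nan, np.nan, np.nan, np.nan],
])

nan_mask = np.isnan(a)

# Pad each row with False so blocks touching an edge still produce a
# transition, then diff along the columns: +1 is the first column of a
# block, -1 is one past its last column.
padded = np.pad(nan_mask, ((0, 0), (1, 1)), constant_values=False)
d = np.diff(padded.astype(np.int8), axis=1)

start_row_idx, start_col_idx = np.where(d == 1)
stop_row_idx, stop_col_idx = np.where(d == -1)
stop_col_idx -= 1  # make the stops inclusive, as in the masks above

stop_col_idx - start_col_idx + 1
# array([2, 1, 2, 4])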
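For the second question, this is the kind of pandas idiom I mean, labelling runs with a shift/cumsum trick and grouping on the labels; nan_run_lengths is a helper name I made up for this sketch, and I have not benchmarked it against the mask approach:

import numpy as np
import pandas as pd

df = pd.DataFrame([
    [1,      np.nan, np.nan, 2],
    [np.nan, 1,      np.nan, np.nan],
    [np.nan, np.nan, np.nan, np.nan],
])

def nan_run_lengths(row):
    # Each change in isna() starts a new run; cumsum turns the change
    # points into run labels, and summing the booleans per label gives
    # the NaN count of each run (0 for non-NaN runs, which are dropped).
    isna = row.isna()
    run_id = (isna != isna.shift()).cumsum()
    lengths = isna.groupby(run_id).sum()
    return lengths[lengths > 0].to_numpy()

np.concatenate([nan_run_lengths(df.loc[i]) for i in df.index])
# array([2, 1, 2, 4])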