I have hundreds of time series objects, each with hundreds of thousands of entries. Some percentage of the entries are missing (NaN). For my application it matters whether those are single, scattered NaNs or long sequences of NaNs.
Therefore I would like a function that gives me the run length of each contiguous sequence of NaNs. I can do
myseries.isnull()
to get a boolean series, and I can take a moving median or moving average of it to get an idea of the size of the data holes. However, it would be nice if there were an efficient way of getting a list of hole lengths for a series.
I.e., it would be nice to have a function myfunc such that
a = pd.Series([1, 2, 3, np.nan, 4, np.nan, np.nan, np.nan, 5, np.nan, np.nan])
myfunc(a.isnull())
==> Series([1, 3, 2])
(because there are 1, 3 and 2 NaNs, respectively)
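To make the desired behaviour concrete, here is a minimal sketch of what such a function might look like. The name hole_lengths is only a hypothetical stand-in for myfunc, and the shift/cumsum/groupby approach is one possible technique that I have not benchmarked on series of my size.

```python
import numpy as np
import pandas as pd

def hole_lengths(mask: pd.Series) -> pd.Series:
    """Hypothetical stand-in for myfunc: length of each contiguous run of True."""
    mask = mask.astype(bool)
    # A new run starts wherever the value differs from the previous entry,
    # so the cumulative sum of those change points labels each run.
    run_id = mask.ne(mask.shift()).cumsum()
    # Keep only the True entries and count how many fall into each run.
    lengths = mask[mask].groupby(run_id[mask]).size()
    return lengths.reset_index(drop=True)

a = pd.Series([1, 2, 3, np.nan, 4, np.nan, np.nan, np.nan, 5, np.nan, np.nan])
print(hole_lengths(a.isnull()).tolist())  # prints [1, 3, 2]
```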
From that, I can make histograms of hole lengths, and of the logical AND or OR of the isnull masks of multiple series (that might be substitutes for each other), and other nice things.
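As an illustration of the kind of analysis I have in mind, the sketch below combines the isnull masks of two made-up series and tabulates the hole lengths. The helper run_lengths and the sample data s1 and s2 are hypothetical, and the NumPy edge-detection approach is just one way this could be done.

```python
import numpy as np
import pandas as pd

def run_lengths(mask):
    """Hypothetical helper: lengths of contiguous True runs in a boolean array."""
    m = np.asarray(mask, dtype=bool)
    # Pad with False on both sides so every run has a start and an end edge.
    padded = np.concatenate(([False], m, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    # Edges alternate start, end, start, end, ...
    return edges[1::2] - edges[::2]

# Made-up example data: two series that could substitute for each other.
s1 = pd.Series([1, np.nan, np.nan, 4, np.nan, 6])
s2 = pd.Series([1, 2, np.nan, np.nan, np.nan, 6])

both_missing = s1.isnull() & s2.isnull()    # holes the substitute cannot fill
either_missing = s1.isnull() | s2.isnull()  # holes in at least one series

# Histogram of hole lengths: index = hole length, value = number of holes.
hist_both = pd.Series(run_lengths(both_missing)).value_counts().sort_index()
hist_either = pd.Series(run_lengths(either_missing)).value_counts().sort_index()
print(hist_both)
print(hist_either)
```

With this sample data, the AND mask has two holes of length 1, while the OR mask has a single hole of length 4.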
I would also like to get ideas of other ways to quantify the "clumpiness" of the data holes.