Pandas: run length of NaN holes

Question

I have hundreds of time series objects, each with hundreds of thousands of entries. Some percentage of the data entries are missing (NaN). It matters to my application whether those are single, scattered NaNs or long sequences of NaNs.

Therefore I would like a function that gives me the run length of each contiguous sequence of NaNs. I can do

myseries.isnull()

to get a boolean Series. I can also use a moving median or moving average to get an idea of the size of the data holes. However, it would be nice to have an efficient way of getting a list of hole lengths for a series.

I.e., it would be nice to have a myfunc so that

a = pd.Series([1, 2, 3, np.nan, 4, np.nan, np.nan, np.nan, 5, np.nan, np.nan])
myfunc(a.isnull())
==> Series([1, 3, 2])

(because there are 1, 3 and 2 NaNs, respectively)

From that, I can make histograms of hole lengths, and of the AND or OR of isnull() across multiple series (that might be substitutes for each other), and other nice things.

I would also like to get ideas of other ways to quantify the "clumpiness" of the data holes.

score 13 · Accepted Answer · answered May 31 '13 at 12:58

import pandas as pd
import numpy as np
import itertools

a = pd.Series([1, 2, 3, np.nan, 4, np.nan, np.nan, np.nan, 5, np.nan, np.nan])
len_holes = [len(list(g)) for k, g in itertools.groupby(a, lambda x: np.isnan(x)) if k]
print(len_holes)

results in

[1, 3, 2]
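
For very long series, the same result can probably be obtained without a Python-level loop. The following is only a sketch of one vectorized alternative, not part of the original answer; the function name nan_run_lengths is made up for illustration. It labels each block of consecutive equal mask values with a running counter and sums the mask per block.

```python
import numpy as np
import pandas as pd


def nan_run_lengths(series):
    """Return the length of each contiguous NaN run, in order of appearance.

    Hypothetical helper name; a sketch, not a tuned implementation.
    """
    is_hole = series.isnull()
    # A new block starts wherever the mask differs from the previous element.
    block_id = (is_hole != is_hole.shift()).cumsum()
    # Summing the boolean mask per block gives the run length for NaN blocks
    # and 0 for blocks of valid data.
    lengths = is_hole.groupby(block_id).sum()
    return lengths[lengths > 0].astype(int).reset_index(drop=True)


a = pd.Series([1, 2, 3, np.nan, 4, np.nan, np.nan, np.nan, 5, np.nan, np.nan])
print(nan_run_lengths(a).tolist())  # [1, 3, 2]
```

Whether this is actually faster than itertools.groupby depends on the data, so it is worth timing both on a real series.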

Wouter Overmeire

`Series([len(list(g)) for k, g in groupby(a.isnull()) if k])` is probably slightly more efficient. – JAB May 31 '13 at 13:03
Ah yes, 'groupby'. I guess groupby is meant to operate on sorted data, but when not sorting first, it gets me exactly the RLE :) – Bjarke Ebert May 31 '13 at 13:20
Thanks, JAB, I didn't know groupby could take just a single parameter. That's great, because I have a.isnull() already calculated in my code – Bjarke Ebert May 31 '13 at 13:23
@JAB +1 for elegance, and also for efficiency: I got a 5x speedup using your method (3.65ms vs 15.9ms). – A.Wan Jun 10 '14 at 17:28
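
Putting JAB's comment together with the question, a complete version of myfunc might look like the sketch below. It assumes the input is the boolean mask from a.isnull(); the histogram step at the end is just one way to summarize the result, using value_counts.

```python
from itertools import groupby

import numpy as np
import pandas as pd


def myfunc(mask):
    """Run lengths of consecutive True values in a boolean mask."""
    # groupby without a key function groups consecutive equal values,
    # so each group with a True key is one contiguous hole.
    runs = [sum(1 for _ in g) for k, g in groupby(mask) if k]
    return pd.Series(runs, dtype="int64")


a = pd.Series([1, 2, 3, np.nan, 4, np.nan, np.nan, np.nan, 5, np.nan, np.nan])
hole_lengths = myfunc(a.isnull())
print(hole_lengths.tolist())  # [1, 3, 2]

# Histogram of hole lengths: how many holes there are of each length.
histogram = hole_lengths.value_counts().sort_index()
print(histogram)
```

Counting with sum(1 for _ in g) avoids building a temporary list for each run, which may matter a little when the holes are long.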

score 0 · Answer 2 · answered Feb 15 '23 at 07:20

You can use the run_length.encode function from more_itertools:

import pandas as pd
import numpy as np
import more_itertools as mit

a = pd.Series([1, 2, 3, np.nan, 4, np.nan, np.nan, np.nan, 5, np.nan, np.nan])
len_holes = [v[1] for v in mit.run_length.encode(np.isnan(a)) if v[0]]
print(len_holes)
