
I would like to find the starting indexes of the runs of consecutive NaN values in a Pandas DataFrame, together with each run's size, but only for runs of more than 3 consecutive NaN. That is:

58234         NaN
58235         NaN
58236    0.424323
58237    0.424323
58238         NaN
58239         NaN
58240         NaN
58241         NaN
58242         NaN
58245         NaN
58246    1.483380
58247    1.483380

Should return something like (58238, 6). The actual format of the return doesn't matter too much. I have found the following.

df.a.isnull().astype(int).groupby(df.a.notnull().astype(int).cumsum()).sum()

But it is not returning the right values per index. This question might be very similar to Identifying consecutive NaN's with pandas, but any help would be very much appreciated as I am a total noob in Pandas.

Noque

4 Answers


I broke down the steps:

df['Group'] = df.a.notnull().astype(int).cumsum()
df = df[df.a.isnull()]
df = df[df.Group.isin(df.Group.value_counts()[df.Group.value_counts() > 3].index)]
df['count'] = df.groupby('Group')['Group'].transform('size')
df.drop_duplicates(['Group'], keep='first')
Out[734]: 
        a  Group  count
ID                     
58238 NaN      2      6
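The steps above can also be condensed into a single pass. This is a sketch, not the answer's code: the Series `s`, the `'first'`/`'size'` aggregation names, and the output format are my choices, built from the question's sample data.

```python
import numpy as np
import pandas as pd

# Series mirroring the question's data (note the gap in the index)
idx = [58234, 58235, 58236, 58237, 58238, 58239,
       58240, 58241, 58242, 58245, 58246, 58247]
vals = [np.nan, np.nan, 0.424323, 0.424323, np.nan, np.nan,
        np.nan, np.nan, np.nan, np.nan, 1.483380, 1.483380]
s = pd.Series(vals, index=idx, name='a')

group = s.notnull().cumsum()            # label is constant within each NaN run
mask = s.isnull()
runs = (s.index[mask].to_series()       # index positions of the NaNs...
          .groupby(group[mask].values)  # ...grouped per run
          .agg(['first', 'size']))      # run start and run length
out = runs[runs['size'] > 3]
print(list(out.itertuples(index=False, name=None)))   # [(58238, 6)]
```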
BENY

Assuming df has those as two columns named A and B, here's one vectorized approach -

import numpy as np

thresh = 3

a = df.A.values
b = df.B.values

idx0 = np.flatnonzero(np.r_[True, np.diff(np.isnan(b)) != 0, True])
count = np.diff(idx0)
idx = idx0[:-1]
valid_mask = (count >= thresh) & np.isnan(b[idx])
out_idx = idx[valid_mask]
out_num = a[out_idx]
out_count = count[valid_mask]
out = list(zip(out_num, out_count))

Sample input, output -

In [285]: df
Out[285]: 
        A         B
0   58234       NaN
1   58235       NaN
2   58236  0.424323
3   58237  0.424323
4   58238       NaN
5   58239       NaN
6   58240       NaN
7   58241       NaN
8   58242       NaN
9   58245       NaN
10  58246  1.483380
11  58247  1.483380

In [286]: out
Out[286]: [(58238, 6)]

With thresh = 2, we have -

In [288]: out
Out[288]: [(58234, 2), (58238, 6)]
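The heart of this approach is the boundary trick: `np.diff(np.isnan(b)) != 0` is True exactly where a run of NaNs starts or ends, and padding both ends with True makes `np.flatnonzero` return every run boundary, whose `np.diff` gives each run's length. A minimal illustration on a toy array (my example data, not the answer's):

```python
import numpy as np

b = np.array([np.nan, np.nan, 1.0, 1.0, np.nan, np.nan, np.nan, 2.0])
idx0 = np.flatnonzero(np.r_[True, np.diff(np.isnan(b)) != 0, True])
print(idx0)            # [0 2 4 7 8] -- every run boundary
print(np.diff(idx0))   # [2 2 3 1]   -- the length of each run
```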
Divakar

Unfortunately, groupby doesn't work for NaN values, so here's a somewhat dirty way of doing what you want (dirty in the sense that I create a fake column).

As an aside, the way the itertools.groupby function works is that it groups consecutive items that have the same key-function value. enumerate pairs each value of nanindices with its position (e.g. if nanindices is [0, 1, 4, 5, 6], enumerate returns [(0, 0), (1, 1), (2, 4), (3, 5), (4, 6)]). The key function is the position minus the value. When the value and position both go up by one at the same time (i.e. the values are consecutive), that difference stays the same, so this groups runs of consecutive numbers.
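The position-minus-value trick from the paragraph above, in isolation (using the example values from that paragraph):

```python
import itertools

nanindices = [0, 1, 4, 5, 6]
# key is position minus value: constant within each consecutive run
runs = [[x for _, x in g]
        for _, g in itertools.groupby(enumerate(nanindices), lambda p: p[0] - p[1])]
print(runs)   # [[0, 1], [4, 5, 6]]
```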

itemgetter(n) is just a callable you can apply to an item to get its nth element using its __getitem__ method. I mapped it over the result of the groupby simply because you can't call len directly on the iterable, g, that it returns. You could simply convert g to a list and call len on that if you don't need the actual consecutive values.

import numpy as np
import pandas as pd
import itertools
from operator import itemgetter

locations = []
df = pd.DataFrame([np.nan]*2 + [5]*3 + [np.nan]*3 + [4]*3 + [3]*2 + [np.nan]*4, columns=['A'])
df['B'] = df['A'].fillna(-1)          # fake column: NaNs become a groupable value
nanindices = df.reset_index().groupby('B')['index'].apply(np.array).loc[-1]
for k, g in itertools.groupby(enumerate(nanindices), lambda p: p[0] - p[1]):
    consec = list(map(itemgetter(1), g))
    num_consec = len(consec)
    if num_consec >= 3:
        locations.append((consec[0], num_consec))

print(locations)

For the sample DataFrame I used, the data looks like:

     A
0   NaN
1   NaN
2   5.0
3   5.0
4   5.0
5   NaN
6   NaN
7   NaN
8   4.0
9   4.0
10  4.0
11  3.0
12  3.0
13  NaN
14  NaN
15  NaN
16  NaN

And the program prints:

[(5, 3), (13, 4)]
Saedeas

So this will be a bit slow, and I am also a learning rookie in pandas and Python. It is super ugly, but without knowing any more about your dataset, here is how I would do it.

current_consec = 0
threeormore = 0

for i in dataset[whatever column you need]:
    if pd.isnull(i):
        if current_consec == 3:
            current_consec = 0
            threeormore += 1
        else:
            current_consec += 1
    else:
        current_consec = 0

Because it runs down the index numerically, it will find each run in order. The ONLY thing is that if you don't want to count every time there are three in a row (say, for 6 consecutive NaNs), you would have to modify the code a bit to not reset current_consec to 0, and use a pass statement instead.
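For instance, a variant of the loop that keeps counting through longer runs and records each run's start and length once the run ends. This is a sketch: the sample Series, the variable names, and the threshold of 3 are my assumptions, not the answer's code.

```python
import numpy as np
import pandas as pd

series = pd.Series([np.nan, np.nan, 0.5, np.nan, np.nan, np.nan, np.nan, 1.0])

locations = []
run_start, run_len = None, 0
for idx, value in series.items():
    if pd.isnull(value):
        if run_len == 0:
            run_start = idx       # a new NaN run begins here
        run_len += 1
    else:
        if run_len >= 3:          # run just ended; record it if long enough
            locations.append((run_start, run_len))
        run_len = 0
if run_len >= 3:                  # flush a run that reaches the last row
    locations.append((run_start, run_len))
print(locations)                  # [(3, 4)]
```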

Sorry this is a newb answer, but it might work. Let me know if you find something faster, as I'd love to add it to my knowledge base.

Best of luck,

Andy M

A. McMaster
  • Thanks Andy, but this would indeed be way too slow, as my dataframes contain > 500000 rows – Noque Sep 01 '17 at 21:17