
I need to find the 'participant_id' values that have blocks of identical elements (NaNs) of at least a certain length. For example, consider the following df:

       summary  participant_id
13865      3.0              28
13995      NaN              28
14050      3.0              28
14219      5.0              28
14346      NaN              28
14364      4.0              28
14456      4.0              28
14680      NaN              28
14733      3.0              28
14913      2.0              28
15007      4.0              28
15107      4.0              28
15280      NaN              28
15287      3.0              28
15420      2.0              28
15521      2.0              28
15756      NaN              28
15758      3.0              28
15973      NaN              28
16038      4.0              28
16079      6.0              28
16215      4.0              28
16412      NaN              28
16506      6.0              28
16543      6.0              28
16649      2.0              28
16811      NaN              28
16911      NaN              28
16928      3.0              28
17028      2.0              28
11582      NaN              27
11718      2.0              27
11843      NaN              27
11941      2.0              27
12053      NaN              27
12142      NaN              27
12269      NaN              27
12367      4.0              27
12510      NaN              27
12632      NaN              27
12732      NaN              27
12796      2.0              27
12946      NaN              27
13059      NaN              27
13126      2.0              27
13312      NaN              27
13394      3.0              27
13427      2.0              27
13618      NaN              27
13707      NaN              27
13832      NaN              27
13945      NaN              27
14087      NaN              27
14199      NaN              27
14299      NaN              27
14398      NaN              27
14520      NaN              27
14639      NaN              27
14759      NaN              27
14897      NaN              27
15013      NaN              27
15116      NaN              27
15182      3.0              27
15319      NaN              27
15437      NaN              27
15518      3.0              27
15695      NaN              27
15812      NaN              27
15821      2.0              27
15933      2.0              27

If I'm interested in blocks of more than 4 consecutive NaNs, then the only option is participant_id = 27; if I want blocks of length 2, then the answer is participant_id = [27, 28].

I was trying to follow a similar solution, but it didn't work.
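A minimal reproducible frame for testing (my own toy data, not the full frame above) could look like this:

```python
import numpy as np
import pandas as pd

# Toy frame: participant 27 has a run of 3 consecutive NaNs,
# participant 28 only an isolated single NaN.
df = pd.DataFrame({
    'summary':        [3.0, np.nan, 3.0, np.nan, np.nan, np.nan, 2.0],
    'participant_id': [28, 28, 28, 27, 27, 27, 27],
})

# With a block length of 2 the expected answer is [27];
# with a block length of 1 it is [28, 27].
```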

Arnold Klein

3 Answers


You can count consecutive NaNs with a custom function and groupby:

N = 4

def f(x):
    a = x.isnull()
    # running length of the current NaN run; it hits N exactly when a run reaches length N
    return a.cumsum() - a.cumsum().where(~a).ffill().fillna(0) == N

mask = df.groupby('participant_id', sort=False)['summary'].apply(f)
L = df.loc[mask, 'participant_id'].unique().tolist()
print(L)
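To see what the cumsum expression is doing, here is its intermediate result on a small hand-made Series (my own example):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0, np.nan])
a = s.isnull()
# cumulative NaN count minus that count frozen at the last non-NaN position
# = running length of the current NaN run (0 on non-NaN rows)
run = a.cumsum() - a.cumsum().where(~a).ffill().fillna(0)
print(run.tolist())  # [0.0, 1.0, 2.0, 3.0, 0.0, 1.0]
```

The run counter passes through every value up to the block length, so comparing with `== N` flags any block of length N or more.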

Alternative solution:

from functools import reduce

N = 4
nulls = df['summary'].isnull()
# expanding apply keeps a running count that resets to 0 on every non-NaN value
df1 = nulls.groupby(df['participant_id']).expanding() \
           .apply(lambda i: reduce(lambda x, y: x + 1 if y == 1 else 0, i, 0))

L = df1[df1 == N].index.get_level_values(0).unique().tolist()
print(L)
jezrael

groupby lets you work with each participant's data individually. Then you can count the runs in any manner you want. A clear and simple approach, without using the power of Pandas, may look like this:

import pandas as pd

block_size = 4
for name, gr_data in data.groupby("participant_id"):
    counter = 0
    for value in gr_data["summary"]:
        if pd.isna(value):  # `value is None` would never match NaN floats
            counter += 1
            if counter >= block_size:
                print("%s has block of NaN of length >= %d" % (str(name), block_size))
                break
        else:
            counter = 0
Leonid Mednikov
import numpy as np

def null_blocks(x, n):
    isnull = np.isnan(x.values)
    if not isnull.any():
        return False
    nextnot = np.append(~isnull[1:], True)
    csum = isnull.cumsum()
    # cumulative NaN counts taken at the last NaN of each block; prepending 0
    # keeps the first block's length in the diff
    return np.diff(csum[isnull & nextnot], prepend=0).max() >= n

def which_ids(k):
    return [n for n, g in df.groupby('participant_id').summary if null_blocks(g, k)]

demo

which_ids(2)

[27, 28]

which_ids(4)

[27]

how it works

  • null_blocks
    • Start with a boolean array of where the values are NaN, via np.isnan.
    • Since bool is a subclass of int, we can accumulate the flags with cumsum.
    • Identify where a block ends by negating isnull and shifting it one position. Where nextnot and isnull are both True, that is the last NaN of a block.
    • Slice csum at the block ends and take the differences; this gives the size of each block.
    • If the maximum block size is at least the threshold n, return True.
  • which_ids
    • Use a list comprehension over a groupby object.
    • Return only the group names whose group contains a NaN block of at least the threshold length.
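Traced on a tiny hand-made array (my own example, not data from the question), the pieces look like this:

```python
import numpy as np

x = np.array([3.0, np.nan, np.nan, 4.0, np.nan])
isnull = np.isnan(x)                    # [F, T, T, F, T]
nextnot = np.append(~isnull[1:], True)  # True where the *next* value is not NaN
csum = isnull.cumsum()                  # [0, 1, 2, 2, 3]
ends = csum[isnull & nextnot]           # cumulative counts at block ends: [2, 3]
sizes = np.diff(ends, prepend=0)        # block lengths: [2, 1]
print(sizes)
```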
piRSquared
  • thanks for your solution, but it fails and gives errors. I'm sure I can tweak your solution, but jezrael's worked just fine. Thanks again! – Arnold Klein Apr 19 '17 at 17:30