Count consecutive missing values (nan) in more columns with a condition

Question

I want to count the length of time a server is stopped from a dataset. I know the downtime but not the duration.

I have this df:

index                   a          b     c     reboot
2018-06-25 12:51:00    NaN        NaN   NaN     1      
2018-06-25 12:52:00    NaN        NaN   NaN     0    
2018-06-25 12:53:00    NaN        NaN   NaN     0  
2018-06-25 12:54:00    NaN        NaN   NaN     0    
2018-06-25 12:55:00    NaN        NaN   NaN     0    
2018-06-25 12:56:00    NaN        NaN   NaN     0   
2018-06-25 12:57:00    NaN        NaN   NaN     0   
2018-06-25 12:58:00    NaN        0.6   0.6     0
2018-06-25 12:59:00    NaN        NaN   0.5     0  
2018-06-25 13:00:00    NaN        NaN   0.3     0  
2018-06-25 13:01:00   2.55  94.879997  0.23     0
2018-06-25 13:02:00   1.17        Nan  0.13     0
2018-06-25 13:03:00   1.08  98.199997  0.10     0
2018-06-25 13:28:00    NaN        NaN   NaN     1  
2018-06-25 13:29:00    NaN        NaN   NaN     0     
2018-06-25 13:30:00    NaN        NaN   NaN     0
2018-06-25 13:31:00    NaN        NaN   NaN     0
2018-06-25 13:31:00    0.5        0.2   0.1     0
2018-06-25 13:32:00    NaN        NaN   NaN     0 
2018-06-25 13:33:00    NaN        NaN   NaN     0 
2018-06-25 13:34:00     3         0.6   0.5     0

I want to count the rows where a, b and c are all NaN and reboot == 1, with the result in this form:

index                    period      reboot
2018-06-25 12:51:00         7           1
2018-06-25 13:28:00         4           1

I have already tried doing it column by column without the reboot condition.

Input:

index                   a          b     c     reboot
2018-06-25 12:51:00    NaN        NaN   NaN     1      
2018-06-25 12:52:00    NaN        NaN   NaN     0    
2018-06-25 12:53:00    NaN        NaN   NaN     0  
2018-06-25 12:54:00    NaN        NaN   NaN     0    
2018-06-25 12:55:00    NaN        NaN   NaN     0    
2018-06-25 12:56:00    NaN        NaN   NaN     0   
2018-06-25 12:57:00    NaN        NaN   NaN     0   
2018-06-25 12:58:00    NaN        NaN   NaN     0
2018-06-25 12:59:00    NaN        NaN   NaN     0  
2018-06-25 13:00:00    NaN        NaN   NaN     0  
2018-06-25 13:01:00   2.55  94.879997  0.23     0
2018-06-25 13:02:00   1.17        Nan  0.13     0
2018-06-25 13:03:00   1.08  98.199997  0.10     0
2018-06-25 13:28:00    NaN        NaN   NaN     1  
2018-06-25 13:29:00    NaN        NaN   NaN     0     
2018-06-25 13:30:00    NaN        NaN   NaN     0

a=df.index
b=df.b.values
idx0 = np.flatnonzero(np.r_[True, np.diff(np.isnan(b))!=0,True])
count = np.diff(idx0)
idx = idx0[:-1]
valid_mask = (count>=step) & np.isnan(b[idx])
out_idx = idx[valid_mask]
out_num = a[out_idx]
out_count = count[valid_mask]
outb = zip(out_num, out_count)
periodb=list(outb)

Result :

'[(Timestamp('2018-06-25 12:51:00'), 10),
 (Timestamp('2018-06-25 13:28:00'), 3),'

Is always `1` in first row of only missing groups in `reboot` column? — jezrael, Apr 01 '19 at 07:43
yes the reboot started in 1 but we do not know when it ends just we have new values (a,b,c)and we say that the server has started — MoMo, Apr 01 '19 at 07:59
(the reboot it stops when we detect the first new value in a or b or c or all — MoMo, Apr 01 '19 at 08:40
There are 3 only NaNs columns groups (starting `12:51:00`, `13:28` and `13:32:00`), `reboot` is always 1 only for first row of this groups? Or possible e.g. in `12:54:00` ? — jezrael, Apr 01 '19 at 08:43
in the case of 13:32:00 we have just a missing values but is not a reboot because reboot==0 the first reboot in 12:51:00 lasted 7 minutes 12:58:00 i know the server is turned on because i have new value in b and c — MoMo, Apr 01 '19 at 08:59

score 0 · Answer 1 · answered Apr 01 '19 at 08:09

Add another column with "normal" indexes (integers counting up from 0), select the interesting rows, and then find the differences between adjacent values in the added column - since those differences will give you the distances between the rows in the original data.

Something like:

numbered = df.assign(row=range(len(df)))
restarts = numbered[numbered.reboot == 1]
result = restarts.row.shift(-1) - restarts.row

(Reading a little more closely, it seems like part of the problem is only counting the rows with NaN for all the a, b, c values. To do that, filter out all the other rows first, before adding the secondary index column.)

thank you, that gives me an idea just (the reboot it stops when we detect the first new value in a or b or c or all — MoMo, Apr 01 '19 at 08:39

Count consecutive missing values (nan) in more columns with a condition

1 Answers1