0

I want to count the length of time a server is stopped from a dataset. I know the downtime but not the duration.

I have this df:

index                   a          b     c     reboot
2018-06-25 12:51:00    NaN        NaN   NaN     1      
2018-06-25 12:52:00    NaN        NaN   NaN     0    
2018-06-25 12:53:00    NaN        NaN   NaN     0  
2018-06-25 12:54:00    NaN        NaN   NaN     0    
2018-06-25 12:55:00    NaN        NaN   NaN     0    
2018-06-25 12:56:00    NaN        NaN   NaN     0   
2018-06-25 12:57:00    NaN        NaN   NaN     0   
2018-06-25 12:58:00    NaN        0.6   0.6     0
2018-06-25 12:59:00    NaN        NaN   0.5     0  
2018-06-25 13:00:00    NaN        NaN   0.3     0  
2018-06-25 13:01:00   2.55  94.879997  0.23     0
2018-06-25 13:02:00   1.17        Nan  0.13     0
2018-06-25 13:03:00   1.08  98.199997  0.10     0
2018-06-25 13:28:00    NaN        NaN   NaN     1  
2018-06-25 13:29:00    NaN        NaN   NaN     0     
2018-06-25 13:30:00    NaN        NaN   NaN     0
2018-06-25 13:31:00    NaN        NaN   NaN     0
2018-06-25 13:31:00    0.5        0.2   0.1     0
2018-06-25 13:32:00    NaN        NaN   NaN     0 
2018-06-25 13:33:00    NaN        NaN   NaN     0 
2018-06-25 13:34:00     3         0.6   0.5     0 

I want to count the rows where a, b and c are all NaN and reboot == 1, with the result in this form:

index                    period      reboot
2018-06-25 12:51:00         7           1
2018-06-25 13:28:00         4           1

I have already tried doing it column by column without the reboot condition.

Input:

index                   a          b     c     reboot
2018-06-25 12:51:00    NaN        NaN   NaN     1      
2018-06-25 12:52:00    NaN        NaN   NaN     0    
2018-06-25 12:53:00    NaN        NaN   NaN     0  
2018-06-25 12:54:00    NaN        NaN   NaN     0    
2018-06-25 12:55:00    NaN        NaN   NaN     0    
2018-06-25 12:56:00    NaN        NaN   NaN     0   
2018-06-25 12:57:00    NaN        NaN   NaN     0   
2018-06-25 12:58:00    NaN        NaN   NaN     0
2018-06-25 12:59:00    NaN        NaN   NaN     0  
2018-06-25 13:00:00    NaN        NaN   NaN     0  
2018-06-25 13:01:00   2.55  94.879997  0.23     0
2018-06-25 13:02:00   1.17        Nan  0.13     0
2018-06-25 13:03:00   1.08  98.199997  0.10     0
2018-06-25 13:28:00    NaN        NaN   NaN     1  
2018-06-25 13:29:00    NaN        NaN   NaN     0     
2018-06-25 13:30:00    NaN        NaN   NaN     0


a=df.index
b=df.b.values
idx0 = np.flatnonzero(np.r_[True, np.diff(np.isnan(b))!=0,True])
count = np.diff(idx0)
idx = idx0[:-1]
valid_mask = (count>=step) & np.isnan(b[idx])
out_idx = idx[valid_mask]
out_num = a[out_idx]
out_count = count[valid_mask]
outb = zip(out_num, out_count)
periodb=list(outb)

Result :

'[(Timestamp('2018-06-25 12:51:00'), 10),
 (Timestamp('2018-06-25 13:28:00'), 3),'
gmds
  • 19,325
  • 4
  • 32
  • 58
MoMo
  • 107
  • 5
  • Is always `1` in first row of only missing groups in `reboot` column? – jezrael Apr 01 '19 at 07:43
  • yes the reboot started in 1 but we do not know when it ends just we have new values (a,b,c)and we say that the server has started – MoMo Apr 01 '19 at 07:59
  • (the reboot it stops when we detect the first new value in a or b or c or all – MoMo Apr 01 '19 at 08:40
  • There are 3 only NaNs columns groups (starting `12:51:00`, `13:28` and `13:32:00`), `reboot` is always 1 only for first row of this groups? Or possible e.g. in `12:54:00` ? – jezrael Apr 01 '19 at 08:43
  • in the case of 13:32:00 we have just a missing values but is not a reboot because reboot==0 the first reboot in 12:51:00 lasted 7 minutes 12:58:00 i know the server is turned on because i have new value in b and c – MoMo Apr 01 '19 at 08:59

1 Answers1

0

Add another column with "normal" indexes (integers counting up from 0), select the interesting rows, and then find the differences between adjacent values in the added column - since those differences will give you the distances between the rows in the original data.

Something like:

numbered = df.assign(row=range(len(df)))
restarts = numbered[numbered.reboot == 1]
result = restarts.row.shift(-1) - restarts.row

(Reading a little more closely, it seems like part of the problem is only counting the rows with NaN for all the a, b, c values. To do that, filter out all the other rows first, before adding the secondary index column.)

Karl Knechtel
  • 62,466
  • 11
  • 102
  • 153
  • thank you, that gives me an idea just (the reboot it stops when we detect the first new value in a or b or c or all – MoMo Apr 01 '19 at 08:39