How to find occurrence of consecutive events in python timeseries data frame?

Question

I have got a time series of meteorological observations with date and value columns:

df = pd.DataFrame({'date':['11/10/2017 0:00','11/10/2017 03:00','11/10/2017 06:00','11/10/2017 09:00','11/10/2017 12:00',
                           '11/11/2017 0:00','11/11/2017 03:00','11/11/2017 06:00','11/11/2017 09:00','11/11/2017 12:00',
                          '11/12/2017 00:00','11/12/2017 03:00','11/12/2017 06:00','11/12/2017 09:00','11/12/2017 12:00'],
                  'value':[850,np.nan,np.nan,np.nan,np.nan,500,650,780,np.nan,800,350,690,780,np.nan,np.nan],                   
                   'consecutive_hour': [ 3,0,0,0,0,3,6,9,0,3,3,6,9,0,0]})

With this DataFrame, I want a third column of consecutive_hours such that if the value in a particular timestamp is less than 1000, we give corresponding value in "consecutive-hours" of "3:00" hours and find consecutive such occurrence like 6:00 9:00 as above.

Lastly, I want to summarize the table counting consecutive hours occurrence and number of days such that the summary table looks like:

df_summary = pd.DataFrame({'consecutive_hours':[3,6,9,12],
                      'number_of_day':[2,0,2,0]})

I tried several online solutions and methods like shift(), diff() etc. as mentioned in:How to groupby consecutive values in pandas DataFrame

and more, spent several days but no luck yet.

I would highly appreciate help on this issue. Thanks!

Is `consecutive_hour` column the expected result? If yes, can you explain why `2017-11-11 06:00:00` and `12 2017-11-12 06:00:00` are 9 while `2017-11-10 06:00:00` is 0, please. — Corralien, Apr 29 '21 at 06:31
Yes, consecutive_hour column is the expected result. 2017-11-11 06:00:00 is 9 because if you look at the value column, you see there are values: 500,650 and 780 that occur consecutively in three rows till the time stamp 2017-11-11 06:00:00. same logic goes for 2017-11-12 06:00:00. And 2017-11-10 06:00:00 has 0 **consecutive hour** value because there are Nan Values in this time stamp and on 2017-11-10 03:00:00 . — Peshal1067, Apr 29 '21 at 06:46
Are your sure for `df_summary`? I found `{3: 3, 6: 2, 9: 2, 12: 0}` — Corralien, Apr 29 '21 at 08:45
Yes this is like: **consecutive_hour** == 3(alone) occurs on 2017-11-10 00:00:00 and 2017:11:11 12:00:00 for different day making count of 3 **consecutive hours** equals 2. Highest consecutive value for 2017-11-11 and 2017-11-12 are 9, similarly making count of 9 equals 2 .None of the day have highest **consecutive hours ** of 6 or 12; so they equal 0. — Peshal1067, Apr 29 '21 at 09:53
Sorry, I want to help you but I don't really understand the logic behind this. How 2017-11-11 can be the highest consecutive hours for 3 and 9? — Corralien, Apr 29 '21 at 11:47
From 2017 -11-11 00:00::00 to 2017-11-11 06:00:00 we have continuous observation and for this I would take max consecutive hours for the day equal 9. and there is a break at 09:00:00. At 12:00:00 there is 3 consecutive hours. I want the output with consecutive hours of 9 and 3 for such day 2017-11-11.I tried using: 'df['daily_consecutive_hour'] = df[['consecutive_hour']].resample('D').max()' which gives 9 for 2017-11-11.I could not take into account the last 3 consecutive hours in this code. Sorry that, the problem is little confusing. — Peshal1067, Apr 29 '21 at 12:15

Corralien · Accepted Answer · 2021-04-29T14:19:15.980

Input data:

>>> df
                  date  value
0  2017-11-10 00:00:00  850.0
1  2017-11-10 03:00:00    NaN
2  2017-11-10 06:00:00    NaN
3  2017-11-10 09:00:00    NaN
4  2017-11-10 12:00:00    NaN
5  2017-11-11 00:00:00  500.0
6  2017-11-11 03:00:00  650.0
7  2017-11-11 06:00:00  780.0
8  2017-11-11 09:00:00    NaN
9  2017-11-11 12:00:00  800.0
10 2017-11-12 00:00:00  350.0
11 2017-11-12 03:00:00  690.0
12 2017-11-12 06:00:00  780.0
13 2017-11-12 09:00:00    NaN
14 2017-11-12 12:00:00    NaN

The cumcount_reset function is adapted from this answer of @jezrael:
Python pandas cumsum with reset everytime there is a 0

cumcount_reset = \
    lambda b: b.cumsum().sub(b.cumsum().where(~b).ffill().fillna(0)).astype(int)

df["consecutive_hour"] = (df.set_index("date")["value"] < 1000) \
       .groupby(pd.Grouper(freq="D")) \
       .apply(lambda b: cumcount_reset(b)).mul(3) \
       .reset_index(drop=True)

Output result:

>>> df
                  date  value  consecutive_hour
0  2017-11-10 00:00:00  850.0                 3
1  2017-11-10 03:00:00    NaN                 0
2  2017-11-10 06:00:00    NaN                 0
3  2017-11-10 09:00:00    NaN                 0
4  2017-11-10 12:00:00    NaN                 0
5  2017-11-11 00:00:00  500.0                 3
6  2017-11-11 03:00:00  650.0                 6
7  2017-11-11 06:00:00  780.0                 9
8  2017-11-11 09:00:00    NaN                 0
9  2017-11-11 12:00:00  800.0                 3
10 2017-11-12 00:00:00  350.0                 3
11 2017-11-12 03:00:00  690.0                 6
12 2017-11-12 06:00:00  780.0                 9
13 2017-11-12 09:00:00    NaN                 0
14 2017-11-12 12:00:00    NaN                 0

Summary table

df_summary = df.loc[df.groupby(pd.Grouper(key="date", freq="D"))["consecutive_hour"] \
                      .apply(lambda h: (h - h.shift(-1).fillna(0)) > 0), 
                    "consecutive_hour"] \
               .value_counts().reindex([3, 6, 9, 12], fill_value=0) \
               .rename("number_of_day") \
               .rename_axis("consecutive_hour") \
               .reset_index()

>>> df_summary
   consecutive_hour  number_of_day
0                 3              2
1                 6              0
2                 9              2
3                12              0

@ Corralien , I can not thank you more. You literally saved my day. Can you please assist me on the second part of my question too? I need to find summary table too: df_summary = pd.DataFrame({'consecutive_hours':[3,6,9,12], 'number_of_day':[2,0,2,0]}) — Peshal1067, Apr 29 '21 at 09:23
Don't forget to upvote too if the answer helps you and you did not yet. — Corralien, Apr 29 '21 at 14:25
Thanks a lot again @Corralien .Worked well. Great help. Highly appreciated. I have upvoted. — Peshal1067, Apr 29 '21 at 15:05
~@ Corralien~ I have posted new problem: https://stackoverflow.com/questions/67333038/finding-start-time-and-end-time-of-events-in-a-day-pandas-timeseries-such-th?noredirect=1#comment119015652_67333038 , I would be glad to have your input too. Thanks — Peshal1067, Apr 30 '21 at 13:05

How to find occurrence of consecutive events in python timeseries data frame?

1 Answers1