count successive occurences in a time series without df.iterrows()

Question

Given a dataframe with a time series like this:

time	event
2020-01-01 12:00:00	1
2020-01-01 12:00:01	NaN
2020-01-01 12:00:02	1
2020-01-01 12:00:03	1
2020-01-01 12:00:04	NaN
2020-01-01 12:00:05	NaN
2020-01-01 12:00:06	1
2020-01-01 12:00:07	Nan

I would like to get a summary-dataframe like:

event_id	time_start	time_stop
1	2020-01-01 12:00:00	2020-01-01 12:00:01
2	2020-01-01 12:00:02	2020-01-01 12:00:04
3	2020-01-01 12:00:06	2020-01-01 12:00:07

In a step-by-step approach I think I should first add an empty column 'event_i', then fill in the index of the events (1,2,3,...) Once this works, I can try to create a summary-dataframe. I am already stuck at giving the index to the events.

I could work something out with df.iterrows() but that is not recommended. How can I vectorize this indexing-procedure?

import pandas as pd
import numpy as np
# define mini-dataset as an example
data= {'time': ['2020-01-01 12:00:00', '2020-01-01 12:00:01', '2020-01-01 12:00:02','2020-01-01 12:00:03',
              '2020-01-01 12:00:04','2020-01-01 12:00:05', '2020-01-01 12:00:06', '2020-01-01 12:00:07',
              '2020-01-01 12:00:08', '2020-01-01 12:00:09','2020-01-01 12:00:10'],
     'event': [1,np.nan,1,1,np.nan,np.nan,1,np.nan,1,1,np.nan]}
df = pd.DataFrame(data)
df['time']=pd.to_datetime((df['time']))

# give a sequential number to each event
df['event_i'] = np.nan

# for each event-number, group by and stack: event_id,  time_start time_stop
# ...

score 1 · Accepted Answer · answered Aug 06 '23 at 14:45

Code

# Create a grouper to mark the intervals of successive events
m = df['event'].isna()
b = m.cumsum().mask(m).ffill(limit=1)

# group the time column by the grouper and agregate with first and last
df1 = df['time'].groupby(b).agg(['first', 'last']).reset_index(drop=True)

# Create event id column
df1['event_id'] = df1.index + 1

                first                last  event_id
0 2020-01-01 12:00:00 2020-01-01 12:00:01         1
1 2020-01-01 12:00:02 2020-01-01 12:00:04         2
2 2020-01-01 12:00:06 2020-01-01 12:00:07         3
3 2020-01-01 12:00:08 2020-01-01 12:00:10         4

Vitalizzare · Answer 2 · 2023-08-06T15:53:43.037

I think we can do better without cumulative operations and grouping, as in the previous answer. All we need is shifting and comparison, which are easy and can be vectorized:

# transform 1 to True, Nan to False
event = df['event'].notna()
previous = event.shift()

# mark start and stop point by comparing neighbors
start = (event > previous)
stop = (event < previous)

# care about corner cases at the first and last positions
start.iloc[0] = event.iloc[0]
stop.iloc[-1] = event.iloc[-1]

# extract data by prepared indexers
answer = pd.DataFrame({
    'time_start': time[start].values,
    'time_stop': time[stop].values})

count successive occurences in a time series without df.iterrows()

2 Answers2

Code