(1) Counting rows
Assuming all rows are separated by the same time step, as in the input data (one row per second), a maximum gap of 2 seconds means exactly one zero, not more, between two ones:
- [1,0,1] gets filled as [1,1,1]
- [1,0,0,1] stays as [1,0,0,1]
In that case, a rather simple one-liner exists using .shift:
import numpy as np
import pandas as pd
# input & expected data
data = pd.DataFrame(data={'my_event': [0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 1.],
'expected': [0., 0., 1., 1., 1., 1., 0., 0., 0., 1., 1.]},
index=pd.date_range(start='2023-01-01 00:00:00', end='2023-01-01 00:00:10', freq='s'))
# solution
data['filled'] = np.where((data['my_event']==1) | ((data['my_event'].shift(-1)==1) & (data['my_event'].shift(1)==1)), 1 , 0)
Output:
my_event expected filled
2023-01-01 00:00:00 0.0 0.0 0
2023-01-01 00:00:01 0.0 0.0 0
2023-01-01 00:00:02 1.0 1.0 1
2023-01-01 00:00:03 1.0 1.0 1
2023-01-01 00:00:04 0.0 1.0 1
2023-01-01 00:00:05 1.0 1.0 1
2023-01-01 00:00:06 0.0 0.0 0
2023-01-01 00:00:07 0.0 0.0 0
2023-01-01 00:00:08 0.0 0.0 0
2023-01-01 00:00:09 1.0 1.0 1
2023-01-01 00:00:10 1.0 1.0 1
Only the row at 00:00:04 (the fifth row) is filled, which is the desired output.
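As a sketch, the same shift logic can be wrapped in a small helper and checked against the two cases listed above. The function name fill_single_row_gaps is hypothetical, and the helper assumes a constant time step, exactly like the one-liner.

```python
import pandas as pd

def fill_single_row_gaps(events: pd.Series) -> pd.Series:
    """Set a 0 to 1 when both the previous and the next row are 1."""
    is_one = events.eq(1)
    # shift(1) looks at the previous row, shift(-1) at the next one;
    # fill_value=False keeps the first and last rows from being filled.
    bridged = is_one.shift(1, fill_value=False) & is_one.shift(-1, fill_value=False)
    return (is_one | bridged).astype(int)

print(fill_single_row_gaps(pd.Series([1, 0, 1])).tolist())     # [1, 1, 1]
print(fill_single_row_gaps(pd.Series([1, 0, 0, 1])).tolist())  # [1, 0, 0, 1]
```

Working on booleans avoids the np.where call, but the result is the same as the one-liner.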
(2) Alternatively, counting time
Now, without assuming a constant time step across the whole dataset, I have not found a simple-looking method, only one that works. Step by step:
- Get the gap size: take the subset of rows with a detection (value 1 only), then compute the time difference from one row to the next. It appears diff works on a column, not on the index, so first copy the datetime index into a column
- Merge both datasets
- Back-fill the gap size information over each gap
- Replace zeroes with ones where the gap size is at most an arbitrary maximum.
It may look like a lot of steps, but on the upside, because this is not counting rows but datetime intervals, this method is robust against changes in acquisition frequency or missing time points.
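The four steps listed above can be collected into one function. This is only a sketch: the name fill_short_gaps and its signature are hypothetical, it assumes the datetime index has no duplicate timestamps, and it uses .to_series() and .reindex() in place of the temporary time column and the merge.

```python
import pandas as pd

def fill_short_gaps(events: pd.Series, max_interval: pd.Timedelta) -> pd.Series:
    """Fill zeros lying between two detections at most max_interval apart."""
    # Step 1: time elapsed between consecutive detections (rows equal to 1)
    detection_times = events[events == 1].index.to_series()
    gap_size = detection_times.diff()
    # Step 2: align the gap sizes with the full index (NaT where no detection)
    gap_size = gap_size.reindex(events.index)
    # Step 3: back-fill so each zero carries the size of the gap it sits in
    gap_size = gap_size.bfill()
    # Step 4: fill zeros in small gaps, skipping rows before the first detection
    after_first = events.cumsum() > 0
    to_fill = (events == 0) & (gap_size <= max_interval) & after_first
    return events.mask(to_fill, 1)

index = pd.date_range(start='2023-01-01 00:00:00', end='2023-01-01 00:00:10', freq='s')
my_event = pd.Series([0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 1.], index=index)
filled = fill_short_gaps(my_event, pd.Timedelta(seconds=2))
print(filled.tolist())
# [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0]
```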
Step 1 get gap size:
# input_data is assumed to hold the same my_event and expected columns as data in (1)
# copy time from index to column for use by .diff()
input_data['time'] = input_data.index
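# Subset: rows with detection only (.copy() makes it independent of input_data,
# so adding a column to it below should not raise a SettingWithCopyWarning)
Diffs = input_data[['time']].loc[input_data['my_event']==1].copy()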
# Calculate time interval between consecutive detections
Diffs['gap_size'] = Diffs['time'].diff()
# Output:
Diffs
time gap_size
2023-01-01 00:00:02 2023-01-01 00:00:02 NaT
2023-01-01 00:00:03 2023-01-01 00:00:03 0 days 00:00:01
2023-01-01 00:00:05 2023-01-01 00:00:05 0 days 00:00:02
2023-01-01 00:00:09 2023-01-01 00:00:09 0 days 00:00:04
2023-01-01 00:00:10 2023-01-01 00:00:10 0 days 00:00:01
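If copying the index into a column feels clumsy, one possible alternative is to convert the index of the detection rows to a Series with .to_series(), on which .diff() is available. This is a sketch of that variant, not the method used in the rest of this answer; the variable names detections and gap_size are only illustrative.

```python
import pandas as pd

index = pd.date_range(start='2023-01-01 00:00:00', end='2023-01-01 00:00:10', freq='s')
input_data = pd.DataFrame(
    {'my_event': [0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 1.]}, index=index
)

# Keep only the timestamps of the detection rows
detections = input_data.index[(input_data['my_event'] == 1).to_numpy()]
# Turn them into a Series so that .diff() gives the time between detections
gap_size = detections.to_series().diff()
print(gap_size)
```

The first value is NaT because the first detection has no predecessor; the others are 1, 2, 4 and 1 seconds, as in the table above.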
Step 2 merge both datasets:
df = pd.concat([input_data, Diffs['gap_size']], axis=1).drop(['time'], axis=1)
df
my_event expected gap_size
2023-01-01 00:00:00 0.0 0.0 NaT
2023-01-01 00:00:01 0.0 0.0 NaT
2023-01-01 00:00:02 1.0 1.0 NaT
2023-01-01 00:00:03 1.0 1.0 0 days 00:00:01
2023-01-01 00:00:04 0.0 1.0 NaT
2023-01-01 00:00:05 1.0 1.0 0 days 00:00:02
2023-01-01 00:00:06 0.0 0.0 NaT
2023-01-01 00:00:07 0.0 0.0 NaT
2023-01-01 00:00:08 0.0 0.0 NaT
2023-01-01 00:00:09 1.0 1.0 0 days 00:00:04
2023-01-01 00:00:10 1.0 1.0 0 days 00:00:01
Step 3 back-fill:
df['fill_gap_size'] = df['gap_size'].bfill()
df
my_event expected gap_size fill_gap_size
2023-01-01 00:00:00 0.0 0.0 NaT 0 days 00:00:01
2023-01-01 00:00:01 0.0 0.0 NaT 0 days 00:00:01
2023-01-01 00:00:02 1.0 1.0 NaT 0 days 00:00:01
2023-01-01 00:00:03 1.0 1.0 0 days 00:00:01 0 days 00:00:01
2023-01-01 00:00:04 0.0 1.0 NaT 0 days 00:00:02
2023-01-01 00:00:05 1.0 1.0 0 days 00:00:02 0 days 00:00:02
2023-01-01 00:00:06 0.0 0.0 NaT 0 days 00:00:04
2023-01-01 00:00:07 0.0 0.0 NaT 0 days 00:00:04
2023-01-01 00:00:08 0.0 0.0 NaT 0 days 00:00:04
2023-01-01 00:00:09 1.0 1.0 0 days 00:00:04 0 days 00:00:04
2023-01-01 00:00:10 1.0 1.0 0 days 00:00:01 0 days 00:00:01
Step 4 conditional fill with defined max interval:
# define arbitrary max interval
max_interval = np.timedelta64(2, 's')
# duplicate original signal
df['fill_event'] = df['my_event']
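# cumulative count of detections: still 0 on the rows before the first detection,
# used in the conditional substitution to avoid filling those top rows
df['cumsum'] = df['my_event'].cumsum()
# conditional substitution: the row is a zero, it sits in a small gap,
# and at least one detection comes before it
df.loc[(df['my_event']==0)
       & (df['fill_gap_size']<=max_interval)
       & (df['cumsum']>0), 'fill_event'] = 1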
# remove temporary columns
df.drop(['gap_size','fill_gap_size','cumsum'],axis=1, inplace=True)
df
my_event expected fill_event
2023-01-01 00:00:00 0.0 0.0 0.0
2023-01-01 00:00:01 0.0 0.0 0.0
2023-01-01 00:00:02 1.0 1.0 1.0
2023-01-01 00:00:03 1.0 1.0 1.0
2023-01-01 00:00:04 0.0 1.0 1.0
2023-01-01 00:00:05 1.0 1.0 1.0
2023-01-01 00:00:06 0.0 0.0 0.0
2023-01-01 00:00:07 0.0 0.0 0.0
2023-01-01 00:00:08 0.0 0.0 0.0
2023-01-01 00:00:09 1.0 1.0 1.0
2023-01-01 00:00:10 1.0 1.0 1.0
So, fill_event == expected: success!
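To illustrate the robustness against missing time points, here is a more compact variant tried on made-up, irregularly sampled data. It is a different formulation from the steps above: for each row it looks up the previous and the next detection time with ffill/bfill and compares their distance to the maximum. The function name fill_gaps_by_neighbours is hypothetical, and unique timestamps are assumed.

```python
import pandas as pd

def fill_gaps_by_neighbours(events: pd.Series, max_interval: pd.Timedelta) -> pd.Series:
    """Fill zeros whose previous and next detections are at most max_interval apart."""
    times = events.index.to_series()
    detection_times = times.where(events == 1)    # NaT on non-detection rows
    previous_detection = detection_times.ffill()  # NaT before the first detection
    next_detection = detection_times.bfill()      # NaT after the last detection
    # Comparisons involving NaT are False, so leading and trailing zeros stay zero
    small_gap = (next_detection - previous_detection) <= max_interval
    return events.mask((events == 0) & small_gap, 1)

# Made-up data: the rows between 00:00:03 and 00:00:10 are missing
index = pd.to_datetime([
    '2023-01-01 00:00:00', '2023-01-01 00:00:01', '2023-01-01 00:00:02',
    '2023-01-01 00:00:03', '2023-01-01 00:00:10', '2023-01-01 00:00:11',
])
events = pd.Series([1., 0., 1., 0., 1., 1.], index=index)
print(fill_gaps_by_neighbours(events, pd.Timedelta(seconds=2)).tolist())
# [1.0, 1.0, 1.0, 0.0, 1.0, 1.0]
```

The zero at 00:00:03 is a single row between two ones, so the row-counting approach of (1) would fill it, but its neighbouring detections are 8 seconds apart, so the time-based approach leaves it alone.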