How to fill gaps in anomaly detection data using pandas?

Question

Assume I have a pandas DataFrame that consists only of 0s and 1s, depending on whether an anomaly was detected or not:

import pandas as pd

input_data = pd.DataFrame(data={'my_event': [0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 1.]},
                          index=pd.date_range(start='2023-01-01 00:00:00', end='2023-01-01 00:00:10', freq='s'))

Now I would like to fill gaps in the detection depending on their size. E.g. I only want to fill gaps that are 2 seconds or shorter. What is the correct way to do something like this?

I found these questions here: 1, 2, 3, but the solutions do not seem very straightforward. It feels like there should be a simpler way to solve an issue like this.

EDIT

Sorry for the imprecise question! In my case, a "gap" is a short time period where no anomaly was detected, inside a larger time range that was detected as an anomaly.

For the example input_data the expected output would be a DataFrame with the following data

[0., 0., 1., 1., 1., 1., 0., 0., 0., 1., 1.]

So in this example the single 0. inside the region of ones was replaced by a one. Obviously all zeros could also be replaced by NaNs, if that would help. I just need to be able to specify the length of the gap that should be filled.

@ScottBoston I updated the question and included the expected output. — Axel, Jul 25 '23 at 19:24
How is the accepted answer correct? It fills slightly larger gaps half-way. It is more obvious with an input_data in the form of 'my_event': [0., 1., 1., 0., 1., 0., 0., 1., 0., 0., 0, 1., 1.], with gaps of respectively 1, 2 and 3 rows. — OCa, Jul 31 '23 at 06:49

ragas · Answer 1 · 2023-07-26T12:48:10.660


Please find an alternative answer below. It flags every zero whose previous and next rows are both 1, i.e. a gap exactly one row (one second) wide, and adds that flag to the original column. I hope this is what you are looking for; from your question it's not very clear.

import numpy as np
import pandas as pd

df = pd.DataFrame(data={'my_event': [0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 1.]},
                          index=pd.date_range(start='2023-01-01 00:00:00', end='2023-01-01 00:00:10', freq='s'))
df = df.reset_index().rename(columns={'index': 'date'})
df['flag'] = np.where(
    (df['my_event'] == 0.0) &
    (df['my_event'].shift(1) == 1.0) & 
    (df['my_event'].shift(-1) == 1.0),
    1,
    0
)
df['final'] = df['my_event'] + df['flag']
print(df)

 

                  date  my_event  flag  final
0  2023-01-01 00:00:00       0.0     0    0.0
1  2023-01-01 00:00:01       0.0     0    0.0
2  2023-01-01 00:00:02       1.0     0    1.0
3  2023-01-01 00:00:03       1.0     0    1.0
4  2023-01-01 00:00:04       0.0     1    1.0
5  2023-01-01 00:00:05       1.0     0    1.0
6  2023-01-01 00:00:06       0.0     0    0.0
7  2023-01-01 00:00:07       0.0     0    0.0
8  2023-01-01 00:00:08       0.0     0    0.0
9  2023-01-01 00:00:09       1.0     0    1.0
10 2023-01-01 00:00:10       1.0     0    1.0

edited Jul 26 '23 at 12:48

answered Jul 25 '23 at 18:24

ragas


Hi thanks for your answer! In your output there is a one at 00:00:01 as well as from 00:00:06 to 00:00:08. These should however not be filled as they are not small patches of zeros inside a larger range of ones. – Axel Jul 25 '23 at 19:29
Good use of shift, assuming the acquisition time step does not change. Why the reset_index, though? – OCa Jul 27 '23 at 12:12
@OCa: I'm not comfortable with `Index` as date. I like all the variables as part of main dataframe rather than index. I'm not using `date` anywhere in the code. – ragas Jul 28 '23 at 11:42
thx for reply. Indeed, you were not using the date. This is why I was surprised to see it mentioned in the code at all, from the perspective of an answer, looking for minimal code. – OCa Jul 28 '23 at 12:04

Smordy · Accepted Answer · 2023-07-26T12:32:49.477

I don't know if I understood you well, but to fill gaps in the detection that are 2 seconds or shorter, you can do this:

import pandas as pd

input_data = pd.DataFrame(data={'my_event': [0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 1.]},
                          index=pd.date_range(start='2023-01-01 00:00:00', end='2023-01-01 00:00:10', freq='s'))

# Find consecutive sequences of 1's
sequences = (input_data['my_event'] == 1).cumsum()

# Calculate the time difference between consecutive events
time_diff = input_data.index.to_series().diff().dt.total_seconds()

# Find the gaps shorter than 2 seconds
gaps = (sequences != sequences.shift(-1)) & (time_diff <= 2)

# Fill the gaps with 1's
input_data.loc[gaps, 'my_event'] = 1

print(input_data)
                     my_event
2023-01-01 00:00:00       0.0
2023-01-01 00:00:01       0.0
2023-01-01 00:00:02       1.0
2023-01-01 00:00:03       1.0
2023-01-01 00:00:04       1.0
2023-01-01 00:00:05       1.0
2023-01-01 00:00:06       0.0
2023-01-01 00:00:07       0.0
2023-01-01 00:00:08       0.0
2023-01-01 00:00:09       1.0
2023-01-01 00:00:10       1.0

Hi! Thanks for your answer! I would have expected the 1 second "gap" at 00:00:04 to be filled with the one. Your output is exactly like the input isn't it? — Axel, Jul 25 '23 at 19:26
Reproducing this answer, I get 2 extra +1 compared to the expected output (?). I attempted to reproduce it because it counts time intervals instead of rows, which makes it robust to changes in acquisition time step (not present in input data, but... interesting). — OCa, Jul 27 '23 at 14:47

OCa · Answer 3 · 2023-07-31T11:55:58.667

(1) Counting rows

Assuming all rows will come with identical timesteps in between them, like in the input data, then a gap of 2 seconds max means exactly one zero, not more, between two ones:

[1,0,1] gets filled as [1,1,1]
[1,0,0,1] stays as [1,0,0,1]

In that case, a rather simple one-liner exists using .shift:

import numpy as np
import pandas as pd

# input & expected data
data = pd.DataFrame(data={'my_event': [0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 1.],
                          'expected': [0., 0., 1., 1., 1., 1., 0., 0., 0., 1., 1.]},
                    index=pd.date_range(start='2023-01-01 00:00:00', end='2023-01-01 00:00:10', freq='s'))

# solution
data['filled'] = np.where((data['my_event']==1) | ((data['my_event'].shift(-1)==1) & (data['my_event'].shift(1)==1)), 1 , 0)

Output:

                     my_event  expected  filled
2023-01-01 00:00:00       0.0       0.0       0
2023-01-01 00:00:01       0.0       0.0       0
2023-01-01 00:00:02       1.0       1.0       1
2023-01-01 00:00:03       1.0       1.0       1
2023-01-01 00:00:04       0.0       1.0       1
2023-01-01 00:00:05       1.0       1.0       1
2023-01-01 00:00:06       0.0       0.0       0
2023-01-01 00:00:07       0.0       0.0       0
2023-01-01 00:00:08       0.0       0.0       0
2023-01-01 00:00:09       1.0       1.0       1
2023-01-01 00:00:10       1.0       1.0       1

Only the fifth row (00:00:04) is filled, which is the desired output.
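Since the question asks for a configurable gap length, here is a minimal sketch of how the row-counting idea could be generalized to gaps of up to `max_rows` zeros. It still assumes evenly spaced rows, and the function name `fill_short_gaps` and its parameters are hypothetical, not part of the original answer. It labels each run of identical values, measures the run length, and only fills runs of zeros that are short enough and have a detection on both sides:

```python
import pandas as pd


def fill_short_gaps(events: pd.Series, max_rows: int) -> pd.Series:
    """Return a copy where runs of zeros no longer than max_rows,
    enclosed by ones on both sides, are replaced by 1.
    (Hypothetical helper; assumes evenly spaced rows of 0/1 values.)"""
    is_zero = events.eq(0)
    is_one = events.eq(1)
    # Give each run of identical consecutive values its own id
    run_id = events.ne(events.shift()).cumsum()
    # Length of the run that each row belongs to
    run_len = run_id.map(run_id.value_counts())
    # Exclude leading and trailing zeros: a detection must exist
    # somewhere before and somewhere after the row
    one_before = is_one.cumsum().gt(0)
    one_after = is_one.iloc[::-1].cumsum().iloc[::-1].gt(0)
    fill = is_zero & run_len.le(max_rows) & one_before & one_after
    return events.mask(fill, 1.0)


# Test input from the comments: gaps of 1, 2 and 3 rows
test = pd.Series([0., 1., 1., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1.])
print(fill_short_gaps(test, max_rows=2).tolist())
```

Because whole runs are either filled or left alone, a gap longer than `max_rows` is never filled half-way.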

(2) Alternatively, counting time

Now, without assuming that the time step is constant throughout the dataset, I have not found an easy-looking method, but I did find one that works. Step by step:

1. Get the gap size: take the subset of the dataframe with detections only (value 1), then compute the time difference from each row to the next. It appears .diff() works on a column, not on the index, so first copy the datetime index into a column.
2. Merge both datasets.
3. Back-fill the gap size over each gap.
4. Replace zeroes with ones where the gap size does not exceed an arbitrary maximum.

It may look like a lot of steps, but on the upside, because this is not counting rows but datetime intervals, this method is robust against changes in acquisition frequency or missing time points.

Step 1 get gap size:

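# assumption: input_data is the `data` frame from part (1), without the 'filled' column
input_data = data[['my_event', 'expected']].copy()

# copy time from index to column for use by .diff()
input_data['time'] = input_data.index

# View subset: rows with detection only (.copy() avoids SettingWithCopyWarning)
Diffs = input_data.loc[input_data['my_event']==1, ['time']].copy()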

# Calculate time interval between consecutive detections
Diffs['gap_size'] = Diffs['time'].diff()

# Output:
Diffs
                                   time        gap_size
2023-01-01 00:00:02 2023-01-01 00:00:02             NaT
2023-01-01 00:00:03 2023-01-01 00:00:03 0 days 00:00:01
2023-01-01 00:00:05 2023-01-01 00:00:05 0 days 00:00:02
2023-01-01 00:00:09 2023-01-01 00:00:09 0 days 00:00:04
2023-01-01 00:00:10 2023-01-01 00:00:10 0 days 00:00:01

Step 2 merge both datasets

df = pd.concat([input_data, Diffs['gap_size']], axis=1).drop(['time'], axis=1)
df
                     my_event  expected        gap_size
2023-01-01 00:00:00       0.0       0.0             NaT
2023-01-01 00:00:01       0.0       0.0             NaT
2023-01-01 00:00:02       1.0       1.0             NaT
2023-01-01 00:00:03       1.0       1.0 0 days 00:00:01
2023-01-01 00:00:04       0.0       1.0             NaT
2023-01-01 00:00:05       1.0       1.0 0 days 00:00:02
2023-01-01 00:00:06       0.0       0.0             NaT
2023-01-01 00:00:07       0.0       0.0             NaT
2023-01-01 00:00:08       0.0       0.0             NaT
2023-01-01 00:00:09       1.0       1.0 0 days 00:00:04
2023-01-01 00:00:10       1.0       1.0 0 days 00:00:01

Step 3 back-fill

df['fill_gap_size'] = df['gap_size'].bfill()
df
                     my_event  expected        gap_size   fill_gap_size
2023-01-01 00:00:00       0.0       0.0             NaT 0 days 00:00:01
2023-01-01 00:00:01       0.0       0.0             NaT 0 days 00:00:01
2023-01-01 00:00:02       1.0       1.0             NaT 0 days 00:00:01
2023-01-01 00:00:03       1.0       1.0 0 days 00:00:01 0 days 00:00:01
2023-01-01 00:00:04       0.0       1.0             NaT 0 days 00:00:02
2023-01-01 00:00:05       1.0       1.0 0 days 00:00:02 0 days 00:00:02
2023-01-01 00:00:06       0.0       0.0             NaT 0 days 00:00:04
2023-01-01 00:00:07       0.0       0.0             NaT 0 days 00:00:04
2023-01-01 00:00:08       0.0       0.0             NaT 0 days 00:00:04
2023-01-01 00:00:09       1.0       1.0 0 days 00:00:04 0 days 00:00:04
2023-01-01 00:00:10       1.0       1.0 0 days 00:00:01 0 days 00:00:01

Step 4 conditional fill with defined max interval

# define arbitrary max interval
max_interval = np.timedelta64(2, 's')

# duplicate original signal
df['fill_event'] = df['my_event']

# cumulative count of detections, used below to avoid filling the top rows
# (zeros that come before the first detection)
df['cumsum'] = df['my_event'].cumsum()

# conditional substitution: the row is a zero, it sits in a small gap,
# and at least one detection precedes it
df.loc[(df['my_event']==0)
         & (df['fill_gap_size']<=max_interval)
         & (df['cumsum']>0), 'fill_event'] = 1

# remove temporary columns
df.drop(['gap_size','fill_gap_size','cumsum'],axis=1, inplace=True)
df
                     my_event  expected  fill_event
2023-01-01 00:00:00       0.0       0.0         0.0
2023-01-01 00:00:01       0.0       0.0         0.0
2023-01-01 00:00:02       1.0       1.0         1.0
2023-01-01 00:00:03       1.0       1.0         1.0
2023-01-01 00:00:04       0.0       1.0         1.0
2023-01-01 00:00:05       1.0       1.0         1.0
2023-01-01 00:00:06       0.0       0.0         0.0
2023-01-01 00:00:07       0.0       0.0         0.0
2023-01-01 00:00:08       0.0       0.0         0.0
2023-01-01 00:00:09       1.0       1.0         1.0
2023-01-01 00:00:10       1.0       1.0         1.0

So, fill_event == expected: success!
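For reuse, the same time-based idea can be condensed into one function. This is only a sketch under assumptions: the name `fill_gaps_by_time` is hypothetical, and instead of the concat and back-fill steps above it forward-fills and back-fills the detection timestamps, so each zero knows the detections that bracket it. The gap size is still the time between those two detections:

```python
import pandas as pd


def fill_gaps_by_time(events: pd.Series, max_gap: pd.Timedelta) -> pd.Series:
    """Replace zeros by 1 when the time between the surrounding
    detections is at most max_gap.
    (Hypothetical helper; expects a DatetimeIndex and 0/1 values.)"""
    times = events.index.to_series()
    # Timestamp at detections, NaT everywhere else
    det_time = times.where(events.eq(1))
    prev_det = det_time.ffill()   # last detection at or before each row
    next_det = det_time.bfill()   # next detection at or after each row
    # NaT for zeros before the first or after the last detection,
    # so those rows never pass the comparison below
    gap = next_det - prev_det
    fill = events.eq(0) & gap.le(max_gap)
    return events.mask(fill, 1.0)


# Irregular time steps: a 2 s gap followed by a 9 s gap
idx = pd.to_datetime(['2023-01-01 00:00:00', '2023-01-01 00:00:01',
                      '2023-01-01 00:00:02', '2023-01-01 00:00:10',
                      '2023-01-01 00:00:11'])
irregular = pd.Series([1., 0., 1., 0., 1.], index=idx)
print(fill_gaps_by_time(irregular, pd.Timedelta(seconds=2)))
```

With the irregular index above, only the zero at 00:00:01 is filled; the zero at 00:00:10 sits in a 9 second gap and stays as it is.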

MZij · Answer 4 · 2023-07-26T08:03:35.160

An easy and straightforward way to handle this is to let each row look at the two rows above and below. From your question, I assume that the rows are evenly spaced at 1 second intervals. Using pandas, this can be done very quickly and efficiently using the .loc and .shift() methods.

The example below should get the expected result in your question. It creates four new columns which contain the shifted data of the my_event column, so that each row can easily look into the data of other rows. Next you can use conditional statements to determine which 0 should be a 1. Your example lacked a situation with a gap of two zeroes, so I added some rows myself.

import pandas as pd

input_data = pd.DataFrame(data={'my_event': [0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 1., 0., 0., 1., 1.,]},
                          index=pd.date_range(start='2023-01-01 00:00:00', end='2023-01-01 00:00:14', freq='s'))

#created new columns with shifted data so that each row can 'look' at the data in the rows above and below
input_data["shift1"] = input_data["my_event"].shift(1)
input_data["shift-1"] = input_data["my_event"].shift(-1)
input_data["shift2"] = input_data["my_event"].shift(2)
input_data["shift-2"] = input_data["my_event"].shift(-2)

# use .loc and conditional statements to find, for each row, the locations
# that lie in a gap of 1 or 2 rows, and set these to value 1
idx_gap_1 = input_data["my_event"].loc[(input_data["my_event"] == 0) & (input_data["shift1"] == 1) & (input_data["shift-1"] ==1)].index
idx_gap_2 = input_data["my_event"].loc[(input_data["my_event"] == 0) & (input_data["shift1"] == 1) & (input_data["shift-2"] ==1)].index
idx_gap_3 = input_data["my_event"].loc[(input_data["my_event"] == 0) & (input_data["shift-1"] == 1) & (input_data["shift2"] ==1)].index
# .union combines the three index sets (| on an Index is deprecated as a set operation)
fill_idx = idx_gap_1.union(idx_gap_2).union(idx_gap_3)
input_data.loc[fill_idx, "my_event"] = 1


print(input_data["my_event"])

Result:

2023-01-01 00:00:00    0.0
2023-01-01 00:00:01    0.0
2023-01-01 00:00:02    1.0
2023-01-01 00:00:03    1.0
2023-01-01 00:00:04    1.0
2023-01-01 00:00:05    1.0
2023-01-01 00:00:06    0.0
2023-01-01 00:00:07    0.0
2023-01-01 00:00:08    0.0
2023-01-01 00:00:09    1.0
2023-01-01 00:00:10    1.0
2023-01-01 00:00:11    1.0
2023-01-01 00:00:12    1.0
2023-01-01 00:00:13    1.0
2023-01-01 00:00:14    1.0
