How to mark data as anomalies based on a specific condition in each interval

Question

I have searched for a solution to this problem in many places but couldn't find the right tool.

I have simple time series data:

print(anom)

For any run of consecutive values equal to 1 that spans a given number of time instances (for example, 1000), I want to mark those values as anomalies (True). All other values should be ignored (False).

How do I achieve this with pandas or numpy?

I also want to plot those anomalies, for example in red. How can I do that?

In other words, how do I mark the anomalies (values equal to 1 that span around 1000 time instances) in red?

mozway · Accepted Answer · 2021-07-23T08:30:09.147

It is not exactly clear which output you expect. Yet, let's consider the following dataset similar to yours:

s = pd.Series(np.random.choice([0,1], size=100, p=[0.7, 0.3]), name='anom')

0     1
1     0
2     0
3     0
4     0
     ..
95    0
96    1
97    1
98    0
99    1
Name: anom, Length: 100, dtype: int64

Looking like:

filtering based on consecutive values

First, we calculate the length of each stretch of 1s:

length = s.groupby(((s-s.shift().fillna(0)).eq(1).cumsum()*s)).transform(len)*s

This works by identifying the first element of each stretch with (s-s.shift().fillna(0)).eq(1): the difference between an element and the preceding one is 1 only when a 1 follows a 0 (see graph #2 below). The cumulative sum then assigns an increasing group number (graph #3), so that each group contains one stretch of 1s and the stretch of 0s that follows it. Multiplying by s keeps only the 1s in each group and sends every 0 to group 0 (graph #4). Now we can group the data per stretch and calculate the length of each group (graph #5). Since all the 0s end up in a single group, we finally set their length to 0 by multiplying by s again (graph #6).
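To make the one-liner easier to follow, here is a minimal sketch that splits it into named intermediate steps on a small hand-made series. The sample values and the names starts, group_id, stretch_id, group_len and steps are assumptions chosen for illustration; they are not part of the original answer.

```python
import pandas as pd

# Small hand-made series (illustrative data, not the answer's dataset)
s = pd.Series([0, 1, 1, 0, 1, 1, 1, 0, 0, 1], name='anom')

# True where a 1 directly follows a 0, i.e. at the start of a stretch of 1s
# (the first row is compared against 0 because of fillna(0))
starts = (s - s.shift().fillna(0)).eq(1)

# Running count of starts: each stretch of 1s and the 0s after it share an id
group_id = starts.cumsum()

# Multiplying by s sends every 0 to id 0 and leaves the 1s with their stretch id
stretch_id = group_id * s

# Size of each group, broadcast back onto every row of that group
group_len = s.groupby(stretch_id).transform(len)

# Multiplying by s again resets the rows holding a 0 to length 0
length = group_len * s

# Side-by-side view of every intermediate step
steps = pd.DataFrame({
    's': s,
    'starts': starts,
    'group_id': group_id,
    'stretch_id': stretch_id,
    'group_len': group_len,
    'length': length,
})
print(steps)
```

For this input, length is [0, 2, 2, 0, 3, 3, 3, 0, 0, 1]: every 1 carries the length of the stretch it belongs to, and every 0 carries 0.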

Here is the visual representation of the successive steps where (…) denotes the previous step in each graph:

s_valid = s.loc[length<10]
s_anom = s.drop(s_valid.index)

ax = s_valid.plot(marker='o', ls='')
s_anom.plot(marker='o', ls='', ax=ax, color='r')

ax = s.plot()
s_anom.plot(marker='o', ls='', ax=ax, color='r')
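The question also asks for a True/False marker rather than a filtered series. A minimal sketch of one way to get it from the stretch lengths, assuming a hand-made series and a hypothetical threshold of 4 (the names threshold, is_anomaly and df are illustrative):

```python
import pandas as pd

# Hand-made series: stretches of 1s with lengths 3, 5 and 2 (illustrative data)
s = pd.Series([1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1], name='anom')

# Length of the stretch of 1s each row belongs to (0 for rows holding a 0)
length = s.groupby((s - s.shift().fillna(0)).eq(1).cumsum() * s).transform(len) * s

# Hypothetical threshold: a stretch of at least 4 consecutive 1s is an anomaly
threshold = 4
is_anomaly = length >= threshold

# Keep the flag next to the data so it can be reused for filtering or plotting
df = pd.DataFrame({'anom': s, 'is_anomaly': is_anomaly})
print(df)
```

Here only the stretch of five 1s (rows 5 to 9) is flagged True. The condition length >= threshold is the complement of the length < 10 filter used for s_valid, so s[is_anomaly] selects the same rows as s_anom for the same threshold.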

Another example, with 7 as the threshold:

original answer

You can simply convert to bool to flag the anomalies:

>>> s.astype(bool)
0      True
1     False
2     False
3     False
4     False
      ...  
95    False
96     True
97     True
98    False
99     True
Name: anom, Length: 100, dtype: bool

Regarding the plot, depending on what you expect you can do:

s_valid = s.loc[~s.astype(bool)]
s_anom = s.loc[s.astype(bool)]

ax = s_valid.plot(marker='o', ls='')
s_anom.plot(marker='o', ls='', ax=ax, color='r')

output:

s_anom = s.loc[s.astype(bool)]
ax = s.plot()
s_anom.plot(marker='o', ls='', ax=ax, color='r')

Thanks for the solution, but one more condition needs to be specified: I want to mark data as an anomaly only if it is consecutively 1 in 90% of the 1000-sample window. Otherwise, it is not an anomaly. Does that make sense? — T PHo, Jul 20 '21 at 08:47
Yes, no problem, see my updated answer. Here I used a smaller dataset with 10 consecutive 1s as a threshold. — mozway, Jul 20 '21 at 09:15
Hi friend, I just came back to this project. Could you kindly explain this function to me: ```s.groupby(((s-s.shift().fillna(0)).eq(1).cumsum()*s)).transform(len)*s``` — T PHo, Jul 23 '21 at 06:15
Sure, I added a paragraph to explain it. Let me know if this is unclear — mozway, Jul 23 '21 at 08:30
Hey buddy, I read your explanation and worked through it in a Python notebook. It all makes sense now. How did you come up with this approach? I have no idea where to find this kind of intuition. — T PHo, Jul 24 '21 at 05:09

score 1 · Answer 2 · answered Jul 20 '21 at 09:04

Setup

import numpy as np
import pandas as pd

np.random.seed(1)
anom = pd.Series(np.random.choice([0, 1], p=[0.2, 0.8], size=10000))

print(anom)

0       1
1       1
2       0
3       1
4       0
       ..
9995    1
9996    1
9997    0
9998    0
9999    1
Length: 10000, dtype: int64

Solution

Detection of anomalies

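m = anom == 1  # True where the value is 1
c = anom[m].groupby((~m).cumsum()).transform('count')  # length of the run of 1s each row belongs to
a = c[c > 25].clip(upper=1)  # keep only runs longer than the detection threshold (25)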
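The grouping key (~m).cumsum() increases by one at every 0, so all the 1s between two 0s share the same key. A minimal sketch on a small hand-made series, assuming a hypothetical threshold of 2 in place of 25 (the name flags and the reindex step are additions for illustration, not part of the answer):

```python
import pandas as pd

# Hand-made series with runs of 1s of lengths 2, 3 and 1 (illustrative data)
anom = pd.Series([1, 1, 0, 1, 1, 1, 0, 0, 1])

m = anom == 1

# The key increases at every 0, so each run of 1s gets its own key
key = (~m).cumsum()

# Count the 1s per key; the result only has rows where anom == 1
c = anom[m].groupby(key).transform('count')

# Hypothetical threshold of 2: keep runs strictly longer than 2
a = c[c > 2].clip(upper=1)

# Optional: expand back to the full index as a True/False flag
flags = a.reindex(anom.index, fill_value=0).astype(bool)
print(flags)
```

Only rows 3 to 5 (the run of three 1s) are flagged. Note that c > threshold is a strict comparison, so a run of exactly 2 is not reported here.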

Plotting the detected anomalies

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 8))
bars = ax.bar(a.index, a, facecolor='red', edgecolor='red')  # one red bar per anomalous sample
