
I have searched for this problem in many places and couldn't find the right tools.

I have a simple time series:

print(anom)
0        0
1        0
2        0
3        0
4        0
        ..
52777    1
52778    1
52779    0
52780    1

For any run of values equal to 1 that spans a certain number of time instances (for example, 1000), I want to mark those values as anomalies (True). Anything else should be ignored (False).

How do I achieve this with pandas or numpy?

I also want to plot those anomalies, in red for example. How do we achieve that?

How do I mark those anomalies (values = 1 that span around 1000 time instances) in red?

T PHo

2 Answers


It is not entirely clear what output you expect, but consider the following dataset, similar to yours:

import numpy as np
import pandas as pd

s = pd.Series(np.random.choice([0,1], size=100, p=[0.7, 0.3]), name='anom')
0     1
1     0
2     0
3     0
4     0
     ..
95    0
96    1
97    1
98    0
99    1
Name: anom, Length: 100, dtype: int64

It looks like this:

input data

filtering based on consecutive values

First, we calculate the length of each stretch of 1s:

length = s.groupby(((s-s.shift().fillna(0)).eq(1).cumsum()*s)).transform(len)*s

This works by first identifying the first element of each stretch with (s-s.shift().fillna(0)).eq(1): the difference between an element and the preceding one is 1 only for a 1 preceded by a 0 (see graph #2 below). The cumulative sum then produces increasing group numbers (graph #3), where each group contains one stretch of 1s and the stretch of 0s that follows it. Multiplying by s keeps only the 1s in each group (graph #4). We can now group the data per stretch and compute each stretch's length (graph #5). All the 0s end up in a single group, so finally we remove them by multiplying by s again (graph #6).
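The one-liner can be easier to follow when split into its intermediate steps. The sketch below does that on a small hand-made series (the data and the intermediate variable names are assumptions made for illustration, not part of the original answer):

```python
import pandas as pd

# Small hand-made series (illustrative data, not the answer's random dataset)
s = pd.Series([0, 1, 1, 0, 0, 1, 1, 1, 0, 1], name='anom')

# Step 1: True where a stretch of 1s begins (a 1 preceded by a 0, or by nothing)
starts = (s - s.shift().fillna(0)).eq(1)

# Step 2: running count of starts; a stretch of 1s and the 0s after it share an id
group_id = starts.cumsum()

# Step 3: multiply by s so that every 0 gets id 0 and only the 1s keep their id
ones_id = group_id * s

# Step 4: size of each group, broadcast back to every row
size = s.groupby(ones_id).transform(len)

# Step 5: multiply by s again so that the 0s get length 0
length = size * s

print(length.tolist())  # [0, 2, 2, 0, 0, 3, 3, 3, 0, 1]
```

Each 1 ends up carrying the length of the stretch it belongs to, which is what the threshold comparison needs.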

Here is the visual representation of the successive steps where (…) denotes the previous step in each graph:

breakdown of stretches length calculation

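Then we split the data using a threshold on the stretch length (here 10):

# Stretches shorter than 10 (and all the 0s, whose length is 0) count as valid
s_valid = s.loc[length<10]
# Everything else belongs to a stretch of at least 10 consecutive 1s
s_anom = s.drop(s_valid.index)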

ax = s_valid.plot(marker='o', ls='')
s_anom.plot(marker='o', ls='', ax=ax, color='r')

line+dots

ax = s.plot()
s_anom.plot(marker='o', ls='', ax=ax, color='r')

line+dots

Another example, with 7 as the threshold:

line+dots ; 7 as threshold
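Since the question asks for a True/False marking, the same computation could be wrapped in a small helper that returns a boolean mask. This is only a sketch: the function name `mark_long_runs` and the parameter `min_len` are hypothetical, and the input is assumed to contain only 0s and 1s.

```python
import pandas as pd

def mark_long_runs(s, min_len):
    """Boolean mask: True where s is part of a run of 1s of length >= min_len.

    Hypothetical helper; assumes s contains only 0s and 1s.
    """
    starts = (s - s.shift().fillna(0)).eq(1)       # first element of each run of 1s
    run_id = starts.cumsum() * s                   # run id for the 1s, 0 for the 0s
    length = s.groupby(run_id).transform(len) * s  # run length on 1s, 0 on 0s
    return length >= min_len

s = pd.Series([1, 1, 0, 1, 1, 1, 1, 0, 0, 1])
is_anom = mark_long_runs(s, min_len=3)
print(is_anom.tolist())
# [False, False, False, True, True, True, True, False, False, False]
```

With such a mask, `s[is_anom]` and `s[~is_anom]` would play the roles of `s_anom` and `s_valid` above.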

original answer


You can simply convert to bool to get the anomalies:

>>> s.astype(bool)
0      True
1     False
2     False
3     False
4     False
      ...  
95    False
96     True
97     True
98    False
99     True
Name: anom, Length: 100, dtype: bool

Regarding the plot, depending on what you expect you can do:

s_valid = s.loc[~s.astype(bool)]
s_anom = s.loc[s.astype(bool)]

ax = s_valid.plot(marker='o', ls='')
s_anom.plot(marker='o', ls='', ax=ax, color='r')

output:

data as dots anomalies in red

s_anom = s.loc[s.astype(bool)]
ax = s.plot()
s_anom.plot(marker='o', ls='', ax=ax, color='r')

data as lines anomalies as red dots

mozway
  • Thanks for the solution, but one more condition needs to be specified: I want to mark data as an anomaly only if it is consecutively 1 in 90% of the 1000-sample window. Otherwise, it is not an anomaly. Would that make sense? – T PHo Jul 20 '21 at 08:47
  • Yes, no problem, see my updated answer. Here I used a smaller dataset with 10 consecutive 1s as a threshold. – mozway Jul 20 '21 at 09:15
  • Hi friend, I just came back to this project. Could you kindly explain this function to me? ```s.groupby(((s-s.shift().fillna(0)).eq(1).cumsum()*s)).transform(len)*s``` – T PHo Jul 23 '21 at 06:15
  • Sure, I added a paragraph to explain it. Let me know if this is unclear – mozway Jul 23 '21 at 08:30
  • Hey buddy, I've got your explanation and worked it out in a Python notebook. It's all making sense now. How did you come up with this approach? I have no idea where to find this intuition. – T PHo Jul 24 '21 at 05:09
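The "90% of the window" condition in the first comment can be read in more than one way. The updated answer treats it as a minimum number of consecutive 1s. Another possible reading is a density test: flag every sample that lies inside at least one fixed-size window whose share of 1s reaches a given fraction. The sketch below implements that alternative reading with a rolling mean; it is not the answer's method, and the function name `mark_dense_windows` and its parameters are hypothetical.

```python
import pandas as pd

def mark_dense_windows(s, window, min_frac=0.9):
    """True for samples inside at least one window whose share of 1s >= min_frac.

    Hypothetical helper; assumes s contains only 0s and 1s.
    """
    # Share of 1s in the window ending at each position (NaN until the window is full)
    frac = s.rolling(window).mean()
    dense_end = (frac >= min_frac).astype(int)
    # Spread each qualifying window end back over the `window` samples it covers:
    # reverse, take a rolling max, reverse again
    covered = dense_end[::-1].rolling(window, min_periods=1).max()[::-1]
    return covered.astype(bool)

s = pd.Series([0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0])
flags = mark_dense_windows(s, window=4, min_frac=0.75)
print(flags.tolist())
# [False, False, False, False, True, True, True, True, True, True, False, False]
```

Note that under this reading a few 0s (here at positions 4 and 9) are flagged too, because they fall inside a window that is dense enough overall.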

Setup

import numpy as np
import pandas as pd

np.random.seed(1)
anom = pd.Series(np.random.choice([0, 1], p=[0.2, 0.8], size=10000))

print(anom)

0       1
1       1
2       0
3       1
4       0
       ..
9995    1
9996    1
9997    0
9998    0
9999    1
Length: 10000, dtype: int64

Solution

Detection of anomalies

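m = anom == 1  # True where the value is 1
# (~m).cumsum() increases at every 0, so each run of 1s gets its own group key
c = anom[m].groupby((~m).cumsum()).transform('count')  # length of the run each 1 belongs to
a = c[c > 25].clip(upper=1) # Detection threshold=25; keep only longer runs, set to 1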
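To see why grouping on `(~m).cumsum()` isolates each run of 1s, it can help to trace the three lines on a short series. The data and the threshold of 2 below are illustrative assumptions, chosen so the result can be checked by hand:

```python
import pandas as pd

anom = pd.Series([1, 1, 0, 1, 1, 1, 0, 0, 1])

m = anom == 1
# The counter goes up by one at every 0, so the 1s between two 0s share a key
key = (~m).cumsum()
print(key.tolist())   # [0, 0, 1, 1, 1, 1, 2, 3, 3]

# Only the rows where anom == 1 are grouped; each gets the size of its run
c = anom[m].groupby(key).transform('count')
print(c.tolist())     # [2, 2, 3, 3, 3, 1]

# Keep runs longer than the (illustrative) threshold of 2 and set them to 1
a = c[c > 2].clip(upper=1)
print(a.index.tolist(), a.tolist())   # [3, 4, 5] [1, 1, 1]
```

The result `a` only contains the positions that belong to a long enough run, which is why it can be passed directly to the plotting call.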

Plotting the detected anomalies

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 8))
bars = ax.bar(a.index, a, facecolor='red', edgecolor='red')

Result

[Figure: bar plot of the detected anomalies, in red]
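Drawing one bar per anomalous sample can get slow on long series. A possible alternative is to collapse the detected positions into (start, end) intervals and shade each interval once. The sketch below only computes the intervals; the stand-in values for `a` are made up, and it assumes the index holds consecutive integer positions.

```python
import pandas as pd

# Stand-in for the detected anomalies `a` (index = anomalous positions; made-up values)
a = pd.Series(1, index=[3, 4, 5, 9, 10, 20])

idx = pd.Series(a.index)
# A new block starts wherever the gap to the previous position is not exactly 1
block = idx.diff().ne(1).cumsum()
# First and last position of each block
spans = idx.groupby(block).agg(['min', 'max'])
spans_list = list(spans.itertuples(index=False, name=None))
print(spans_list)   # [(3, 5), (9, 10), (20, 20)]
```

Each pair could then be passed to matplotlib's `ax.axvspan(start, end, color='red', alpha=0.3)` to shade the corresponding region on an existing plot of the series.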

Shubham Sharma