

I have just started learning Python and am struggling with this code. I have a single-column dataframe of numeric values.

I want to find the first occurrence of a window in the dataframe which has a certain number of values greater than a threshold.

For example:

Let's say the dataframe has 1,000,000 values. I want to divide it into sliding windows of 1000 values and check whether a window has at least 10 values greater than a certain threshold. If the first window (points 0-999) does not have at least 10 values above the threshold, the window slides to cover values 1-1000, and so on. I need to find the start index of the first window that has at least 10 values greater than the threshold.

Also, since I am dealing with streaming data here, I need to stop the search as soon as such a window occurs.
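One way to sketch the streaming early stop (a hypothetical helper, not code from the question) is to keep a running count of how many of the last `window` values exceed the threshold, updating it incrementally as each value arrives and returning as soon as the count is reached:

```python
from collections import deque

def first_anomalous_window(stream, window, threshold, count_threshold):
    """Return the start index of the first run of `window` consecutive
    values containing at least `count_threshold` values above `threshold`,
    or None if no such window occurs. Consumes any iterable, so it can
    process streaming data and stop early."""
    buf = deque(maxlen=window)
    over = 0  # how many values currently in buf exceed the threshold
    for i, value in enumerate(stream):
        if len(buf) == window and buf[0] > threshold:
            over -= 1  # oldest value is about to fall out of the window
        buf.append(value)
        if value > threshold:
            over += 1
        if len(buf) == window and over >= count_threshold:
            return i - window + 1  # start index of the qualifying window
    return None
```

Each value is examined once, so this avoids rescanning the full window on every slide, and it stops at the first hit.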

I tried this code but was getting a KeyError (because `data[i+j]` on a dataframe selects a column, not a row) and could not solve the problem.

for i in np.arange(0, len(data) - 999):
    var_count = 0  # reset the count for every new window
    for j in np.arange(0, 1000):
        # data[i+j] does column lookup and raises a KeyError;
        # use .iloc for position-based row indexing instead
        if data.iloc[i + j, 0] > threshold:
            var_count = var_count + 1
        if var_count >= 10:
            print("Anomaly has occurred in window starting at index", i)
            break

My actual data is a single-column dataframe with around 1.8 million rows.


A small sample could look like this:

data_sample=[1,1,0,0,0,2,1,1,1,1,1,2,1,1,1,1,1,1,2,1,2,2,1,0,0,2,2,2,2,1,1,1]            
data_sample=pd.DataFrame(data_sample)

threshold=1
window=5

In this sample I need at least 2 values which are greater than 1, which should return index 18, as at that index my window of length 5 has at least 2 values greater than 1.

Bhakti
  • How do you get to 11? By index 8 you have 2 values above 1. You only get 3 values above 1 at index 12... – Dan Aug 13 '19 at 13:35
  • I had updated the wrong sample data, I have again updated the question with correct data. – Bhakti Aug 13 '19 at 13:48

1 Answer


You can do it with convolution:

threshold = 10
window_size = 5
count_threshold = 3

kernel = np.ones(window_size)
# boolean array: True where the column exceeds the threshold
over_threshold = (data['relevant_column'] > threshold).values
# running count of True values in each window (a 'full' convolution)
running_count = np.convolve(kernel, over_threshold)
# positions where a window contains at least count_threshold values over the threshold
np.nonzero(running_count >= count_threshold)[0]
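As a sketch of the mechanics on the small sample from the question (the column name `0` comes from building the DataFrame without column names; the 'full' convolution labels each count by the position where the window ends, so subtract `window_size - 1` to recover the window's start):

```python
import numpy as np
import pandas as pd

data_sample = pd.DataFrame([1, 1, 0, 0, 0, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,
                            1, 1, 2, 1, 2, 2, 1, 0, 0, 2, 2, 2, 2, 1, 1, 1])

threshold, window_size, count_threshold = 1, 5, 2

kernel = np.ones(window_size)
over_threshold = (data_sample[0] > threshold).values
running_count = np.convolve(kernel, over_threshold)
hits = np.nonzero(running_count >= count_threshold)[0]
first_end = hits[0]                           # position where the window ends
first_start = first_end - (window_size - 1)   # start index of that window
```

Note that for this sample the first qualifying window starts at index 16 rather than the 18 stated in the question; the comment under the question raises the same counting discrepancy.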

Or a similar idea using pandas rolling:

np.where((data['relevant_column'] > threshold).rolling(window_size).sum() >= count_threshold)
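A minimal end-to-end version of the rolling variant on the same sample (casting the boolean mask to int before `rolling` to be safe across pandas versions; `rolling` also labels each sum by the window's end position, so subtract `window_size - 1` for the start):

```python
import numpy as np
import pandas as pd

data_sample = pd.DataFrame([1, 1, 0, 0, 0, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,
                            1, 1, 2, 1, 2, 2, 1, 0, 0, 2, 2, 2, 2, 1, 1, 1])

threshold, window_size, count_threshold = 1, 5, 2

# rolling sum of the boolean mask = count of over-threshold values per window
counts = (data_sample[0] > threshold).astype(int).rolling(window_size).sum()
hits = np.where(counts >= count_threshold)[0]
first_start = hits[0] - (window_size - 1)
```

The first `window_size - 1` entries of `counts` are NaN (incomplete windows), which compare False against the count threshold and are therefore skipped automatically.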
Dan
  • Thank you for your answer. I have one follow-up question about `over_threshold = (data['relevant_column'] > threshold).values` — would it not require considering the entire dataframe for this line of code? Or have I misunderstood? – Bhakti Aug 13 '19 at 13:07
  • It depends on your DataFrame. You didn't give any example data so it's impossible to know. It sounded like you were only looking in a single column. Perhaps add some simple sample data with your expected output. Keep it small, ~10 values. – Dan Aug 13 '19 at 13:12
  • I have added the sample data. My sample data is just a one-column dataframe which has around 1.8 million rows. For that I am getting this error: ValueError: object too deep for desired array – Bhakti Aug 13 '19 at 13:18
  • @Bhakti please make your sample data code, not an image. Code like `pd.DataFrame([...], columns=[...])` and also include the **expected output**. You don't have to use real values, use ones that demonstrate the principle. Use a small rolling window of like 4 instead of 1000 etc. – Dan Aug 13 '19 at 13:20
  • As for your error - https://stackoverflow.com/questions/15923081/valueerror-object-too-deep-for-desired-array-while-using-convolution – Dan Aug 13 '19 at 13:21
  • Thank you for your answer. I have updated the question including sample data. – Bhakti Aug 13 '19 at 13:29
  • Your problem is that you aren't giving the column a name so I assume you're not extracting the series. You need to do it. If you refuse to give it a name then change `data['relevant_column']` to `data[0]`. – Dan Aug 13 '19 at 13:36
  • I am not getting an error when I slice the dataframe, but I also got a result that was simply every value greater than the threshold. – Bhakti Aug 13 '19 at 13:53
  • @Bhakti subtract the `window_size` from the result. – Dan Aug 13 '19 at 13:55
  • Thanks a lot for your answer. – Bhakti Aug 13 '19 at 14:07