How to find the number of consecutive values greater than n, looking back from the most recent date

Question

I'm getting some timestamped data (shown below) through an API, and I want to check starting from the most recent entry (which in this case is the last row) how long a certain column value been consecutively greater than a certain threshold number. Here is some sample data that I converted to a df:

    ID          Timestamp               Value
0   20779453    2021-01-18 09:15:00Z    62.47612
1   20779453    2021-01-18 09:20:00Z    54.56400
2   20779453    2021-01-18 09:25:00Z    64.95384    
3   20779453    2021-01-18 09:30:00Z    63.62500    
4   20779453    2021-01-18 09:35:00Z    61.51790

In this case, I want to check how long the Value variable has been greater than 60 (the answer here is 3 times ie 15 mins). I'm not strong with python, and my instinct is to use a for loop to iterate through the df and keep a counter, but I know that using loops with dfs is not a good practice. Here I've created a sample df

data = [[20779453, '2021-01-18 09:15:00Z', 62.47612], [20779453, '2021-01-18 09:20:00Z', 54.56400], 
[20779453, '2021-01-18 09:25:00Z', 64.95384], [20779453, '2021-01-18 09:30:00Z', 63.62500], 
[20779453, '2021-01-18 09:35:00Z', 61.51790]] 
df = pd.DataFrame(data, columns = ['ID', 'Timestamp', 'Value'])

count = df[df['Value']>60]['Value'].count()
count

Output based on above code: 4

Expected output: 3 (since the latest timestamp, the number of times the value has consecutively been greater than 60 is 3)

EDIT: Another test case based on Akshay's answer:

data = [[20779453, '2021-01-18 09:15:00Z', 62.47612], [20779453, '2021-01-18 09:20:00Z', 54.56400], 
[20779453, '2021-01-18 09:25:00Z', 64.95384], [20779453, '2021-01-18 09:30:00Z', 63.62500], 
[20779453, '2021-01-18 09:35:00Z', 61.51790], [20779453, '2021-01-18 09:40:00Z', 63.62500], 
[20779453, '2021-01-18 09:40:00Z', 53.62500],[20779453, '2021-01-18 09:45:00Z', 61.51790]] 
df1 = pd.DataFrame(data, columns = ['ID', 'Timestamp', 'Value'])

from itertools import groupby
max([len(list(g)) for k, g in groupby(df['Value']>60) if k==True])

Expected output: 1

Current output: 4

I think what I didn't get across well is that I only care about the latest subsequent value and not the longest. Any ideas?

Akshay Sehgal · Accepted Answer · 2021-01-20T05:08:19.123

IIUC, you want the length of the longest of sequence from the last timestamp (backwards) where the value is > 60.

Method 1: df.expanding()

You can use df.expanding for this -

sum(df['Value'][::-1].expanding().apply(lambda x: np.all(x>60)))
# 3

For second example that you shared,

sum(df1['Value'][::-1].expanding().apply(lambda x: np.all(x>60)))
# 1

Explanation -

df1['Value'][::-1] reverses the series and .expanding() applies a sequential check on the expanding groups [0, 0-1, 0-2, 0-3, 0-4...] to check if EVERY value in that group is > 60. If it is it returns 1 else 0. Summing this up will give you the longest latest sequence of timestamps where the condition is met. Check documentation here.

Method 2: itertools.groupby

Try itertools groupby -

from itertools import groupby
[len(list(g)) for k, g in groupby(df['Value']>60) if k==True][-1]
# 3

For second example that you shared,

from itertools import groupby
[len(list(g)) for k, g in groupby(df1['Value']>60) if k==True][-1]
# 1

Explanation -

Itertools groupby is designed specifically to handle sequentially occurring groups.

Example from their documentation -

[k for k, g in groupby('AAAABBBCCDAABBB')] #--> A B C D A B
[list(g) for k, g in groupby('AAAABBBCCD')] #--> AAAA BBB CC D

The groupby df['Value']>60 returns sequential groups of True and False separated into separate lists -

[list(g) for k, g in groupby(df['Value']>60)]
#[[True], [False], [True, True, True]]

SO all you have to do is filter for only True values (where > 60) using k==True and count the lengths of the groups using len(list(g)). Finally, get the last one from them using [-1].

watch-this · Answer 2 · 2021-01-20T06:12:44.260

Here is an alternative solution using numpy and I added a few rows to the example df. This is a vectorized solution, no for loops used however, this does not imply that using df.iterrows() is inefficient, in fact it can be done in Oᴺ time by keeping track of the longest interval like you thought of initially.

The basic idea:

get df subset that is above price threshold
get longest index stretch
subtract timestamps at start, end indices

import numpy as np
import pandas as pd


def max_time_stretch(threshold, prices, cols=('id', 'timestamp', 'price')):
    prices.columns = cols
    print(f'{p}\n')
    prices[cols[1]] = pd.to_datetime(prices[cols[1]])
    target = prices[prices[cols[-1]] > threshold].index.values
    longest_seq = max(np.split(target, np.where(np.diff(target) != 1)[0] + 1), key=len)
    if len(longest_seq) >= 2:
        first_idx, last_idx = longest_seq[0], longest_seq[-1]
        return prices[cols[1]].loc[last_idx] - prices[cols[1]].loc[first_idx]


if __name__ == '__main__':
    p = pd.DataFrame(
        [
            [
                20779453,
                20779453,
                20779453,
                20779453,
                20779453,
                20779453,
                20779453,
                20779453,
            ],
            [
                '2021-01-18 09:15:00Z',
                '2021-01-18 09:20:00Z',
                '2021-01-18 09:25:00Z',
                '2021-01-18 09:30:00Z',
                '2021-01-18 09:35:00Z',
                '2021-01-18 09:40:00Z',
                '2021-01-18 09:45:00Z',
                '2021-01-18 09:50:00Z',
            ],
            [62.47612, 54.56400, 64.95384, 63.62500, 61.51790, 62.25435, 60, 60],
        ],
    ).T
    print(f'Maximum interval: {max_time_stretch(60, p)}')

Out:

         id             timestamp    price
0  20779453  2021-01-18 09:15:00Z  62.4761
1  20779453  2021-01-18 09:20:00Z   54.564
2  20779453  2021-01-18 09:25:00Z  64.9538  # start
3  20779453  2021-01-18 09:30:00Z   63.625
4  20779453  2021-01-18 09:35:00Z  61.5179
5  20779453  2021-01-18 09:40:00Z  62.2544  # end
6  20779453  2021-01-18 09:45:00Z       60
7  20779453  2021-01-18 09:50:00Z       60

Maximum interval: 0 days 00:15:00

How to find the number of consecutive values greater than n, looking back from the most recent date

2 Answers2

Method 1: df.expanding()

Method 2: itertools.groupby