2

I'm getting some timestamped data (shown below) through an API, and I want to check starting from the most recent entry (which in this case is the last row) how long a certain column value been consecutively greater than a certain threshold number. Here is some sample data that I converted to a df:

    ID          Timestamp               Value
0   20779453    2021-01-18 09:15:00Z    62.47612
1   20779453    2021-01-18 09:20:00Z    54.56400
2   20779453    2021-01-18 09:25:00Z    64.95384    
3   20779453    2021-01-18 09:30:00Z    63.62500    
4   20779453    2021-01-18 09:35:00Z    61.51790

In this case, I want to check how long the Value variable has been greater than 60 (the answer here is 3 times ie 15 mins). I'm not strong with python, and my instinct is to use a for loop to iterate through the df and keep a counter, but I know that using loops with dfs is not a good practice. Here I've created a sample df

data = [[20779453, '2021-01-18 09:15:00Z', 62.47612], [20779453, '2021-01-18 09:20:00Z', 54.56400], 
[20779453, '2021-01-18 09:25:00Z', 64.95384], [20779453, '2021-01-18 09:30:00Z', 63.62500], 
[20779453, '2021-01-18 09:35:00Z', 61.51790]] 
df = pd.DataFrame(data, columns = ['ID', 'Timestamp', 'Value'])

count = df[df['Value']>60]['Value'].count()
count

Output based on above code: 4

Expected output: 3 (since the latest timestamp, the number of times the value has consecutively been greater than 60 is 3)

EDIT: Another test case based on Akshay's answer:

data = [[20779453, '2021-01-18 09:15:00Z', 62.47612], [20779453, '2021-01-18 09:20:00Z', 54.56400], 
[20779453, '2021-01-18 09:25:00Z', 64.95384], [20779453, '2021-01-18 09:30:00Z', 63.62500], 
[20779453, '2021-01-18 09:35:00Z', 61.51790], [20779453, '2021-01-18 09:40:00Z', 63.62500], 
[20779453, '2021-01-18 09:40:00Z', 53.62500],[20779453, '2021-01-18 09:45:00Z', 61.51790]] 
df1 = pd.DataFrame(data, columns = ['ID', 'Timestamp', 'Value'])

from itertools import groupby
max([len(list(g)) for k, g in groupby(df['Value']>60) if k==True])

Expected output: 1

Current output: 4

I think what I didn't get across well is that I only care about the latest subsequent value and not the longest. Any ideas?

Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158
Evan Krause
  • 147
  • 1
  • 16

2 Answers2

2

IIUC, you want the length of the longest of sequence from the last timestamp (backwards) where the value is > 60.

Method 1: df.expanding()

You can use df.expanding for this -

sum(df['Value'][::-1].expanding().apply(lambda x: np.all(x>60)))
# 3 

For second example that you shared,

sum(df1['Value'][::-1].expanding().apply(lambda x: np.all(x>60)))
# 1

Explanation -

df1['Value'][::-1] reverses the series and .expanding() applies a sequential check on the expanding groups [0, 0-1, 0-2, 0-3, 0-4...] to check if EVERY value in that group is > 60. If it is it returns 1 else 0. Summing this up will give you the longest latest sequence of timestamps where the condition is met. Check documentation here.


Method 2: itertools.groupby

Try itertools groupby -

from itertools import groupby
[len(list(g)) for k, g in groupby(df['Value']>60) if k==True][-1]
# 3

For second example that you shared,

from itertools import groupby
[len(list(g)) for k, g in groupby(df1['Value']>60) if k==True][-1]
# 1

Explanation -

Itertools groupby is designed specifically to handle sequentially occurring groups.

Example from their documentation -

[k for k, g in groupby('AAAABBBCCDAABBB')] #--> A B C D A B
[list(g) for k, g in groupby('AAAABBBCCD')] #--> AAAA BBB CC D

The groupby df['Value']>60 returns sequential groups of True and False separated into separate lists -

[list(g) for k, g in groupby(df['Value']>60)]
#[[True], [False], [True, True, True]]

SO all you have to do is filter for only True values (where > 60) using k==True and count the lengths of the groups using len(list(g)). Finally, get the last one from them using [-1].

Akshay Sehgal
  • 18,741
  • 3
  • 21
  • 51
1

Here is an alternative solution using numpy and I added a few rows to the example df. This is a vectorized solution, no for loops used however, this does not imply that using df.iterrows() is inefficient, in fact it can be done in Oá´º time by keeping track of the longest interval like you thought of initially.

The basic idea:

  • get df subset that is above price threshold
  • get longest index stretch
  • subtract timestamps at start, end indices

import numpy as np
import pandas as pd


def max_time_stretch(threshold, prices, cols=('id', 'timestamp', 'price')):
    prices.columns = cols
    print(f'{p}\n')
    prices[cols[1]] = pd.to_datetime(prices[cols[1]])
    target = prices[prices[cols[-1]] > threshold].index.values
    longest_seq = max(np.split(target, np.where(np.diff(target) != 1)[0] + 1), key=len)
    if len(longest_seq) >= 2:
        first_idx, last_idx = longest_seq[0], longest_seq[-1]
        return prices[cols[1]].loc[last_idx] - prices[cols[1]].loc[first_idx]


if __name__ == '__main__':
    p = pd.DataFrame(
        [
            [
                20779453,
                20779453,
                20779453,
                20779453,
                20779453,
                20779453,
                20779453,
                20779453,
            ],
            [
                '2021-01-18 09:15:00Z',
                '2021-01-18 09:20:00Z',
                '2021-01-18 09:25:00Z',
                '2021-01-18 09:30:00Z',
                '2021-01-18 09:35:00Z',
                '2021-01-18 09:40:00Z',
                '2021-01-18 09:45:00Z',
                '2021-01-18 09:50:00Z',
            ],
            [62.47612, 54.56400, 64.95384, 63.62500, 61.51790, 62.25435, 60, 60],
        ],
    ).T
    print(f'Maximum interval: {max_time_stretch(60, p)}')

Out:

         id             timestamp    price
0  20779453  2021-01-18 09:15:00Z  62.4761
1  20779453  2021-01-18 09:20:00Z   54.564
2  20779453  2021-01-18 09:25:00Z  64.9538  # start
3  20779453  2021-01-18 09:30:00Z   63.625
4  20779453  2021-01-18 09:35:00Z  61.5179
5  20779453  2021-01-18 09:40:00Z  62.2544  # end
6  20779453  2021-01-18 09:45:00Z       60
7  20779453  2021-01-18 09:50:00Z       60

Maximum interval: 0 days 00:15:00
watch-this
  • 1
  • 4
  • 20