0

I have a time series data with polynomial features that exceeds the upper and lower thresholds multiple times,

I want to count how many times the upper and lower thresholds are exceeded.

for example my upper threshold is 35°C and my lower threshold is -45°C.

How do I write a function which counts how many times the data exceeds the upper and lower thresholds and also the time when the data was in range?

Is there a pythonic way to solve this problem?

2 Answers2

4

If your data can contain "runs" of consecutive values above, below, or between the thresholds and you would like to count the runs instead of individual data points, you could label your data, collapse consecutive labels, filter, and count:

In [64]: df = pd.DataFrame({'Temp': [50, 47.7, 45, 0, 0, -1, -1, -2, -10, -30,
    ...:                             -45, -45, -46, -20, -1, 2, 2, 10, 10, 20,
    ...:                             35.5, 35, 36, 20, 0, -10, -45.1, -50]})

Create the labels:

In [65]: df['Category'] = 0

In [66]: df.loc[df['Temp'] <= -45, 'Category'] = -1

In [67]: df.loc[df['Temp'] >= 35, 'Category'] = 1

In [68]: df
Out[68]: 
    Temp  Category
0   50.0         1
1   47.7         1
2   45.0         1
3    0.0         0
...
9  -30.0         0
10 -45.0        -1
11 -45.0        -1
12 -46.0        -1
13 -20.0         0
...
19  20.0         0
20  35.5         1
21  35.0         1
22  36.0         1
23  20.0         0
24   0.0         0
25 -10.0         0
26 -45.1        -1
27 -50.0        -1

Then use Series.shift() to compare and collapse consecutive values:

In [69]: df[df['Category'].shift() != df['Category']]
Out[69]: 
    Temp  Category
0   50.0         1
3    0.0         0
10 -45.0        -1
13 -20.0         0
20  35.5         1
23  20.0         0
26 -45.1        -1

From there it is simple to filter and count based on category:

In [70]: collapsed = df[df['Category'].shift() != df['Category']]

In [71]: (collapsed['Category'] != 0).sum()
Out[71]: 4

In [72]: (collapsed['Category'] == 0).sum()
Out[72]: 3

Series.value_counts() may be of use as well:

In [73]: collapsed['Category'].value_counts()
Out[73]: 
 0    3
-1    2
 1    2
Name: Category, dtype: int64

How do I write a function which ... the time when the data was in range?

If you have time series data, it's easy to shift the collapsed data once more in order to calculate the durations of runs (demonstrated here using the integer index):

In [74]: fake_time_series = collapsed.reset_index()

In [75]: fake_time_series
Out[75]: 
   index  Temp  Category
0      0  50.0         1
1      3   0.0         0
2     10 -45.0        -1
3     13 -20.0         0
4     20  35.5         1
5     23  20.0         0
6     26 -45.1        -1

In [76]: fake_time_series.shift(-1)['index'] - fake_time_series['index']
Out[76]: 
0    3.0
1    7.0
2    3.0
3    7.0
4    3.0
5    3.0
6    NaN
Name: index, dtype: float64
Ilja Everilä
  • 50,538
  • 7
  • 126
  • 127
1

I believe need between for boolean mask, invert it ~ and sum of Trues:

print ((~df['data'].between(-45, 35)).sum())

Sample:

df = pd.DataFrame({'data':[-47,10,0,30,50]})
print (df)
   data
0   -47
1    10
2     0
3    30
4    50

print ((~df['data'].between(-45, 35)).sum())
2

Details:

print (df['data'].between(-45, 35))
0    False
1     True
2     True
3     True
4    False
Name: data, dtype: bool

print (~df['data'].between(-45, 35))
0     True
1    False
2    False
3    False
4     True
Name: data, dtype: bool
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252