
I have a Series of decimal values with a DatetimeIndex, and I would like to flag each item according to this simple rule:

  • 0 if value - 1 is reached before value + 1 later in the series
  • 1 if it's the other way around

Note that the two offsets can vary: it could be -1 and +2 for example.

Here is an example:

2018-01-04 12:00:00    3550.1
2018-01-04 12:01:00    3551.2
2018-01-04 12:02:00    3550.7
2018-01-04 12:03:00    3551.3
2018-01-04 12:04:00    3550.2
2018-01-04 12:05:00    3549.0
2018-01-04 12:06:00    3549.3
2018-01-04 12:07:00    3548.7
2018-01-04 12:08:00    3549.8
2018-01-04 12:09:00    3545.4
Freq: T, Name: close_1T, dtype: float64

For the first 3 rows, that would give:

  • 1 : 3551.2 is reached on the next row
  • 0 : 3550.2 is reached at 12:04:00
  • 0 : 3549.7 is reached at 12:05:00

I tried this:

se_flag = se.apply(lambda x: 0 if len(se[se > x + 1]) == 0
                        else 1 if len(se[se < x - 1]) == 0
                        else 1 if se[se > x + 1].index[0] < se[se < x - 1].index[0]
                        else 0)

The first two branches of the lambda handle the cases where the value is at or near the highest/lowest point in the Series.

It seems to do the trick, but it scales awfully on my real-world Series of 1M items.

Can you give me some insight into how to make it more performant? Should I cast the Series to a list? Use a def function rather than a lambda?

Thanks a lot for your help.

Gwalchaved

1 Answer


So if I understand you correctly, you're looking for the direction of the first deviation that is larger than a upwards or larger than b downwards (you said the two offsets can vary).

Since the values are 1 and 0, which are essentially the same as True and False, you're saying you want a series where True means the value increases by a before it decreases by b, and False means the opposite.

Note that when iterating over a series like this, it is almost always faster to keep it as a Series than to cast it to a list. And lambdas and def functions are equivalent, apart from syntax. Your real speed-ups come either from vectorisation or from being smarter about what you're doing.

In this case, what you're doing is much more than you need to. You're effectively recomputing the minimum and maximum of the Series for every row, even though they'll stay the same. You're also not doing it with the optimised methods, but with a custom filter. Just compute se.min() and se.max() once and compare x against them directly.
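
For instance, a minimal sketch of your lambda with the two existence checks replaced by comparisons against the precomputed extremes (the last two branches are unchanged):

se_max = se.max()  # computed once, instead of filtering the whole Series per row
se_min = se.min()

se_flag = se.apply(lambda x: 0 if x + 1 >= se_max      # no value above x + 1 exists
                        else 1 if x - 1 <= se_min      # no value below x - 1 exists
                        else 1 if se[se > x + 1].index[0] < se[se < x - 1].index[0]
                        else 0)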

In addition, you're taking the index of the first element where a condition is true, but you're filtering the entire Series instead of stopping as soon as you find the first match. You're essentially doing a lot of unnecessary extra work!
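
In pure Python, that early exit would look something like this (a hypothetical helper, just to illustrate the idea):

def first_index_above(values, threshold):
  # stop scanning at the first match instead of filtering the whole array
  for i, v in enumerate(values):
    if v > threshold:
      return i
  return None  # no value ever crosses the threshold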

Finally, you're looking at the Series from the start instead of from x onwards. Is that on purpose? I assume it isn't, given the example you gave.

In any case, this answer pointed me towards np.argmax, which returns the index of the first True element if the passed numpy array contains only booleans. The only problem is that it returns 0 if no such element exists.
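
A quick illustration of both behaviours:

import numpy as np

a = np.array([3.0, 5.0, 2.0, 7.0])
np.argmax(a > 4)    # 1: the index of the first True
np.argmax(a > 100)  # 0: all elements are False, so argmax falls back to index 0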

Given that caveat, you could do two things: calculate the indices for the lower and the upper bound separately and see which occurs first, or calculate one index over the OR of the two conditions and then look at the value at that index to see which bound was hit. Performance-wise, I think they should be similar, but it depends how far you want to go!
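
A sketch of that second option, assuming the same next_values, x, lower and upper as in the code below:

crossed = (next_values >= x + upper) | (next_values <= x - lower)
if crossed.any():
  idx = np.argmax(crossed)                  # first row crossing either bound
  went_up = next_values[idx] >= x + upper   # check which bound was hit
else:
  went_up = False                           # neither bound is ever reached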

import numpy as np
import pandas as pd

lower = 1; upper = 1
se = pd.Series([3550.1, 3551.2, 3550.7, 3551.3, 3550.2, 3549.0, 3549.3, 3548.7, 3549.8, 3545.4])

# I put this in a function to prevent polluting the global scope with `counter`
def calculate(se):
  counter = 0
  se_values = se.values
  se_len = len(se)  # used to bail out on the last row, since `np.argmax` errors on an empty array

  def goes_up_before_down(x):
    nonlocal counter
    counter += 1
    if counter == se_len: return False
    next_values = se_values[counter:]
    
    idx_first_larger_value = np.argmax(next_values >= x + upper)
    idx_first_smaller_value = np.argmax(next_values <= x - lower)
    
    if idx_first_larger_value == 0 and x + upper > next_values.max():
      # No value in the rest of the series reaches x + upper:
      # we can only go down from here
      return False
    if idx_first_smaller_value == 0 and x - lower < next_values.min():
      # No value in the rest of the series reaches x - lower:
      # we can only go up from here
      return True
    return idx_first_larger_value < idx_first_smaller_value

  return se.apply(goes_up_before_down)

and calling `calculate(se)` yields:

Out[2]: 
0     True
1    False
2    False
3    False
4    False
5    False
6    False
7     True
8    False
9    False
dtype: bool

which can be cast to the 0/1 flags you asked for:

calculate(se).astype(np.int64)
Out[3]:  
0    1
1    0
2    0
3    0
4    0
5    0
6    0
7    1
8    0
9    0
dtype: int64
Ruben Helsloot
  • 1) Yes, that's right - it only prevents getting `np.argmax([])` 2) You're completely right - I updated the answer – Ruben Helsloot Jul 23 '20 at 09:08
  • I also don't see any use of se_values. Should it be: `return np.argmax(se_values[counter:] > x + upper) < np.argmax(se_values[counter:] < x - lower)` ? – Gwalchaved Jul 23 '20 at 09:12
  • I realised I didn't account for `np.argmax` returning 0 if no index is found, so I updated the answer, where I also use `se_values` now! – Ruben Helsloot Jul 23 '20 at 09:29
  • It works nicely, thanks. Processing the million items still takes some time, but it's good enough for my needs. – Gwalchaved Jul 23 '20 at 13:22