I have two pandas DataFrames, df_x and df_y. df_x has two columns, 'high target' and 'low target'. For every row of df_x, I would like to search through df_y and determine whether the 'high target' was reached before the 'low target'. Currently I implement this with .apply, but my code is too slow, since its cost scales linearly with the number of rows in df_x. Any suggestions to optimize/vectorize it?
def efficient_high_after_low(row, minute_df):
    """True if the high target was reached after the low target, else False.

    Args:
        row: a row of df_x (a pandas Series) with 'high target',
            'low target', and 'period_end_idx'.
        minute_df: the minute-level price Series to search through.
    """
    minute_df_after = minute_df.loc[row.period_end_idx + pd.Timedelta(minutes=1):]
    first_highs = minute_df_after.ge(row['high target'])
    first_lows = minute_df_after.le(row['low target'])
    hi_sum, lo_sum = first_highs.sum(), first_lows.sum()
    if len(first_highs) != len(first_lows):
        raise Exception('Unequal length of first_highs and first_lows')
    if len(first_highs) == 0:
        return None
    elif hi_sum == 0 and lo_sum != 0:
        return True
    elif hi_sum != 0 and lo_sum == 0:
        return False
    elif hi_sum == 0 and lo_sum == 0:
        return None
    elif first_highs.idxmax() > first_lows.idxmax():
        return True
    elif first_highs.idxmax() < first_lows.idxmax():
        return False
    else:
        return None
And I do the following to get these boolean values:
df_x.apply(efficient_high_after_low, axis=1, args=(df_y['open'],))
Running the code above on the first 1000 rows of df_x takes 4 seconds.
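For reference, here is a minimal synthetic setup that mirrors my data layout (the column names and the period_end_idx field come from the code above; the prices and sizes are made up, and high_after_low is just a condensed version of the same per-row logic):

```python
import numpy as np
import pandas as pd

# Minute-level price series (stand-in for df_y['open']); prices are synthetic
idx = pd.date_range("2023-01-02 09:30", periods=300, freq="min")
rng = np.random.default_rng(0)
prices = pd.Series(100 + rng.standard_normal(300).cumsum(), index=idx)

# Target rows (stand-in for df_x)
df_x = pd.DataFrame({
    "period_end_idx": idx[[0, 60, 120]],
    "high target": 101.0,
    "low target": 99.0,
})

def high_after_low(row, minute_series):
    # Condensed version of efficient_high_after_low above
    after = minute_series.loc[row["period_end_idx"] + pd.Timedelta(minutes=1):]
    highs = after.ge(row["high target"])
    lows = after.le(row["low target"])
    if len(after) == 0 or (not highs.any() and not lows.any()):
        return None    # no data after the period, or neither target reached
    if not highs.any():
        return True    # only the low target was ever reached
    if not lows.any():
        return False   # only the high target was ever reached
    hi_t, lo_t = highs.idxmax(), lows.idxmax()  # first timestamp of each hit
    return hi_t > lo_t if hi_t != lo_t else None

result = df_x.apply(high_after_low, axis=1, args=(prices,))
```

This reproduces the per-row scan I described; scaling the synthetic data up shows the same linear cost in the number of rows of df_x.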