I have two pandas DataFrames, df_x and df_y. df_x has two columns, 'high target' and 'low target'. For every row of df_x, I would like to search through df_y and determine whether the 'high target' was reached before the 'low target'. Currently I implement this with .apply, but my code is too slow, since its cost scales linearly with the number of rows in df_x. Any suggestions to optimize/vectorize it?
def efficient_high_after_low(row, minute_df):
    """True if the high target was reached after the low target, else False.

    Args:
        row: a row of df_x (a pandas Series) with 'high target',
            'low target', and 'period_end_idx'.
        minute_df: the minute-level price Series to search through.
    """
    minute_df_after = minute_df.loc[row.period_end_idx + pd.Timedelta(minutes=1):]
    first_highs = minute_df_after.ge(row['high target'])
    first_lows = minute_df_after.le(row['low target'])
    hi_sum, lo_sum = first_highs.sum(), first_lows.sum()
    if len(first_highs) != len(first_lows):
        raise Exception('Unequal length of first_highs and first_lows')
    if len(first_highs) == 0:
        return None
    elif hi_sum == 0 and lo_sum != 0:
        return True
    elif hi_sum != 0 and lo_sum == 0:
        return False
    elif hi_sum == 0 and lo_sum == 0:
        return None
    elif first_highs.idxmax() > first_lows.idxmax():
        return True
    elif first_highs.idxmax() < first_lows.idxmax():
        return False
    else:
        return None
And I do the following to get these boolean values:
df_x.apply(efficient_high_after_low, axis=1, args=(df_y['open'],))
Running the code above on the first 1000 rows of df_x takes 4 seconds.
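For reference, here is a minimal synthetic setup that mirrors my data layout (the column names and the period_end_idx field come from the code above; the prices and sizes are made up, and high_after_low is just a condensed version of the same per-row logic):

```python
import numpy as np
import pandas as pd

# Minute-level price series (stand-in for df_y['open']); prices are synthetic
idx = pd.date_range("2023-01-02 09:30", periods=300, freq="min")
rng = np.random.default_rng(0)
prices = pd.Series(100 + rng.standard_normal(300).cumsum(), index=idx)

# Target rows (stand-in for df_x)
df_x = pd.DataFrame({
    "period_end_idx": idx[[0, 60, 120]],
    "high target": 101.0,
    "low target": 99.0,
})

def high_after_low(row, minute_series):
    # Condensed version of efficient_high_after_low above
    after = minute_series.loc[row["period_end_idx"] + pd.Timedelta(minutes=1):]
    highs = after.ge(row["high target"])
    lows = after.le(row["low target"])
    if len(after) == 0 or (not highs.any() and not lows.any()):
        return None    # no data after the period, or neither target reached
    if not highs.any():
        return True    # only the low target was ever reached
    if not lows.any():
        return False   # only the high target was ever reached
    hi_t, lo_t = highs.idxmax(), lows.idxmax()  # first timestamp of each hit
    return hi_t > lo_t if hi_t != lo_t else None

result = df_x.apply(high_after_low, axis=1, args=(prices,))
```

This reproduces the per-row scan I described; scaling the synthetic data up shows the same linear cost in the number of rows of df_x.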