I have a pandas DataFrame, df, with 4,000,000 timesteps for a single stock.
For each timestep, I want to determine whether the price rises 0.1% or falls 0.1% first. Right now I convert the DataFrame columns to NumPy arrays and loop over every timestep, from 0 to 4,000,000.
For each timestep, I iterate through the following timesteps until I find one where the price has moved 0.1% away from the starting price. If the price rose 0.1% first, the label is 1; if it fell 0.1% first, the label is 0. Because of the nested loop this is quadratic in the worst case, and it is taking a very long time.
Is it even possible to vectorize this? I tried to think of a dynamic programming solution to reduce the time complexity, but I'm not sure there is one.
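import numpy as np

target = 0.001  # 0.1% move, expressed as a fraction

# Pull the columns out as NumPy arrays for faster element access
high_bid = df['high_bid'].values
high_ask = df['high_ask'].values
low_bid = df['low_bid'].values
low_ask = df['low_ask'].values
open_bid = df['open_bid'].values
open_ask = df['open_ask'].values

# NaN means neither threshold was reached in the rows that follow
labels = np.full(len(df), np.nan)

for i in range(len(labels)-1):
    for j in range(i+1, len(labels)-1):
        # Rose 0.1%: a later high bid reaches the open ask plus the target
        if open_ask[i] + (open_ask[i]*target) <= high_bid[j]:
            labels[i] = 1
            break
        # Fell 0.1%: a later low ask drops to the open bid minus the target
        elif open_bid[i] - (open_bid[i]*target) >= low_ask[j]:
            labels[i] = 0
            break

df['direction'] = labels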
Example:
                  time  open_bid  open_ask  high_bid  high_ask  low_bid  low_ask  direction
0  2006-09-19 12:00:00   1.26606   1.26621   1.27063   1.27078  1.26504  1.26519          1
1  2006-09-19 13:00:00   1.27010   1.27025   1.27137   1.27152  1.26960  1.26975          1
2  2006-09-19 14:00:00   1.27076   1.27091   1.27158   1.27173  1.26979  1.26994          0
3  2006-09-19 15:00:00   1.27008   1.27023   1.27038   1.27053  1.26708  1.26723          0
4  2006-09-19 16:00:00   1.26816   1.26831   1.26821   1.26836  1.26638  1.26653          0
5  2006-09-19 17:00:00   1.26648   1.26663   1.26762   1.26777  1.26606  1.26621          1
6  2006-09-19 18:00:00   1.26756   1.26771   1.26781   1.26796  1.26733  1.26748        NaN
7  2006-09-19 19:00:00   1.26763   1.26778   1.26785   1.26800  1.26754  1.26769        NaN
8  2006-09-19 20:00:00   1.26770   1.26785   1.26825   1.26840  1.26765  1.26780        NaN
9  2006-09-19 21:00:00   1.26781   1.26796   1.26791   1.26806  1.26703  1.26718        NaN
I want to add that direction column for all 4 million rows.
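To make the problem easy to reproduce, here is a minimal, self-contained sketch of the same labeling rule applied to the ten sample rows above. The function name `label_direction` and the `target=0.001` default are my own choices for illustration (0.1% written as a fraction); the loop bounds deliberately mirror the code I posted. This is only a small reference for checking a faster solution against, not a fix for the speed problem.

```python
import numpy as np


def label_direction(open_bid, open_ask, high_bid, low_ask, target=0.001):
    """Return 1 where the price rises by `target` first, 0 where it falls
    first, and NaN where neither happens in the rows that follow.

    `label_direction` and `target` are illustrative names, not part of
    any library.
    """
    n = len(open_bid)
    labels = np.full(n, np.nan)
    for i in range(n - 1):
        up = open_ask[i] + open_ask[i] * target    # level a later high bid must reach
        down = open_bid[i] - open_bid[i] * target  # level a later low ask must drop to
        # Same bounds as the loop in the question (the last row is not scanned)
        for j in range(i + 1, n - 1):
            if up <= high_bid[j]:
                labels[i] = 1
                break
            elif down >= low_ask[j]:
                labels[i] = 0
                break
    return labels


# The ten sample rows from the example table
open_bid = np.array([1.26606, 1.27010, 1.27076, 1.27008, 1.26816,
                     1.26648, 1.26756, 1.26763, 1.26770, 1.26781])
open_ask = np.array([1.26621, 1.27025, 1.27091, 1.27023, 1.26831,
                     1.26663, 1.26771, 1.26778, 1.26785, 1.26796])
high_bid = np.array([1.27063, 1.27137, 1.27158, 1.27038, 1.26821,
                     1.26762, 1.26781, 1.26785, 1.26825, 1.26791])
low_ask = np.array([1.26519, 1.26975, 1.26994, 1.26723, 1.26653,
                    1.26621, 1.26748, 1.26769, 1.26780, 1.26718])

labels = label_direction(open_bid, open_ask, high_bid, low_ask)
print(labels)
```

On these rows the result matches the direction column in the table: rows 0 to 5 get 1, 1, 0, 0, 0, 1 and rows 6 to 9 stay NaN.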