
I have a pandas dataframe, df, with 4,000,000 timesteps for a single stock.

The task: for each timestep, determine whether the price rises 0.1% or falls 0.1% first. Right now I am converting the dataframe to numpy arrays and looping through the timesteps, from 0 to 4,000,000.

For each timestep, I iterate through the following timesteps until I find one where the price differs by 0.1%. If the price rose 0.1%, the label is 1; if it fell 0.1%, the label is 0. With the nested loop this is worst-case O(n²), and it is taking a very long time.

Is it even possible to vectorize this? I tried thinking of a dynamic programming solution to reduce the time complexity, but I'm not sure there is one.

import numpy as np

target = 0.001  # 0.1%

high_bid = df['high_bid'].values
high_ask = df['high_ask'].values
low_bid = df['low_bid'].values
low_ask = df['low_ask'].values
open_bid = df['open_bid'].values
open_ask = df['open_ask'].values
labels = np.empty(len(df))
labels[:] = np.nan

for i in range(len(labels)-1):
    for j in range(i+1, len(labels)):
        if (open_ask[i] + (open_ask[i]*target) <= high_bid[j]):
            labels[i] = 1
            break
        elif (open_bid[i] - (open_bid[i]*target) >= low_ask[j]):
            labels[i] = 0
            break
df['direction'] = labels

Example

                 time  open_bid  open_ask  high_bid  high_ask  low_bid  low_ask  direction
0 2006-09-19 12:00:00   1.26606   1.26621   1.27063   1.27078  1.26504  1.26519          1
1 2006-09-19 13:00:00   1.27010   1.27025   1.27137   1.27152  1.26960  1.26975          1
2 2006-09-19 14:00:00   1.27076   1.27091   1.27158   1.27173  1.26979  1.26994          0
3 2006-09-19 15:00:00   1.27008   1.27023   1.27038   1.27053  1.26708  1.26723          0
4 2006-09-19 16:00:00   1.26816   1.26831   1.26821   1.26836  1.26638  1.26653          0
5 2006-09-19 17:00:00   1.26648   1.26663   1.26762   1.26777  1.26606  1.26621          1
6 2006-09-19 18:00:00   1.26756   1.26771   1.26781   1.26796  1.26733  1.26748        NaN
7 2006-09-19 19:00:00   1.26763   1.26778   1.26785   1.26800  1.26754  1.26769        NaN
8 2006-09-19 20:00:00   1.26770   1.26785   1.26825   1.26840  1.26765  1.26780        NaN
9 2006-09-19 21:00:00   1.26781   1.26796   1.26791   1.26806  1.26703  1.26718        NaN

I want to add that direction column for all 4 million rows.
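
(If you want something to test against, the sample above can be rebuilt with something like this; the values are copied straight from the table.)

import pandas as pd

# rebuild the 10-row sample above for testing
df = pd.DataFrame({
    'time': pd.date_range('2006-09-19 12:00', periods=10, freq='H'),
    'open_bid': [1.26606, 1.27010, 1.27076, 1.27008, 1.26816,
                 1.26648, 1.26756, 1.26763, 1.26770, 1.26781],
    'open_ask': [1.26621, 1.27025, 1.27091, 1.27023, 1.26831,
                 1.26663, 1.26771, 1.26778, 1.26785, 1.26796],
    'high_bid': [1.27063, 1.27137, 1.27158, 1.27038, 1.26821,
                 1.26762, 1.26781, 1.26785, 1.26825, 1.26791],
    'high_ask': [1.27078, 1.27152, 1.27173, 1.27053, 1.26836,
                 1.26777, 1.26796, 1.26800, 1.26840, 1.26806],
    'low_bid':  [1.26504, 1.26960, 1.26979, 1.26708, 1.26638,
                 1.26606, 1.26733, 1.26754, 1.26765, 1.26703],
    'low_ask':  [1.26519, 1.26975, 1.26994, 1.26723, 1.26653,
                 1.26621, 1.26748, 1.26769, 1.26780, 1.26718],
})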

  • Can you please give an example input and output to test against? Almost certainly the `for` loops can be scrapped, but I would like something to test against :) – roganjosh Jun 04 '18 at 21:56
  • This question may be useful for creating the sample input and output: [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – OriolAbril Jun 04 '18 at 22:08
  • @roganjosh I added an example. – LooksMoneyStatus Jun 04 '18 at 22:39
  • You should consider NOT pulling all the values and iterating over them, but rather using `df.itertuples()` to get an iterator over the dataframe and pull the values row by row. – MrE Jun 04 '18 at 23:13

2 Answers


First solution to try: Cython. In a similar setting, I got a 20-90x speedup just by adding %%cython to my code.

In one Jupyter cell:

%load_ext Cython

In a second cell (the %%cython magic must be the first line of the cell):

%%cython
cimport numpy as np
import numpy as np

cpdef func(np.ndarray high_bid, np.ndarray high_ask, np.ndarray low_bid,
           np.ndarray low_ask, np.ndarray open_bid, np.ndarray open_ask,
           np.ndarray labels, double target):
    cdef Py_ssize_t i, j, n = len(labels)
    for i in range(n):
        for j in range(i+1, n):
            # The following is a copy-paste of the code in the question
            if (open_ask[i] + (open_ask[i]*target) <= high_bid[j]):
                labels[i] = 1
                break
            elif (open_bid[i] - (open_bid[i]*target) >= low_ask[j]):
                labels[i] = 0
                break

In another Jupyter cell, reusing the arrays and target defined in the question:

func(high_bid, high_ask, low_bid, low_ask, open_bid, open_ask, labels, target)

More optimisation

Here is an excellent introduction to Cython for pandas.

You can speed it up further by declaring the array element types (e.g. np.ndarray[double]).
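
As a rough sketch of what that typed version could look like (my variant, not from the answer: func_typed is a made-up name, the unused high_ask/low_bid arguments are dropped, and the thresholds are hoisted out of the inner loop):

%%cython
cimport numpy as np
import numpy as np

# typed version: declaring element types lets Cython emit C-level indexing
# instead of generic Python object calls in the hot loop
cpdef func_typed(np.ndarray[double] high_bid, np.ndarray[double] low_ask,
                 np.ndarray[double] open_bid, np.ndarray[double] open_ask,
                 np.ndarray[double] labels, double target):
    cdef Py_ssize_t i, j, n = len(labels)
    cdef double up, down
    for i in range(n):
        up = open_ask[i] * (1 + target)    # rise threshold for row i
        down = open_bid[i] * (1 - target)  # fall threshold for row i
        for j in range(i + 1, n):
            if high_bid[j] >= up:
                labels[i] = 1
                break
            elif low_ask[j] <= down:
                labels[i] = 0
                break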


Second solution: use cummax, cummin on high_bid, low_ask in reverse order

target = 0.001

# running max of high_bid and running min of low_ask from each row onwards
df['highest_bid_from_on'] = df.high_bid.sort_index(ascending=False).cummax().sort_index(ascending=True)
df['lowest_ask_from_on'] = df.low_ask.sort_index(ascending=False).cummin().sort_index(ascending=True)
df['direction'] = np.nan

# label 0 if low_ask from this row on drops 0.1% below the row's open_bid;
# label 1 if high_bid from this row on rises 0.1% above the row's open_ask
# (the 1 assignment runs last, so it wins when both occur)
df.loc[df.open_bid * (1 - target) >= df.lowest_ask_from_on, 'direction'] = 0
df.loc[df.open_ask * (1 + target) <= df.highest_bid_from_on, 'direction'] = 1
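
One thing to be aware of: this labels each row by whether a threshold is ever hit later, not which is hit first, and the running max/min include the current row. To see whether those cases matter in your data, you can brute-force compare against the original loop on a slice (a sketch, assuming target and the columns above are already defined):

import numpy as np

# rerun the original O(n^2) loop on a small slice and compare with the
# vectorized labels on the rows the loop managed to resolve
check = df.head(5000)
ob, oa = check['open_bid'].values, check['open_ask'].values
hb, la = check['high_bid'].values, check['low_ask'].values
loop_labels = np.full(len(check), np.nan)
for i in range(len(check) - 1):
    for j in range(i + 1, len(check)):
        if oa[i] * (1 + target) <= hb[j]:
            loop_labels[i] = 1
            break
        elif ob[i] * (1 - target) >= la[j]:
            loop_labels[i] = 0
            break
found = ~np.isnan(loop_labels)   # rows the loop resolved within the slice
vec = check['direction'].values
print('rows that differ:', int((loop_labels[found] != vec[found]).sum()))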
– phi

You can also try the expanding() window function, applied in the reverse direction, to calculate the max_future_high_bid and min_future_low_ask after each row:

# 0.1% increase/decrease
target = 0.001

# new column names
new_columns = ["max_future_high_bid", "min_future_low_ask"]

# reverse the frame, take the expanding max/min, reverse back, then
# shift(-1) so each row only sees strictly later rows
df[new_columns] = df[::-1].expanding(1) \
                          .agg({'high_bid': 'max', 'low_ask': 'min'})[::-1] \
                          .shift(-1)

# once you have these two values, you can calculate the direction with the apply() function
def get_direction(x):
    if x.max_future_high_bid >= (1 + target) * x.open_ask :
        return 1
    elif (1 - target) * x.open_bid  >= x.min_future_low_ask:
        return 0
    else:
        return None

# calculate the direction
df['direction'] = df.apply(get_direction, axis=1)
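
On 4 million rows, apply(axis=1) is itself slow because it calls a Python function per row. A vectorized equivalent of get_direction (a sketch using np.select, checking the conditions in the same order so the rise case wins when both hold) would be:

import numpy as np

# vectorized equivalent of get_direction: np.select picks the first true
# condition per row, mirroring the if/elif order above; rows matching
# neither condition get NaN, like the None branch
conditions = [
    df['max_future_high_bid'] >= (1 + target) * df['open_ask'],
    (1 - target) * df['open_bid'] >= df['min_future_low_ask'],
]
df['direction'] = np.select(conditions, [1.0, 0.0], default=np.nan)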
– jxc