Using pandas/Python, I want to calculate the longest increasing subsequence of tuples for each DTE group, but efficiently, with 13M rows. Right now, using apply/iteration, it takes about 10 hours.
Here's roughly my problem:
DTE | Strike | Bid | Ask |
---|---|---|---|
1 | 100 | 10 | 11 |
1 | 200 | 16 | 17 |
1 | 300 | 17 | 18 |
1 | 400 | 11 | 12 |
1 | 500 | 12 | 13 |
1 | 600 | 13 | 14 |
2 | 100 | 10 | 30 |
2 | 200 | 15 | 20 |
2 | 300 | 16 | 21 |
import pandas as pd

df = pd.DataFrame({
    'DTE': [1,1,1,1,1,1,2,2,2],
    'Strike': [100,200,300,400,500,600,100,200,300],
    'Bid': [10,16,17,11,12,13,10,15,16],
    'Ask': [11,17,18,12,13,14,30,20,21],
})
I would like to:

- Group these by `DTE`. Here we have two groups (DTE 1 and DTE 2). Then, within each group...
- Find the longest paired increasing subsequence. Sort order is determined by `Strike`, which is unique within each DTE group, so the 200 Strike comes after the 100 Strike.
- Thus, both the Bid and the Ask of the 200 Strike must be greater than or equal to (not strictly greater than) the Bid and Ask of the 100 Strike.
- Any strikes in between that do NOT have both Bid and Ask increasing in value are deleted.
In this case, the answer would be:
DTE | Strike | Bid | Ask |
---|---|---|---|
1 | 100 | 10 | 11 |
1 | 400 | 11 | 12 |
1 | 500 | 12 | 13 |
1 | 600 | 13 | 14 |
2 | 200 | 15 | 20 |
2 | 300 | 16 | 21 |
Only the LONGEST increasing subsequence is kept for EACH GROUP, not just any increasing subsequence. All other rows are dropped.
Note that the standard O(n log n) longest increasing subsequence algorithm does not work here. See https://www.quora.com/How-can-the-SPOJ-problem-LIS2-be-solved for why. The DTE 2 group in the example will break the standard O(n log n) LIS solution: its bids alone are increasing (10, 15, 16), but (Bid 10, Ask 30) cannot precede (Bid 15, Ask 20) because the Ask decreases. I am currently using the standard O(n^2) LIS solution. There is a more complicated O(n log^2 n) algorithm, but I do not think that is my bottleneck.
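For reference, here is a minimal pure-Python sketch of the O(n^2) DP I mean, applied to (Bid, Ask) pairs that are already sorted by Strike (the function and variable names are illustrative, not my production code):

def longest_nondecreasing_pairs(bids, asks):
    # best_len[i]: length of the best subsequence ending at i; prev[i]: its predecessor
    n = len(bids)
    best_len = [1] * n
    prev = [-1] * n
    for i in range(n):
        for j in range(i):
            if bids[j] <= bids[i] and asks[j] <= asks[i] and best_len[j] + 1 > best_len[i]:
                best_len[i] = best_len[j] + 1
                prev[i] = j
    # backtrack from the overall best endpoint
    i = max(range(n), key=lambda k: best_len[k])
    kept = []
    while i != -1:
        kept.append(i)
        i = prev[i]
    return kept[::-1]

# the DTE 2 group: only rows 1 and 2 (Strikes 200 and 300) survive
print(longest_nondecreasing_pairs([10, 15, 16], [30, 20, 21]))  # [1, 2]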
Since each row must refer to the previously computed longest-increasing-subsequence values of the rows before it, it seems you cannot do this in parallel, which means you can't vectorize it. Would that mean the only way to speed this up is Cython? Or are there other concurrent solutions?
My current solution looks like this:
class LongestLsc:
    # Simple holder for the running best subsequence (definition assumed here;
    # only these two attributes are used by the code below).
    def __init__(self):
        self.lsc_count = 0
        self.lsc_index = -1


def modify_lsc_row(row, df, longest_lsc):
    lsc_predecessor_count = 0
    lsc_predecessor_index = -1
    # candidate predecessors: already-processed rows whose Bid and Ask are both <= this row's
    df_predecessors = df[(df['Bid'] <= row.Bid) &
                         (df['Ask'] <= row.Ask) &
                         (df['lsc_count'] != -1)]

    if len(df_predecessors) > 0:
        # among the predecessors with the longest count, take the latest one
        df_predecessors = df_predecessors[(df_predecessors['lsc_count'] == df_predecessors['lsc_count'].max())]
        lsc_predecessor_index = df_predecessors.index.max()
        lsc_predecessor_count = df_predecessors.at[lsc_predecessor_index, 'lsc_count']

    new_predecessor_count = lsc_predecessor_count + 1
    df.at[row.name, 'lsc_count'] = new_predecessor_count
    df.at[row.name, 'prev_index'] = lsc_predecessor_index

    if new_predecessor_count >= longest_lsc.lsc_count:
        longest_lsc.lsc_count = new_predecessor_count
        longest_lsc.lsc_index = row.name


def longest_increasing_bid_ask_subsequence(df):
    original_columns = df.columns

    df.sort_values(['Strike'], ascending=True, inplace=True)
    df.set_index(['Strike'], inplace=True)
    assert df.index.is_unique

    longest_lsc = LongestLsc()
    longest_lsc.lsc_index = df.index.max()
    longest_lsc.lsc_count = 1

    df['lsc_count'] = -1

    # O(n^2): each row scans every already-processed row
    df.apply(lambda row: modify_lsc_row(row, df, longest_lsc),
             axis=1)

    # walk the predecessor chain back from the best row, marking the rows to keep
    while longest_lsc.lsc_index != -1:
        df.at[longest_lsc.lsc_index, 'keep'] = True
        longest_lsc.lsc_index = df.at[longest_lsc.lsc_index, 'prev_index']

    # rows never marked 'keep' have NaN there and are dropped
    df.dropna(inplace=True)

    return df.reset_index()[original_columns]


df_groups = df.groupby(['DTE'], group_keys=False, as_index=False)
# apply works on copies of each group, so the filtered result has to be captured
df = df_groups.apply(longest_increasing_bid_ask_subsequence)
Update: https://stackoverflow.com/users/15862569/alexander-volkovsky mentioned that I can use pandarallel to parallelize over each DTE, since the groups are independent. That does speed it up by about 5x. However, I would like to speed it up much more (particularly the actual longest-increasing-subsequence computation). Separately, pandarallel doesn't seem to work from PyCharm (seems to be a known issue: https://github.com/nalepae/pandarallel/issues/76).
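Roughly, the pandarallel version is a one-line change to the driver above (a minimal sketch assuming pandarallel's default initialize(); parallel_apply is pandarallel's drop-in replacement for groupby apply):

from pandarallel import pandarallel

pandarallel.initialize()  # spawns worker processes, one per core by default

df_groups = df.groupby(['DTE'], group_keys=False, as_index=False)
# same per-group function as above, but each DTE group runs in a separate worker
df = df_groups.parallel_apply(longest_increasing_bid_ask_subsequence)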
Update: Used https://stackoverflow.com/users/15862569/alexander-volkovsky's suggestions, namely numba and numpy. Pandarallel actually slowed things down as the rest got faster and faster (probably due to its overhead), so I removed it. 10 hours -> 2.8 minutes. Quite the success. Some of the biggest wins came from converting the O(n^2) loop to numba, and from not using pandas groupby + apply, even just to call the numba function. I found that the time for groupby + apply == groupby + pd.concat, and you can remove the pd.concat by doing what Alexander said: just select the rows you want to keep at the end (instead of concatenating all the different df groups back together). Tons of other small optimizations, mostly discovered by using the line profiler.
Updated code as follows:
import numpy as np
from numba import njit


@njit
def set_list_indices(bids, asks, indices, indices_to_keep):
    entries = len(indices)

    lis_count = np.full(entries, 0)
    prev_index = np.full(entries, -1)

    longest_lis_count = -1
    longest_lis_index = -1

    for i in range(entries):
        # lis_count of every row whose Bid and Ask are both <= row i's, 0 for the rest
        predecessor_counts = np.where((bids <= bids[i]) & (asks <= asks[i]), lis_count, 0)
        # index of the last occurrence of the maximum count
        best_predecessor_index = len(predecessor_counts) - np.argmax(predecessor_counts[::-1]) - 1

        if best_predecessor_index < i:
            prev_index[i] = best_predecessor_index

        new_count = predecessor_counts[best_predecessor_index] + 1
        lis_count[i] = new_count

        if new_count >= longest_lis_count:
            longest_lis_count = new_count
            longest_lis_index = i

    # walk the predecessor chain back from the best row, flagging the global row indices to keep
    while longest_lis_index != -1:
        indices_to_keep[indices[longest_lis_index]] = True
        longest_lis_index = prev_index[longest_lis_index]


# necessary for the lis algo, and groupby will preserve the order
df = df.sort_values(['Strike'], ascending=True)
# necessary for rows that were dropped. need reindexing for the lis algo
df = df.reset_index(drop=True)

df_groups = df.groupby(['DTE'])

row_indices_to_keep = np.full(len(df.index), False, dtype=bool)

for name, group in df_groups:
    bids = group['Bid'].to_numpy()
    asks = group['Ask'].to_numpy()
    indices = group.index.to_numpy()
    set_list_indices(bids, asks, indices, row_indices_to_keep)

df = df.iloc[row_indices_to_keep]
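As a quick sanity check (assuming df was built from the sample data at the top and run through the code above), the surviving rows match the expected table:

# DTE 1 keeps Strikes 100/400/500/600, DTE 2 keeps 200/300
expected = {1: [100, 400, 500, 600], 2: [200, 300]}
for dte, group in df.groupby('DTE'):
    assert list(group['Strike']) == expected[dte]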