
Using pandas/Python, I want to calculate the longest increasing subsequence of (Bid, Ask) tuples within each DTE group, but efficiently, since I have 13M rows. Right now, using apply/iteration, it takes about 10 hours.

Here's roughly my problem:

DTE  Strike  Bid  Ask
  1     100   10   11
  1     200   16   17
  1     300   17   18
  1     400   11   12
  1     500   12   13
  1     600   13   14
  2     100   10   30
  2     200   15   20
  2     300   16   21
import pandas as pd

df = pd.DataFrame({
    'DTE': [1, 1, 1, 1, 1, 1, 2, 2, 2],
    'Strike': [100, 200, 300, 400, 500, 600, 100, 200, 300],
    'Bid': [10, 16, 17, 11, 12, 13, 10, 15, 16],
    'Ask': [11, 17, 18, 12, 13, 14, 30, 20, 21],
})

I would like to:

  • group these by DTE. Here we have two groups (DTE 1 and DTE 2). Then within each group...
  • find the longest paired increasing subsequence. Sort order is determined by Strike, which is unique within each DTE group, so the 200 Strike comes after the 100 Strike.
    • thus, the Bid and Ask of the 200 Strike must be greater than or equal to (not strictly greater than) the 100 Strike's Bid and Ask.
    • any strikes in between that do NOT have both Bid and Ask increasing in value are deleted.

In this case, the answer would be:

DTE  Strike  Bid  Ask
  1     100   10   11
  1     400   11   12
  1     500   12   13
  1     600   13   14
  2     200   15   20
  2     300   16   21

Only the LONGEST increasing subsequence is kept for EACH GROUP, not just any increasing subsequence. All other rows are dropped.
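
For reference, the rule above can be written as a small pure-Python sketch (an O(n^2) dynamic program per group; the name paired_lis is just for illustration, not the code I run at scale):

import pandas as pd

def paired_lis(group):
    """Keep the longest chain (in Strike order) where Bid AND Ask are both non-decreasing."""
    g = group.sort_values('Strike')
    bids, asks = g['Bid'].to_numpy(), g['Ask'].to_numpy()
    n = len(g)
    length = [1] * n   # best chain length ending at row i
    prev = [-1] * n    # predecessor of row i in that chain
    for i in range(n):
        for j in range(i):
            if bids[j] <= bids[i] and asks[j] <= asks[i] and length[j] + 1 > length[i]:
                length[i] = length[j] + 1
                prev[i] = j
    # backtrack from the end of the longest chain
    i = max(range(n), key=lambda k: length[k])
    keep = []
    while i != -1:
        keep.append(i)
        i = prev[i]
    return g.iloc[sorted(keep)]

# df.groupby('DTE', group_keys=False).apply(paired_lis) reproduces the expected table above.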

Note that the standard O(n log n) longest increasing subsequence algorithm does not work here. See https://www.quora.com/How-can-the-SPOJ-problem-LIS2-be-solved for why; the DTE 2 group above will fail under the standard O(n log n) LIS solution. I am currently using the standard O(n^2) LIS solution. There is a more complicated O(n log^2 n) algorithm, but I do not think that is my bottleneck.
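
To make the failure concrete with the DTE 2 rows (a small sketch; it only uses the example data above):

# DTE 2 rows as (Bid, Ask) tuples, in Strike order
seq = [(10, 30), (15, 20), (16, 21)]

# Under Python's lexicographic tuple ordering the whole sequence looks "increasing",
# so a standard LIS built on tuple < / <= reports length 3...
assert all(a <= b for a, b in zip(seq, seq[1:]))   # (10, 30) <= (15, 20) because 10 < 15

# ...but the problem needs BOTH coordinates to be non-decreasing, and Ask drops 30 -> 20:
assert not all(a[0] <= b[0] and a[1] <= b[1] for a, b in zip(seq, seq[1:]))

# The correct longest paired subsequence here is [(15, 20), (16, 21)], i.e. length 2.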

Since each row must refer to the previous rows' already-computed longest increasing subsequences, it seems this cannot be done in parallel, which means it cannot be vectorized. Does that mean the only way to speed this up is Cython, or are there other concurrent solutions?

My current solution looks like this:

class LongestLsc:
    # simple mutable holder for the best chain found so far
    # (assumed definition; the class was not shown in the original post)
    lsc_count = 0
    lsc_index = -1


def modify_lsc_row(row, df, longest_lsc):
    lsc_predecessor_count = 0
    lsc_predecessor_index = -1
    # candidate predecessors: already-processed rows whose Bid and Ask are both <= this row's
    df_predecessors = df[(df['Bid'] <= row.Bid) &
                         (df['Ask'] <= row.Ask) &
                         (df['lsc_count'] != -1)]
    if len(df_predecessors) > 0:
        # among the longest chains so far, take the predecessor with the highest Strike
        df_predecessors = df_predecessors[df_predecessors['lsc_count'] == df_predecessors['lsc_count'].max()]
        lsc_predecessor_index = df_predecessors.index.max()
        lsc_predecessor_count = df_predecessors.at[lsc_predecessor_index, 'lsc_count']

    new_predecessor_count = lsc_predecessor_count + 1
    df.at[row.name, 'lsc_count'] = new_predecessor_count
    df.at[row.name, 'prev_index'] = lsc_predecessor_index

    # track the longest chain seen so far in this group
    if new_predecessor_count >= longest_lsc.lsc_count:
        longest_lsc.lsc_count = new_predecessor_count
        longest_lsc.lsc_index = row.name

def longest_increasing_bid_ask_subsequence(df):
    original_columns = df.columns
    df.sort_values(['Strike'], ascending=True, inplace=True)

    df.set_index(['Strike'], inplace=True)
    assert df.index.is_unique

    longest_lsc = LongestLsc()
    longest_lsc.lsc_index = df.index.max()
    longest_lsc.lsc_count = 1

    # -1 marks rows whose chain length has not been computed yet
    df['lsc_count'] = -1

    df.apply(lambda row: modify_lsc_row(row, df, longest_lsc), axis=1)

    # walk back along prev_index, flagging the rows of the longest chain
    while longest_lsc.lsc_index != -1:
        df.at[longest_lsc.lsc_index, 'keep'] = True
        longest_lsc.lsc_index = df.at[longest_lsc.lsc_index, 'prev_index']

    # rows never flagged have NaN in 'keep' and are dropped here
    df.dropna(inplace=True)

    return df.reset_index()[original_columns]


df_groups = df.groupby(['DTE'], group_keys=False, as_index=False)
df = df_groups.apply(longest_increasing_bid_ask_subsequence)

Update: https://stackoverflow.com/users/15862569/alexander-volkovsky mentioned that I can use pandarallel to parallelize over DTE groups, since they are independent. That speeds it up by about 5x. However, I would like to speed it up much more (particularly the actual longest-increasing-subsequence computation). Separately, pandarallel doesn't seem to work in PyCharm (seems to be a known issue: https://github.com/nalepae/pandarallel/issues/76).

Update: I used https://stackoverflow.com/users/15862569/alexander-volkovsky's suggestions, namely numba and numpy. Pandarallel actually slowed things down as everything else got faster (probably due to its overhead), so I removed it. 10 hours -> 2.8 minutes. Quite the success. The biggest gains came from rewriting the O(n^2) loop with numba, and from not using pandas groupby().apply() even just to wrap the numba function. I found that the time for groupby + apply == groupby + pd.concat, and you can remove the pd.concat by doing what Alexander suggested: just select the rows you want to keep at the end, instead of concatenating all the group DataFrames together. Tons of other small optimizations were found with the line profiler.

Updated code as follows:

import numpy as np
from numba import njit


@njit
def set_list_indices(bids, asks, indices, indices_to_keep):
    entries = len(indices)

    lis_count = np.full(entries, 0)
    prev_index = np.full(entries, -1)

    longest_lis_count = -1
    longest_lis_index = -1

    for i in range(entries):
        # chain lengths of rows that are valid predecessors of row i
        # (Bid and Ask both <= row i's); everything else contributes 0
        predecessor_counts = np.where((bids <= bids[i]) & (asks <= asks[i]), lis_count, 0)
        # index of the last occurrence of the maximum count
        best_predecessor_index = len(predecessor_counts) - np.argmax(predecessor_counts[::-1]) - 1

        if best_predecessor_index < i:
            prev_index[i] = best_predecessor_index

        new_count = predecessor_counts[best_predecessor_index] + 1
        lis_count[i] = new_count

        if new_count >= longest_lis_count:
            longest_lis_count = new_count
            longest_lis_index = i

    # walk back along prev_index and mark the rows of the longest chain
    while longest_lis_index != -1:
        indices_to_keep[indices[longest_lis_index]] = True
        longest_lis_index = prev_index[longest_lis_index]

# necessary for lis algo, and groupby will preserve the order
df = df.sort_values(['Strike'], ascending=True)

# necessary for rows that were dropped. need reindexing for lis algo
df = df.reset_index(drop=True)

df_groups = df.groupby(['DTE'])

row_indices_to_keep = np.full(len(df.index), False, dtype=bool)
for name, group in df_groups:
    bids = group['Bid'].to_numpy()
    asks = group['Ask'].to_numpy()
    indices = group.index.to_numpy()
    set_list_indices(bids, asks, indices, row_indices_to_keep)

df = df.iloc[row_indices_to_keep]
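
As a quick sanity check on the small example frame (assuming df started as the 9-row example at the top), the kept rows match the expected output:

print(df.sort_values(['DTE', 'Strike']).to_string(index=False))
# DTE  Strike  Bid  Ask
#   1     100   10   11
#   1     400   11   12
#   1     500   12   13
#   1     600   13   14
#   2     200   15   20
#   2     300   16   21
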
asked by Erich Lin; edited by smci
  • Also, please provide the data in a consumable format – juanpa.arrivillaga May 26 '21 at 05:16
  • Does the `Strike` column contain duplicate values? – Corralien May 26 '21 at 05:50
  • @Corralien They do not. updated my post to reflect that – Erich Lin May 26 '21 at 16:06
  • @juanpa.arrivillaga added consumable format data – Erich Lin May 26 '21 at 16:06
  • Assuming the input file is large but static, seems to me you can do a huge amount of preprocessing (and in parallel) at read-time: you could set the index to be a multiindex of `(DTE,Strike)`. Then you could `groupby()` and get `diff(Bid), diff(Ask)`. Now you only need an apply or lambda to logical-and `diff(Bid)` with `diff(Ask)`, to reduce an entire row to a Boolean value. Then count length of runs, or even prune out runs of length < some heuristic. – smci May 27 '21 at 10:00
  • Do you want to find the longest paired increasing subsequence *for each DTE value*, or only the *single longest overall*? How many distinct values does DTE have? As mentioned, if you can guesstimate a heuristic lower-bound for longest subsequence, you could probably ignore most of the input. You can always iteratively lower the heuristic if it's too high. Really we need more of your 13M input file, or else please post random-seeded dummy data. – smci May 27 '21 at 10:04
  • @smci thank you. it would be for each DTE. those are great points you made. i am about to update the post with my solution. but i will definitely consider your thoughts if things slow down. for now, i got the speedup i wanted – Erich Lin May 30 '21 at 02:21
  • **Please upload your 13M input file, or else post code that generates random-seeded dummy data**. For example, what is the distribution of (length by) DTE values: is one DTE value responsible for most of the input length? We can't tell. Also, if you simply do a UNIX `sort` to presort the input on the DTE value, that should really accelerate things - for example, you could then use a chunked reader (with pushback) and deal with each DTE group in one chunk, without needing to store almost anything. – smci May 30 '21 at 03:31
  • @smci data is proprietary. DTE is fairly evenly spread throughout. Your suggestions sound great. And I will be sure to explore if I need to do so. Thank you! – Erich Lin May 30 '21 at 13:48

1 Answer


What is the complexity of your algorithm of finding the longest increasing subsequence?

This article provides an algorithm with O(n log n) complexity. (Upd: this doesn't work for this problem; see the comments below.) You don't even need to modify the code, because in Python comparison works for tuples: assert (1, 2) < (3, 4)

>>> seq=[(10, 11), (16, 17), (17, 18), (11, 12), (12, 13), (13, 14)]
>>> subsequence(seq)
[(10, 11), (11, 12), (12, 13), (13, 14)]

Since each row must refer to the previous rows to have already computed the longest increasing subsequence at that point, it seems you cannot do this in parallel?

Yes, but you can calculate the sequence in parallel for every DTE. You could try something like pandarallel for parallel aggregation after the .groupby().

from pandarallel import pandarallel
pandarallel.initialize()

# just an example of usage:
df.groupby("DTE").parallel_apply(subsequence)

Also, try to get rid of pandas (it's pretty slow) and use raw numpy arrays and Python structs. You can calculate the LIS indexes using an O(n^2) algorithm and then just select the required rows using df.iloc.
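
A minimal sketch of that idea (not the asker's final code; it assumes the column names from the question and uses plain numpy without numba):

import numpy as np
import pandas as pd

def lis_keep_mask(bids, asks):
    """Boolean mask of the rows on the longest chain where Bid and Ask are both non-decreasing (O(n^2))."""
    n = len(bids)
    length = np.ones(n, dtype=np.int64)
    prev = np.full(n, -1, dtype=np.int64)
    for i in range(n):
        for j in range(i):
            if bids[j] <= bids[i] and asks[j] <= asks[i] and length[j] + 1 > length[i]:
                length[i], prev[i] = length[j] + 1, j
    keep = np.zeros(n, dtype=bool)
    i = int(length.argmax())
    while i != -1:
        keep[i] = True
        i = prev[i]
    return keep

def filter_longest_chains(df):
    df = df.sort_values('Strike').reset_index(drop=True)
    mask = np.zeros(len(df), dtype=bool)
    for _, group in df.groupby('DTE'):
        mask[group.index.to_numpy()] = lis_keep_mask(group['Bid'].to_numpy(), group['Ask'].to_numpy())
    return df.iloc[mask]   # select the required rows at the end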

Alexander Volkovsky
  • `(2, 1) > (1, 2)` would be `True`, which wouldn't suit the OP's case. – Sayandip Dutta May 26 '21 at 15:52
  • Super useful. I updated my post to reflect that this helps. However, pandarallel doesn't seem to work in PyCharm when I group by multiple keys, though it works fine from the terminal (seems to be a known issue: https://github.com/nalepae/pandarallel/issues/76). – Erich Lin May 26 '21 at 16:08
  • @Cyttorak is correct. I updated my post to reflect that standard O(nlogn) does not work. https://www.quora.com/How-can-the-SPOJ-problem-LIS2-be-solved explains more, but I decided to use the standard O(n^2) algo. I would like to optimize the LIS algorithm itself. Added the parallel sharding of the DTE already, but still pretty slow. – Erich Lin May 26 '21 at 16:09
  • Agreed with both points. @ErichLin, try to get rid of pandas, it's very slow. Make a simple test: `a = np.arange(1000000); %timeit a.sum()` vs `b = pd.Series(a); %timeit b.sum()`. You can calculate LIS indexes using tuples and an O(n^2) algorithm and then just select the required rows using `df.iloc`. Also, Python is a rather slow language, so it's worth using cython/c/rust extensions for such a bottleneck. Some time ago I ran into performance issues with an XIRR calculation in Python and wrote a Rust extension that made it 70 (!) times faster. – Alexander Volkovsky May 27 '21 at 08:39
  • @AlexanderVolkovsky just updated my post with my now updated solution. mostly used everything you said. – Erich Lin May 30 '21 at 02:32