
Say I have two pandas dataframes (df_a & df_b), where each row represents a toy and features about that toy. Some pretend features:

  • Was_Sold (Y/N)
  • Color
  • Size_Group
  • Shape
  • Date_Made

Say df_a is relatively small (10s of thousands of rows) and df_b is relatively large (>1 million rows).

Then for every row in df_a, I want to:

  1. Find all the toys from df_b with the same type as the one from df_a (e.g. the same color group)
  2. The df_b toys must also be made before the given df_a toy
  3. Then find the ratio of those that were sold (i.e. count sold / count of all matched)

What is the most efficient means to make those per-row calculations above?

The best I've come up with so far is something like the below. (Note: the code might have an error or two, as I'm rough-typing it from a different use case.)

cols = ['Color', 'Size_Group', 'Shape']

# Run this calculation for multiple features
for col in cols:
    
    print(col + ' - Started')
    
    # Empty list to build up the calculation in
    ratio_list = []
    
    # Start the iteration
    for row in df_a.itertuples(index=False):
        
        # Relevant values from df_a
        relevant_val = getattr(row, col)
        created_date = row.Date_Made
        
        # df to keep the overall prior toy matches
        prior_toys = df_b[(df_b.Date_Made < created_date) & (df_b[col] == relevant_val)]
        prior_count = len(prior_toys)

        # Now find the ones that were sold
        prior_sold_count = len(prior_toys[prior_toys.Was_Sold == "Y"])
                         
        # Now make the calculation and append to the list
        if prior_count == 0:
            ratio = 0
        else:
            ratio = prior_sold_count / prior_count
        ratio_list.append(ratio)
        
    # Store the calculation in the original df_a
    df_a[col + '_Prior_Sold_Ratio'] = ratio_list
    print(col + ' - Finished')

Using .itertuples() is useful, but this is still pretty slow. Is there a more efficient method or something I'm missing?

EDIT: Added the below script, which will generate emulated data for the above scenario:

import numpy as np
import pandas as pd

colors = ['red', 'green', 'yellow', 'blue']
sizes = ['small', 'medium', 'large']
shapes = ['round', 'square', 'triangle', 'rectangle']
sold = ['Y', 'N']
size_df_a = 200
size_df_b = 2000

date_start = pd.to_datetime('2015-01-01')
date_end = pd.to_datetime('2021-01-01')

def random_dates(start, end, n=10):

    start_u = start.value//10**9
    end_u = end.value//10**9

    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

df_a = pd.DataFrame(
    {
    'Color': np.random.choice(colors, size_df_a),
    'Size_Group': np.random.choice(sizes, size_df_a),
    'Shape': np.random.choice(shapes, size_df_a),
    'Was_Sold': np.random.choice(sold, size_df_a),
    'Date_Made': random_dates(date_start, date_end, n=size_df_a)
    }
    )

df_b = pd.DataFrame(
    {
    'Color': np.random.choice(colors, size_df_b),
    'Size_Group': np.random.choice(sizes, size_df_b),
    'Shape': np.random.choice(shapes, size_df_b),
    'Was_Sold': np.random.choice(sold, size_df_b),
    'Date_Made': random_dates(date_start, date_end, n=size_df_b)
    }
    )
Josh
  • @Ugurcan I know that is the ideal case, but I don't see how I could vectorize the above kind of logic. – Josh Dec 14 '21 at 20:09
  • It is hard to help you since we do not have an example dataframe, nor the column types, and the provided example is not totally complete. Having a minimal *reproducible* *working* example would help us a lot (and so you, in the end). The `Some other conditions left out for simplicity` part is a problem because the hidden conditions may not work with a future provided answer – Jérôme Richard Dec 14 '21 at 20:56
  • @JérômeRichard I'll edit out the "other conditions" bit. My point with those was to emphasize I might need to add more conditions in, but we can consider that unimportant. As far as column types go, I figured they were implied by the example names provided (string and date). I can't provide the real data (and it's far more complex), but I'll see if I can provide a script to generate emulated toy data in a few. – Josh Dec 14 '21 at 22:57

1 Answer


First of all, I think your computation would be much more efficient using a relational database and an SQL query. Indeed, the filters can be done by indexing columns, performing a database join, applying some advanced filtering, and counting the result. An optimized relational database can generate an efficient algorithm from a simple SQL query (hash-based row grouping, binary search, fast set intersection, etc.). Pandas is sadly not very good at performing advanced requests like this efficiently. It is also very slow to iterate over a pandas dataframe, although I am not sure this can be alleviated in this case using only pandas. Hopefully you can use some NumPy and Python tricks and (partially) implement what fast relational database engines would do.
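
For illustration only, here is a minimal sketch of what that could look like with an in-memory SQLite database for the Color feature (the table names, index, and query are hypothetical, not part of the original approach):

import sqlite3
import pandas as pd

# Hypothetical sketch: load both frames into an in-memory SQLite database
con = sqlite3.connect(':memory:')
df_a.to_sql('toys_a', con, index=False)
df_b.to_sql('toys_b', con, index=False)

# An index on (Color, Date_Made) lets the engine avoid scanning all of toys_b for every row
con.execute('CREATE INDEX idx_b_color_date ON toys_b (Color, Date_Made)')

# For each toy in toys_a: ratio of earlier toys_b toys with the same Color that were sold.
# Unmatched rows from the LEFT JOIN contribute 0, matching the original loop's convention.
query = """
    SELECT a.rowid AS a_row,
           AVG(CASE WHEN b.Was_Sold = 'Y' THEN 1.0 ELSE 0.0 END) AS Color_Prior_Sold_Ratio
    FROM toys_a AS a
    LEFT JOIN toys_b AS b
      ON b.Color = a.Color AND b.Date_Made < a.Date_Made
    GROUP BY a.rowid
    ORDER BY a.rowid
"""
ratios = pd.read_sql(query, con)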

Additionally, pure-Python object types are slow, especially (unicode) strings. Thus, converting column types to efficient ones in the first place can save a lot of time (and memory). For example, there is no need for the Was_Sold column to contain "Y"/"N" string objects: a boolean can be used instead. So let us convert it:

df_b.Was_Sold = df_b.Was_Sold == "Y"
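
In the same spirit, an optional sketch (not required for the approach below): the low-cardinality string feature columns could also be stored with the pandas category dtype, which typically reduces memory use and can speed up comparisons and grouping:

# Optional: store the low-cardinality string columns as categoricals
for c in ['Color', 'Size_Group', 'Shape']:
    df_a[c] = df_a[c].astype('category')
    df_b[c] = df_b[c].astype('category')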

Finally, the current algorithm has a bad complexity: O(Na * Nb), where Na is the number of rows in df_a and Nb is the number of rows in df_b. This is not easy to improve, though, due to the non-trivial conditions. A first solution is to group df_b by the col column ahead of time, so as to avoid an expensive complete iteration over df_b (previously done with df_b[col] == relevant_val). Then, the dates within each precomputed group can be sorted so that a fast binary search can be performed later. Finally, you can use NumPy to count boolean values efficiently (using np.sum).

Note that doing prior_toys['Was_Sold'] is a bit faster than prior_toys.Was_Sold.
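
If you want to check that claim on your own data, a quick hypothetical timing sketch with the standard timeit module:

import timeit

# Compare bracket-style vs attribute-style column access overhead
print(timeit.timeit(lambda: df_b['Was_Sold'], number=100_000))
print(timeit.timeit(lambda: df_b.Was_Sold, number=100_000))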

Here is the resulting code:

import numpy as np

cols = ['Color', 'Size_Group', 'Shape']

# Run this calculation for multiple features
for col in cols:
    print(col + ' - Started')
    
    # Empty list to build up the calculation in
    ratio_list = []

    # Split df_b by col and sort each (indexed) group by date
    colGroups = {grId: grDf.sort_values('Date_Made') for grId, grDf in df_b.groupby(col)}

    # Start the iteration
    for row in df_a.itertuples(index=False):
        # Relevant values from df_a
        relevant_val = getattr(row, col)
        created_date = row.Date_Made
        
        # df to keep the overall prior toy matches
        curColGroup = colGroups[relevant_val]
        prior_count = np.searchsorted(curColGroup['Date_Made'], created_date)
        prior_toys = curColGroup[:prior_count]

        # Now find the ones that were sold
        prior_sold_count = prior_toys['Was_Sold'].values.sum()

        # Now make the calculation and append to the list
        if prior_count == 0:
            ratio = 0
        else:
            ratio = prior_sold_count / prior_count
        ratio_list.append(ratio)
        
    # Store the calculation in the original df_a
    df_a[col + '_Prior_Sold_Ratio'] = ratio_list
    print(col + ' - Finished')

This is 5.5 times faster on my machine.

The iteration over the pandas dataframe is a major source of slowdown. Indeed, prior_toys['Was_Sold'] takes half of the computation time because of the huge overhead of pandas internal function calls repeated Na times... Using Numba may help to reduce the cost of the slow iteration. Note that the complexity can be improved by splitting colGroups into subgroups ahead of time (giving O(Na log Nb)). This should help to completely remove the overhead of computing prior_sold_count. The resulting program should be about 10 times faster than the original one.
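
As an illustration of that last idea, here is a minimal sketch (assuming Was_Sold has already been converted to booleans as above; the variable names are illustrative): per group, the dates are sorted and a cumulative count of sold toys is precomputed, so each df_a row only needs a single binary search.

import numpy as np

cols = ['Color', 'Size_Group', 'Shape']

for col in cols:
    # Precompute, for each group value: sorted dates and the running count of sold toys
    groups = {}
    for grId, grDf in df_b.groupby(col):
        grDf = grDf.sort_values('Date_Made')
        dates = grDf['Date_Made'].values                   # datetime64[ns] array
        sold_cumsum = np.cumsum(grDf['Was_Sold'].values)   # Was_Sold is already boolean
        groups[grId] = (dates, sold_cumsum)

    ratios = np.zeros(len(df_a))
    for i, row in enumerate(df_a.itertuples(index=False)):
        # Assumes every df_a value appears in df_b, as in the emulated data
        dates, sold_cumsum = groups[getattr(row, col)]
        # Number of prior toys = position of the row's date in the sorted dates
        prior_count = np.searchsorted(dates, row.Date_Made.to_datetime64())
        if prior_count > 0:
            ratios[i] = sold_cumsum[prior_count - 1] / prior_count

    df_a[col + '_Prior_Sold_Ratio'] = ratios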

Jérôme Richard
  • I had originally started the problem in SQL but found that the query plan still created monstrous combinations in memory. I'm not convinced that's viable there without iterating as well. The pre-computed subgroups, though, are an absolutely great take, so accepting. Two follow-ups though: 1) why the difference in `['Was_Sold']` vs `.Was_Sold`? 2) I don't see how `prior_toys['Was_Sold']` is called `Na` times? (I believe in your notation `Na` was from `df_a`, but `prior_toys['Was_Sold']` should be coming from a filtered version of `df_b` (unless you just mean `Na` times because of the loop?) – Josh Dec 15 '21 at 20:10
  • `['Was_Sold']` is faster probably because runtime-defined attributes introduce an overhead in the interpreter itself (much code has to be executed and more function calls are made). For (2), you iterate over `df_a`, so every instruction in the loop is called `Na` times. I think the sum over `prior_toys['Was_Sold']` runs in `O(Nb)` since `prior_toys` comes from `df_b` and I think the filtering does not impact the complexity. It might be much smaller in practice. I just computed an upper-bound complexity. It is not easy to find the lower bound. The amortized complexity may be better (not easy to compute). – Jérôme Richard Dec 15 '21 at 23:52
  • understood! I didn't realize the . vs bracket notation had such differences. I'll read more and thanks! – Josh Dec 16 '21 at 14:35