Say I have two pandas dataframes (df_a & df_b), where each row represents a toy and features about that toy. Some pretend features:
- Was_Sold (Y/N)
- Color
- Size_Group
- Shape
- Date_Made
Say df_a is relatively small (10s of thousands of rows) and df_b is relatively large (>1 million rows).
Then for every row in df_a, I want to:
- Find all the toys from df_b with the same type as the one from df_a (e.g. the same color group)
- The df_b toys must also be made before the given df_a toy
- Then find the ratio of those sold (So count sold / count all matched)
What is the most efficient means to make those per-row calculations above?
The best I've came up with so far is something like the below. (Note code might have an error or two as I'm rough typing from a different use case)
cols = ['Color', 'Size_Group', 'Shape']
# Run this calculation for multiple features
for col in cols:
print(col + ' - Started')
# Empty list to build up the calculation in
ratio_list = []
# Start the iteration
for row in df_a.itertuples(index=False):
# Relevant values from df_a
relevant_val = getattr(row, col)
created_date = row.Date_Made
# df to keep the overall prior toy matches
prior_toys = df_b[(df_b.Date_Made < created_date) & (df_b[col] == relevant_val)]
prior_count = len(prior_toys)
# Now find the ones that were sold
prior_sold_count = len(prior_toys[prior_toys.Was_Sold == "Y"])
# Now make the calculation and append to the list
if prior_count == 0:
ratio = 0
else:
ratio = prior_sold_count / prior_count
ratio_list.append(ratio)
# Store the calculation in the original df_a
df_a[col + '_Prior_Sold_Ratio'] = ratio_list
print(col + ' - Finished')
Using .itertuples()
is useful, but this is still pretty slow. Is there a more efficient method or something I'm missing?
EDIT Added the below script which will emulated data for the above scenario:
import numpy as np
import pandas as pd
colors = ['red', 'green', 'yellow', 'blue']
sizes = ['small', 'medium', 'large']
shapes = ['round', 'square', 'triangle', 'rectangle']
sold = ['Y', 'N']
size_df_a = 200
size_df_b = 2000
date_start = pd.to_datetime('2015-01-01')
date_end = pd.to_datetime('2021-01-01')
def random_dates(start, end, n=10):
start_u = start.value//10**9
end_u = end.value//10**9
return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
df_a = pd.DataFrame(
{
'Color': np.random.choice(colors, size_df_a),
'Size_Group': np.random.choice(sizes, size_df_a),
'Shape': np.random.choice(shapes, size_df_a),
'Was_Sold': np.random.choice(sold, size_df_a),
'Date_Made': random_dates(date_start, date_end, n=size_df_a)
}
)
df_b = pd.DataFrame(
{
'Color': np.random.choice(colors, size_df_b),
'Size_Group': np.random.choice(sizes, size_df_b),
'Shape': np.random.choice(shapes, size_df_b),
'Was_Sold': np.random.choice(sold, size_df_b),
'Date_Made': random_dates(date_start, date_end, n=size_df_b)
}
)