Below first two table is cat1
and cat2
table
basically, I hope to left join cat1
with cat2
, where cat1
date is between start/end date of cat2
.
Last table is the outcome, how should I achieve this efficiently? Note that I have 100k records in real table.
Btw, If we have Date of cat1
falling in two StartDate & EndDate (two rows of cat2
), then it will ended up having two rows in the output table.
Please find the code below for cat1
and cat2
import pandas as pd
from datetime import date
cat1=pd.DataFrame({
"CatID":[1, 2, 3],
"Date":[date(2019,10,1), date(2019,11,1), date(2019,12,1)]
})
cat2=pd.DataFrame({
"Interesting":[1, 2, 3,4],
"StartDate":[date(2019,10,1), date(2019,11,1),date(2019,11,1), date(2018,10,6)],
"EndDate":[date(2019,10,30), date(2019,11,30),date(2019,11,30), date(2018,10,6)]
})
I have tried the solution here How to merge two dataframe with column in overlapping datetime range
# df2[["start"]] is a column vector of size m and df1.timestamp.values is row
## vector of size n then broad cast will result matrix of shape m,n which is
## result of comparing each pair of m and n
compare = (cat2[["StartDate"]].values<=cat1.Date.values) & (cat2[["EndDate"]].values>=cat1.Date.values)
## get cell numbers which is in range 0 to matrix size which meets the condition
ind = np.arange(len(cat1)*len(cat2))[compare.ravel()]
## calculate row and column index from cell number
pd.concat([cat2.iloc[ind//len(cat1)].reset_index(drop=True), cat1.iloc[ind%len(cat1)].reset_index(drop=True)], axis=1, sort=False)
Which I got
it is not what I want, as it lacks of the last row (as what I need is left join).