1

Below first two table is cat1 and cat2 table basically, I hope to left join cat1 with cat2, where cat1 date is between start/end date of cat2. Last table is the outcome, how should I achieve this efficiently? Note that I have 100k records in real table. Btw, If we have Date of cat1 falling in two StartDate & EndDate (two rows of cat2), then it will ended up having two rows in the output table.

enter image description here

Please find the code below for cat1 and cat2

import pandas as pd
from datetime import date

cat1=pd.DataFrame({
    "CatID":[1, 2, 3],
    "Date":[date(2019,10,1), date(2019,11,1), date(2019,12,1)]
})

cat2=pd.DataFrame({
    "Interesting":[1, 2, 3,4],
    "StartDate":[date(2019,10,1), date(2019,11,1),date(2019,11,1), date(2018,10,6)],
    "EndDate":[date(2019,10,30), date(2019,11,30),date(2019,11,30), date(2018,10,6)]
})

I have tried the solution here How to merge two dataframe with column in overlapping datetime range

# df2[["start"]] is a column vector of size m and df1.timestamp.values is row 
## vector of size n then broad cast will result matrix of shape m,n which is 
## result of comparing each pair of m and n
compare = (cat2[["StartDate"]].values<=cat1.Date.values) & (cat2[["EndDate"]].values>=cat1.Date.values)

## get cell numbers which is in range 0 to matrix size which meets the condition
ind = np.arange(len(cat1)*len(cat2))[compare.ravel()]


## calculate row and column index from cell number
pd.concat([cat2.iloc[ind//len(cat1)].reset_index(drop=True), cat1.iloc[ind%len(cat1)].reset_index(drop=True)], axis=1, sort=False)

Which I got

enter image description here

it is not what I want, as it lacks of the last row (as what I need is left join).

william007
  • 17,375
  • 25
  • 118
  • 194

2 Answers2

0

You can use a cross join and series.between():

#convert to datetime
cat1['Date']=pd.to_datetime(cat1['Date'])

cat2[['StartDate','EndDate']]=cat2[['StartDate','EndDate']].apply(pd.to_datetime)

m=cat1.assign(k=1).merge(cat2.assign(k=1)).drop('k',1)
final=(m[m.Date.between(m.StartDate,m.EndDate)]
  .merge(cat1,on=['Date','CatID'],how='right').drop(['StartDate','EndDate'],1))

   CatID       Date  Interesting
0      1 2019-10-01          1.0
1      2 2019-11-01          2.0
2      2 2019-11-01          3.0
3      3 2019-12-01          NaN
anky
  • 74,114
  • 11
  • 41
  • 70
  • Thanks, I tried your solution with two rows in cat2 that matched in cat1, and it throws error, please find the code above in the question. – william007 Nov 20 '19 at 14:10
  • @william007 yes because of the duplicate date range, what would be your expected output here(since you have 2 same date range with different `Interesting` number )? can you update the question as well, thanks – anky Nov 20 '19 at 14:19
  • 1
    Hi anky_91, I have updated the questions with desired output. – william007 Nov 20 '19 at 14:26
  • @william007 I have updated my answer. though it may be a little slow, but can you test please? thank you – anky Nov 20 '19 at 16:41
0

What you produced is an 'inner' join. Build the right columns as a separate data frame before you concat them. Make sure that the number of rows are equal.

# df2[["start"]] is a column vector of size m and df1.timestamp.values is row 
## vector of size n then broad cast will result matrix of shape m,n which is 
## result of comparing each pair of m and n
compare = (cat2[["StartDate"]].values<=cat1.Date.values) & (cat2[["EndDate"]].values>=cat1.Date.values)


## get cell numbers which is in range 0 to matrix size which meets the condition
ind_old = np.arange(len(cat1)*len(cat2))[compare.ravel()]
ind_new = np.arange(len(cat1)*len(cat2))[compare.T.ravel()]
cat1_new = pd.DataFrame(np.nan, index=list(cat2.index), columns=cat1.columns)

## match the rows of cat1 to those of cat2
cat1_new.iloc[ind_new%len(cat2)] = cat1.iloc[ind_old%len(cat1)].reset_index(drop=True)


## calculate row and column index from cell number
print(pd.concat([cat2, cat1_new], axis=1, sort=False))

## concat 
print(pd.concat([cat2, cat1_new], axis=1, sort=False))

#    Interesting   StartDate     EndDate  CatID        Date
# 0            1  2019-10-01  2019-10-30    1.0  2019-10-01
# 1            2  2019-11-01  2019-11-30    2.0  2019-11-01
# 2            3  2019-11-01  2019-11-30    2.0  2019-11-01
# 3            4  2018-10-06  2018-10-06    NaN         NaN

magraf
  • 420
  • 5
  • 8