Efficient way to merge two large dataframes based on a condition

Question

I have two dataframes like as shown below. Already referred the posts here, here, here and here. Don't mark it as duplicate

id,id2,app_date
1,'A',20/3/2017
1,'A',28/8/2017
3,'B',18/10/2017
4,'C',15/2/2017

tf = pd.read_clipboard(sep=',')
tf['app_date'] = pd.to_datetime(tf['app_date'],dayfirst=True)

id,valid_from,valid_to,s_flag
1,20/1/2017,30/4/2017,0
1,28/11/2017,15/2/2018,1
1,18/12/2017,24/2/2018,0
2,15/7/2017,15/11/2017,1
2,2/2/2017,2/6/2017,0
2,11/5/2016,11/6/2016,1

df = pd.read_clipboard(sep=',')
df['valid_from'] = pd.to_datetime(df['valid_from'],dayfirst=True)
df['valid_to'] = pd.to_datetime(df['valid_to'],dayfirst=True)

I would like to do the below

a) check whether tf['app_date'] is within the df['valid_from'] and df['valid_to'] for matching id

b) If yes, then copy the column s_flag to tf dataframe for matching id

I tried the below but am not sure whether the below is efficient for million records plus dataframes

t1 = tf.merge(df, how = 'left',on=['id'])
t1 = t1.loc[(t1.app_date >= t1.valid_from) & (t1.app_date <= t1.valid_to),['id','s_flag','app_date']]
tf.merge(t1, how = 'inner',on=['id','app_date'])

While the above works in sample data, but in real data, for some records, I encounter issues like below

.

You can see that 9/1/2017 approval date doesn't meet the condition for the 2nd and 3rd row but it is still returned as output. This is incorrect.

I expect my output to be like as shown below

    id  app_date    s_flag
0   1   2017-03-20  0.0
2   3   2017-10-18  NaN
3   4   2017-02-15  NaN

"not sure whether it is fool proof and will work for scenarios (huge datasets)" - test it and find out. It's not that hard to generate fake data with millions of rows. I don't see anything wrong with the code you've got. — John Zwinck, Mar 13 '22 at 10:21
btw, I already tried in real dataframe worth million plus records and it's just loading but no output yet. — The Great, Mar 13 '22 at 10:31
The issue is the input dataframe has repeating `ids` but different `id2s`. — The Great, Mar 13 '22 at 11:12

Efficient way to merge two large dataframes based on a condition

0 Answers0