0

I have two pandas dataframes containing IDs and birth dates. The birth dates in dataframe 1 are only an estimation so in order to capture as many similarities as possible I am looking to match IDs in Dataframe 2 that are within 45 days of the birthdate in Dataframe 1. Example as follows:

df_A:
ID     birth_date
A_2    1-1-1945
A_3    1-1-1967
A_4    1-1-2000
A_5    1-1-1994
A_6    2-1-1994

df_B:
ID     birth_date
B_1    2-2-1945
B_2    12-25-1944
B_3    1-5-2000
B_4    1-7-1994

Expected Output:
ID_1     ID_2
A_2      B_1
A_2      B_2
A_4      B_3
A_5      B_4
A_6      B_4

Edited in more example to fully show what I would like to receive.

cbelmon
  • 1
  • 2
  • Do you need only the **single** closest match, or **all** matches within 45 days? `merge_asof` is 1:1 so it cannot do the latter. – ALollz Sep 12 '19 at 16:01
  • It needs to be all matches within 45 days. Hence I have in my example results A_2 matching with two different values from df_B. – cbelmon Sep 12 '19 at 17:22

2 Answers2

2

This is merge_asof with tolerance:

pd.merge_asof(df2.sort_values('birth_date'),df1.sort_values('birth_date'),
on='birth_date',tolerance=pd.Timedelta(45,unit='D'),direction='nearest'
         ,suffixes=('_2','_1')).drop('birth_date',1)

  ID_2 ID_1
0  B_2  A_2
1  B_1  A_2
2  B_3  A_4

Note, please convert the birth_date columns to datetime first:

df1.birth_date=pd.to_datetime(df1.birth_date)
df2.birth_date=pd.to_datetime(df2.birth_date)
anky
  • 74,114
  • 11
  • 41
  • 70
  • Thanks, but this looks to be 1:1 only. I need it so that I get all potential matches within 45 days. – cbelmon Sep 12 '19 at 17:24
0

I think I figured it out myself. Using the answer I found here: cartesian product in pandas

My code looks like this now:

df_A.loc[:, 'key'] = 1
df_B.loc[:, 'key'] = 1

cart = pd.merge(df_A, df_B, on='key') [['df_A_ID', 'df_A_birth_date', 'df_B_ID', 'df_B_birth_date']]

cart = cart[(cart['df_B_birth_date'] < cart['df_A_birth_date'] + pd.to_timedelta(45, unit='d')) 
            & (cart['df_B_birth_date'] > cart['df_A_birth_date'] + pd.to_timedelta(-45, unit='d'))].sort_values(['df_A_ID']).reset_index().drop(columns={'index'})

cbelmon
  • 1
  • 2