Need a way to do many-to-one merge with pandas.merge_asof()

Question

I have a similar question to that of the post listed at the following link: pandas merging based on a timestamp which do not match exactly

However I need to do a match of many-to-one while having the functionality of pandas.merge_asof().

I have two dataframes, df1 & df2.

import pandas as pd
import numpy as np
from io import StringIO

dtc = [['CALL_DATE']]
df1 = pd.read_csv(StringIO(u'''
CALL_DATE,customer,status
2017-01-03 14:12:58,70892,P
2017-01-06 20:00:25,70892,P
2017-01-07 09:42:58,70892,X
2017-01-03 13:56:41,70928,N
2017-01-07 15:16:26,70928,C
2017-01-03 15:39:11,71075,U
2017-01-03 15:46:29,71075,N
'''))

df2 = pd.read_csv(StringIO(u'''
CALL_DATE,customer,Note
2017-01-03 14:09:00,70892,Call to return
2017-01-06 19:59:00,70892,Wrong Item shipped
2017-01-07 09:36:00,70892,Survey denied
2017-01-03 13:56:00,70928,TGGT
2017-01-03 13:53:00,70928,Open issue
2017-01-03 13:56:00,70928,No Record of listings
2017-01-07 15:15:00,70928,Need Translator
2017-01-07 15:16:00,70928,rescheduled appointment 
2017-01-03 15:39:11,71075,New Contact
2017-01-03 15:46:29,71075,open membership
2017-01-03 15:46:29,71075,recurring delivery scheduled 
'''))

df1['CALL_DATE'] = pd.to_datetime(df1['CALL_DATE'], format = '%Y-%m-%d %H:%M:%S')
df2['CALL_DATE'] = pd.to_datetime(df2['CALL_DATE'], format = '%Y-%m-%d %H:%M:%S')

These two data frames need to be merged with the ending results being something similar to what follows:

df3 = pd.read_csv(StringIO(u'''
2017-01-03 14:12:58,70892,P,2017-01-03 14:09:00,Call to return
2017-01-06 20:00:25,70892,P,2017-01-06 19:59:00,Wrong Item shipped
2017-01-07 09:42:58,70892,P,2017-01-07 09:36:00,Survey denied
2017-01-03 13:56:41,70928,N,2017-01-03 13:56:00,TGGT 
2017-01-03 13:56:41,70928,N,2017-01-03 13:53:00,Open issue
2017-01-03 13:56:41,70928,N,2017-01-03 13:56:00,70928,No Record of listings
2017-01-07 15:16:26,70928,C,2017-01-07 15:15:00,Need Translator
2017-01-07 15:16:26,70928,C,2017-01-07 15:16:00,rescheduled appointment
2017-01-03 15:39:11,71075,U,2017-01-03 15:39:11,New Contact
2017-01-03 15:46:29,71075,N,2017-01-03 15:46:29,open membership
2017-01-03 15:46:29,71075,N,2017-01-03 15:46:29,recurring delivery schedule
'''))

In the sample data provided the times differences are really small, but there are plenty of cases when the time difference can be several hours almost a whole day. I am trying to match a note with the closest customer entry for that customer. Also df2 entries can come before or after (time-wise) df1 entries.

When I do pandas.merge_asof(), it is just doing a one-to-one merge and I am losing notes that should go with a customer's file.

`df3` row 2 (0-based) for customer *70892*, why do you want to match *'2017-01-06 20:00:25'* with *'2017-01-07 09:36:00'* when there is a *'2017-01-07 09:42:58'* from `df1` which is a much closer match to *'2017-01-07 09:36:00'* in `df2`? — Indominus, Dec 14 '18 at 00:48
@Indominus, you are right, sorry that was my mistake when making the example on here. I have updated it with the corrections. Thanks for pointing that out. — nbas, Dec 14 '18 at 05:09

score 0 · Answer 1 · answered Dec 14 '18 at 17:24

Maybe all you had to do was switch the order of dataframes in your merge_asof call? Because this worked for me:

df1.sort_values(by='CALL_DATE', inplace=True)
df2.sort_values(by='CALL_DATE', inplace=True)

df1['STATUS_DATE'] = df1.CALL_DATE  # preserves times from df1

df3 = pd.merge_asof(df2, df1, on='CALL_DATE', by='customer', direction='nearest')

Calling print(df3) outputs (on my machine):

             CALL_DATE  customer                           Note status  \
0  2017-01-03 13:53:00     70928                     Open issue      N   
1  2017-01-03 13:56:00     70928                           TGGT      N   
2  2017-01-03 13:56:00     70928          No Record of listings      N   
3  2017-01-03 14:09:00     70892                 Call to return      P   
4  2017-01-03 15:39:11     71075                    New Contact      U   
5  2017-01-03 15:46:29     71075                open membership      N   
6  2017-01-03 15:46:29     71075  recurring delivery scheduled       N   
7  2017-01-06 19:59:00     70892             Wrong Item shipped      P   
8  2017-01-07 09:36:00     70892                  Survey denied      X   
9  2017-01-07 15:15:00     70928                Need Translator      C   
10 2017-01-07 15:16:00     70928       rescheduled appointment       C   

           STATUS_DATE  
0  2017-01-03 13:56:41  
1  2017-01-03 13:56:41  
2  2017-01-03 13:56:41  
3  2017-01-03 14:12:58  
4  2017-01-03 15:39:11  
5  2017-01-03 15:46:29  
6  2017-01-03 15:46:29  
7  2017-01-06 20:00:25  
8  2017-01-07 09:42:58  
9  2017-01-07 15:16:26  
10 2017-01-07 15:16:26

If the column order bugs you, you can always reorder the columns.

Thanks! It works. I feel like an idiot for not trying the simple switching the order of the input dataframes. — nbas, Dec 17 '18 at 12:56
hmmm, now it seems if I have records from df1 that are not linked to df2, then those records are lost. I still need to keep the records from df1. — nbas, Dec 17 '18 at 13:09
df3 is a dataframe that pairs entries from df1 and df2. so if df1 has entries with no corresponding df2 entry, what data should go in the df2 spot for those df3 entries? just NaN’s? — CrepeGoat, Dec 17 '18 at 17:17
actually, it would be swell if you could update the example in your question to cover that case, so I know what results you’re looking for. — CrepeGoat, Dec 17 '18 at 17:19

Need a way to do many-to-one merge with pandas.merge_asof()

1 Answers1