5

I'm trying to merge two Pandas dataframes on two columns. One column has a unique identifier that could be used to simply .merge() the two dataframes. However, the second column merge would actually use .merge_asof() because it would need to find the closest date, not an exact date match.

There is a similar question here: Pandas Merge on Name and Closest Date, but it was asked and answered nearly three years ago, and merge_asof() is a much newer addition.

I asked a similar here question a couple months ago, but the solution only needed to use merge_asof() without any exact matches required.

In the interest of including some code, it would look something like this:

df = pd.merge_asof(df1, df2, left_on=['ID','date_time'], right_on=['ID','date_time'])

where the ID's will match exactly, but the date_time's will be "near matches".

Any help is greatly appreciated.

Community
  • 1
  • 1
elPastor
  • 8,435
  • 11
  • 53
  • 81
  • I'm familiar with the guidelines, but those who know what I'm asking don't need anything more than what I've written. – elPastor Feb 17 '17 at 12:00
  • Could you please post your solution? I am facing the same issues. Thanks! – rp1 Apr 02 '19 at 01:23

2 Answers2

4

Consider merging first on the ID and then run a DataFrame.apply to return highest date_time from first dataframe on matched IDs less than the current row date_time from second dataframe.

# INITIAL MERGE (CROSS-PRODUCT OF ALL ID PAIRINGS)
mdf = pd.merge(df1, df2, on=['ID'])

def f(row):
    col = mdf[(mdf['ID'] == row['ID']) & 
              (mdf['date_time_x'] < row['date_time_y'])]['date_time_x'].max()
    return col

# FILTER BY MATCHED DATES TO CONDITIONAL MAX
mdf = mdf[mdf['date_time_x'] ==  mdf.apply(f, axis=1)].reset_index(drop=True)

This assumes you want to keep all rows of df2 (i.e., right join). Simply flip _x / _y suffixes for left join.

Parfait
  • 104,375
  • 17
  • 94
  • 125
  • 1
    Parfait - I think I actually solved my own problem and I believe adding the kwarg: `by = 'ID'` was the key. Thank you! – elPastor Feb 20 '17 at 20:02
  • I'm not quite sure of your answer or how I helped. Feel free to answer your own question for future readers and wait a few days to accept your own. I was waiting to hear response of how this solution did not work. Note that this only picks max date mirroring `asof` (not nearest). – Parfait Feb 20 '17 at 20:47
0

The current solution would work on a small dataset but if you have hundreds of rows... I'm afraid not.

So, what you want to do is as follows:

df = pd.merge_asof(df1, df2, on = 'date_time', by = 'ID', direction = 'nearest')
ga1996
  • 125
  • 1
  • 10