0

I am trying to merge 2 dataframes, with exact matching on some columns and as_of matching on some other column (typically a date). The intention is very well described in this post (I'll copy paste the main content below):

Pandas: Approximate join on one column, exact match on other columns

The post above was answered ; only it dates back from 2016, before the introduction of pandas.merge_asof. I believe there can be an easier answer now that it's been released. Brutal approach would be to merge as_of for each group of rows with the same values for the cols on which I want to merge exactly on. But is there a more elegant version?

Precise description of desired input and outputs:

Inputs

df1 = pd.DataFrame({'index': ['a1','a2','a3','a4'], 'col1': ['1232','432','432','123'], 'col2': ['asd','dsa12','dsa12','asd2'], 'col3': ['1','2','2','3'], 'date': ['2010-01-23','2016-05-20','2010-06-20','2008-10-21'],}).set_index('index')

df1
Out[430]: 
       col1   col2 col3        date
index                              
a1     1232    asd    1  2010-01-23
a2      432  dsa12    2  2016-05-20
a3      432  dsa12    2  2010-06-20
a4      123   asd2    3  2008-10-21

df2 = pd.DataFrame({'index': ['b1','b2','b3','b4'], 'col1': ['132','432','432','123'], 'col2': ['asd','dsa12','dsa12','sd2'], 'col3': ['1','2','2','3'], 'date': ['2010-01-23','2016-05-23','2010-06-10','2008-10-21'],}).set_index('index')

df2
Out[434]: 
      col1   col2 col3        date    b_col
index                             
b1     132    asd    1  2010-01-23        1
b2     432  dsa12    2  2016-05-23        2
b3     432  dsa12    2  2010-06-10        3
b4     123    sd2    3  2008-10-21        4

Outputs:

       col1   col2 col3        date b_col
index                                                     
a2      432  dsa12    2  2016-05-20     2
a3      432  dsa12    2  2010-06-20     3

NOTE 1: the reason why I need to do this is that I need something like groupby(...)[...].rolling(...).transform(...) with latency which doesn't seem to exist yet, unless I am missing something?

NOTE 2: I want to avoid computing all couples and then filtering as the dataframe may get too big.

Vincent
  • 482
  • 6
  • 22

1 Answers1

0

I have tried to get closer to your problem. However, I did not try merge_asof but merge. I hope this approach can help you:

import numpy as np
import pandas as pd


df1 = pd.DataFrame({'index': ['a1', 'a2', 'a3', 'a4'], 'col1': ['1232', '432', '432', '123'],
                'col2': ['asd', 'dsa12', 'dsa12', 'asd2'], 'col3': ['1', '2', '2', '3'],
                'date': ['2010-01-23', '2016-05-20', '2010-06-20', '2008-10-21'],
                }).set_index('index')

df2 = pd.DataFrame({'index': ['b1', 'b2', 'b3', 'b4'], 'col1': ['132', '432', '432', '123'],
                'col2': ['asd', 'dsa12', 'dsa12', 'sd2'], 'col3': ['1', '2', '2', '3'],
                'date': ['2010-01-23', '2016-05-23', '2010-06-10', '2008-10-21'],
                }).set_index('index')


columns = ['col1', 'col2', 'col3']


                                                                                                                                 
new_dic = pd.merge(df1, df2, on=columns, right_index=True).drop_duplicates(subset=['date_x']).drop(labels='date_y', axis=1)          
                                                                                                                             
                                                                                                                  

print(new_dic)
Higs
  • 384
  • 2
  • 7
  • Hi @Higs ; thanks for the effort. It gets the desired output, but only in this very specific case. Otherwise you are missing my point. Which shows that my example is not good enough at all. Will try and come up with a better one. Also, I want to avoid computing all couples and then filtering as the dataframe may get too big. – Vincent Oct 20 '20 at 20:27
  • I have changed my example for more clarity – Vincent Oct 20 '20 at 20:36
  • Can you please send me some test cases so that I know roughly what you need the program for and can test it. – Higs Oct 20 '20 at 22:56