1

I am trying to merge two large data frames based on two common columns in these data frames. there is a small attempt and debate here but no promising solution

df1.year<=df2.year(same or later year to be manufactured)

df1.maker=df2.maker AND df1.location=df2.location

I prepared a small mock data to explain:

first data frame:

data = np.array([[2014,"toyota","california","corolla"], 
                 [2015,"honda"," california", "civic"], 
                 [2020,"hyndai","florida","accent"], 
                 [2017,"nissan","NaN", "sentra"]]) 


df = pd.DataFrame(data, columns = ['year', 'make','location','model'])
df

second data frame:

data2 = np.array([[2012,"toyota","california","airbag"], 
                 [2017,"toyota","california", "wheel"], 
                 [2022,"hyndai","newyork","seat"], 
                 [2017,"nissan","london", "light"]]) 
 

df2 = pd.DataFrame(data2, columns = ['year', 'make','location','id'])
df2

desired output:

data3 = np.array([[2017,"toyota",'corolla',"california", "wheel"]]) 

df3 = pd.DataFrame(data3, columns = ['year', 'make','model','location','id'])
df3

I tried to use the below approach but it is to slow and also not so accurate:

df4= pd.merge(df,df2, on=['location','make'], how='outer')
df4=df4.dropna()
df4['year'] = df4.apply(lambda x : x['year_y'] if x['year_y'] >= x['year_x'] else "0", axis=1)

Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158
hilo
  • 116
  • 11
  • So you know how to merge? If yes, please show your attempt. And then also make it clear in the question that your attempt works but too slow. – SomeDude Aug 27 '22 at 01:46
  • please see my solution that fails – hilo Aug 27 '22 at 02:04
  • What do you mean by "not so accurate"? – Jia Gao Aug 27 '22 at 02:15
  • the approach I have here seems to work but I am not sure if it is the right way to do. I mean, look at the code, isn't it ugly and inefficient? I am also worriedif the approach has any false negatives to be dropped.... – hilo Aug 27 '22 at 02:17

3 Answers3

2

You can achieve it with a merge_asof (one to one left merge) and dropna:

# ensure numeric year
df['year'] = pd.to_numeric(df['year'])
df2['year'] = pd.to_numeric(df2['year'])

(pd.merge_asof(df.sort_values('year'),
               df2.sort_values('year')
                  .assign(year2=df2['year']),
               on='year', by=['make', 'location'],
               direction='forward')
   .dropna(subset='id')
   .convert_dtypes('year2')
 )

NB. The intermediate is the size of df.

Output:

   year    make    location    model     id  year2
0  2014  toyota  california  corolla  wheel   2017

one to many

As merge_asof is a one to one left join, if you want a one to many left join (or right join), you can invert the inputs and the direction.

I added an extra row for 2017 to demonstrate the difference.

   year    make    location          id
0  2012  toyota  california      airbag
1  2017  toyota  california       wheel
2  2017  toyota  california  windshield
3  2022  hyndai     newyork        seat
4  2017  nissan      london       light

Right join:

(pd.merge_asof(df2.sort_values('year'),
               df.sort_values('year'),
               on='year', by=['make', 'location'],
               direction='backward')
   .dropna(subset='model')
 )

NB. The intermediate is the size of df2.

Output:

   year    make    location          id    model
1  2017  toyota  california       wheel  corolla
2  2017  toyota  california  windshield  corolla
mozway
  • 194,879
  • 13
  • 39
  • 75
  • Shouldn't merge_asof specify direction='forward' as per condition in OP df1.year <= df2.year? From doc : *A “forward” search selects the first row in the right DataFrame whose ‘on’ key is greater than or equal to the left’s key.* – SomeDude Aug 27 '22 at 02:21
  • @SomeDude of course! Plus a tweak for the correct year – mozway Aug 27 '22 at 02:25
  • Upvoted. But as you might know merge for large dfs in pandas showed to be much slower (especially for 'outer' merges) than with R's datatable. – SomeDude Aug 27 '22 at 02:40
  • 1
    @SomeDude well if the dataset is huge `pandas` might not be the best tool for that ;) – mozway Aug 27 '22 at 02:45
  • I have a last question, what if we have more than one column in place of 'id'... I have 3 more columns and I am trying to figure it out since yesterday. it would be great if you share some insights. – hilo Aug 28 '22 at 04:09
  • Not sure I follow. The best might be to open a new question with a detailed example. – mozway Aug 28 '22 at 06:39
0

this should work:

df4= pd.merge(df,df2, on=['location','make'], how='inner')
df4.where(df4.year_x<=df4.year_y).dropna()

Output:

    year_x  make    location    model   year_y  id
1   2014    toyota  california  corolla 2017    wheel
  • 1
    This creates however all combinations of location/make which is precisely what OP wants to avoid – mozway Aug 27 '22 at 02:34
0

Try this code (here 'make' and 'location' are common columns):

df_outer = pd.merge(df, df2, on=['make', 'location'], how='inner') 
df3 = df_outer[df['year'] <= df2['year']]
Timus
  • 10,974
  • 5
  • 14
  • 28