8

So I've got 2 DataFrames I'd like to merge together.

I'm merging on 3 columns; 2 of them are a straightforward equi-join.

joined_df = pd.merge(df1, df2, how='left', on=['name', 'city'])

I want to use a third column as well, but it needs to be a comparison rather than an equality, something like this:

joined_df = pd.merge(df1, df2, how='left',
on=['name', 'city', 'df1.year' >= 'df2.year_min'])

Not sure what the right syntax is here.

If it were SQL, it would be easy for me.

SELECT * FROM df1
JOIN df2 ON (df1.name = df2.name AND df1.city = df2.city AND df1.year > df2.year_min)

Any assistance?

firestreak

2 Answers

6

Pandas merge only supports equi-joins. You'll need to add a second step that filters the result, something like this:

joined_df = df1.merge(df2, how='left', on=['name', 'city'])
joined_df = joined_df[joined_df.year > joined_df.year_min]
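
For illustration, a minimal sketch with made-up sample frames, reusing the column names from the question:

import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'b'], 'city': ['x', 'y'], 'year': [2018, 2015]})
df2 = pd.DataFrame({'name': ['a', 'b'], 'city': ['x', 'y'],
                    'year_min': [2016, 2016], 'val': [1, 2]})

# equi-join on name/city first, then filter on the year comparison
joined_df = df1.merge(df2, how='left', on=['name', 'city'])
joined_df = joined_df[joined_df.year > joined_df.year_min]
# only the ('a', 'x') row survives: 2018 > 2016, while 2015 <= 2016
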
cs95
  • Okay, I was hoping to be able to do that comparison in the merge itself, but I'll take your word that it's not possible with dataframes. – firestreak May 28 '18 at 21:25
  • 2
    @firestreak, if you want to do it in one line, you can chain [`query`](http://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.DataFrame.query.html) like this: `df1.merge(df2, on=['name', 'city']).query('year > year_min')`. Also, `how='left'` would do a `left join`, not the `inner join` you show in your SQL snippet. – Haleemur Ali May 28 '18 at 21:31
  • Does anyone know how this affects the memory footprint? This seems like a partial Cartesian product that is filtered down again later. – CheradenineZK Mar 15 '21 at 15:33
  • @CheradenineZK I don't think it is a Cartesian join, since there is already an equality merge on existing columns; this should prune the data frame before getting to the non-equi join. – sammywemmy Oct 10 '21 at 12:04
  • 1
    @sammywemmy, well of course you're right, it is not a full Cartesian join. But sadly, in practice this doesn't make much of a difference for large fact tables with many timestamps per join key; it just blows up the memory quickly. I tested it with sample files and in my case this just wasn't feasible. – CheradenineZK Oct 19 '21 at 07:21
  • I think a better way would be to get the join indices/positions and run the filters for the remaining conditions before creating the final dataframe; it should be more efficient. By how much? Really can't tell. There is a conditional join function in pyjanitor; maybe that helps. You can have a look at R's data.table too. If I may ask, why not use a database/SQL? – sammywemmy Oct 19 '21 at 07:42
4

You can use merge_asof; the default direction is a backward merge:

pd.merge_asof(df1, df2, left_on='year', right_on='year_min', by=['name', 'city'])
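
As a minimal sketch with made-up sample frames (reusing the question's column names): merge_asof needs both frames sorted on the key columns, and it pairs each df1 row with only the single nearest df2 row whose year_min is less than or equal to year, not every row satisfying the condition (pass allow_exact_matches=False if you need a strict > comparison).

import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'a'], 'city': ['x', 'x'], 'year': [2015, 2018]})
df2 = pd.DataFrame({'name': ['a', 'a'], 'city': ['x', 'x'],
                    'year_min': [2014, 2016], 'val': [1, 2]})

# both frames must be sorted on the asof key before calling merge_asof
df1 = df1.sort_values('year')
df2 = df2.sort_values('year_min')

# backward direction: each df1 row gets the nearest df2 row with year_min <= year
out = pd.merge_asof(df1, df2, left_on='year', right_on='year_min', by=['name', 'city'])
# year 2015 matches year_min 2014 (val 1); year 2018 matches year_min 2016 (val 2)
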
BENY