1

I am learning about the two functions merge and join in pandas.
I have a lot of SQL experience so I am trying to understand how are similar things done in Pandas.
So... OK, I was watching a lecture about pandas merge/join but it raised more questions in me than it answered I'm afraid. I have two main questions.

  1. How are merge and join different in pandas, even their docs seem quite similar? Not sure why pandas needs both. Even this page doesn't answer that, it seems.

https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html

  1. Can I join (like in SQL join) but based on inequalities
    between two columns or based on more complicated logical conditions e.g.

     select * from 
     t1 join t2 on t1.a < t2.b+1 and upper(t1.s1) > upper(t2.s2)
    
     select * from 
     t1 join t2 on t1.a < t2.b+1 or upper(t1.s1) > upper(t2.s2 || t2.s3)  
    
     select * from 
     t1 join t2 on t1.a < t2.b+10 or len(t1.s1) > coalesce(t2.c, 5)
    
     select * from 
     t1 join t2 on ( t1.a * t1.a < t2.b + 10 ) or ( upper(t1.s1) > REGEXP_REPLACE(t2.s2,'(.*) (.*)','\2, \1') ) 
    

How can we translate these three SQL queries e.g. to pandas joins/merges?

Note 1: a,b,c are some numeric columns, while s1, s2, s3 are string columns

The SQL flavor I used was Postgres in particular but my question is generic.
The regexp expression in the 4th example replaces the first and last name (if we assume t2.s2 has a full name in it e.g. 'John Smith'). || is string concatenation in this SQL dialect.

I am just trying to understand what are my limits in pandas (if any)
when compared to SQL w.r.t. using merge/join.

Note 2: I realize I can pre-massage my two data frames (left/right) so that I simplify the above join conditions. But say we don't want to do this for whatever reason. Then... is it possible to express the above SQL join semantics in pandas.

peter.petrov
  • 38,363
  • 16
  • 94
  • 159
  • 2
    `join` is specifically for joining the two datasets together on a common `index`. It is simply more syntactical than `merge` when joining on index, but you can also do so with `merge`. With `merge`, you have more flexibility and can `merge` two dataframes together on `columns`, `index` or both. I have never used a concept of `index` with `SQL`, so if you have never used `pandas`, then the concept of `index` might be confusing. In terms of limits, you shouldn't really have any limits with `pandas` compared to `SQL` for the operations. – David Erickson Jan 04 '21 at 23:44
  • 1
    If you can provide some sample input data for `t1` and `t2` and expected output, then I can give one of your SQL queries above a go. I'm not very familiar with joining on inequalities, so it would be helpful to see input and desired output. I don't think the answer will be as straightforward with python. For example, this solution doesn't even use `merge` or `join`: https://stackoverflow.com/questions/46525786/how-to-join-two-dataframes-for-which-column-values-are-within-a-certain-range. You would have to create a common `range` column and merge on that or something. – David Erickson Jan 05 '21 at 01:12

0 Answers0