59

I am having trouble while trying to understand the concept of semi-join and how it is different from conventional join. I have tried some article already but not satisfied with the explanation, could someone please help me to understand it?

johncitizen
  • 117
  • 1
  • 2
  • 14
Henu
  • 1,622
  • 2
  • 22
  • 27
  • 2
    It's a check for the existence of any tuples in a relation that match a condition. http://awads.net/wp/2007/05/01/back-to-basics-anti-joins-and-semi-joins/ – David Aldridge Feb 15 '17 at 12:41

3 Answers3

60

Simple example. Let's select students with grades using left outer join:

SELECT DISTINCT s.id
FROM  students s
      LEFT JOIN grades g ON g.student_id = s.id
WHERE g.student_id IS NOT NULL

Now the same with left semi-join:

SELECT s.id
FROM  students s
WHERE EXISTS (SELECT 1 FROM grades g
              WHERE g.student_id = s.id)

The latter is generally more efficient (depending on concrete DBMS and query optimizer).

Iurii Ant
  • 877
  • 8
  • 15
  • 3
    Why does it always say `SELECT 1` in the first part of a `WHERE EXISTS` statement? – Brent Brewington Oct 04 '18 at 20:20
  • 9
    @BrentB something has to be SELECTed, but there's no need for actual data (a decent optimizer should be able to optimize away any alternative like `SELECT *` so it's more of reinforcing the intent for human peers that there's no use of the data) – Robert Monfera Jun 07 '19 at 23:45
  • 11
    @JamesWierzba A query optimizer might recognize that they are identical and use the same query plan to run either query. However, conceptually, in the first query, you would first find all of the grades for each student and then delete any students which had no grades and then remove duplicate students until you’re left with unique ones. In the second query, you go student by student and check if there is at least one grade for that student. So no need to enumerate all grades or deduplicate the student IDs at the end. Again, an optimizer might be smart enough to use the same algorithm for both – binki Sep 11 '19 at 16:03
  • 9
    `x left join y on c where y.a is not null` is `x inner join y on c`. Left join is unneeded & misleading in the 1st query. "much more efficient" False. Depends on the DBMS optimizer/implementation, and the optimization is simple. – philipxy Jan 24 '22 at 23:19
  • @philipxy , I guess the example was started in good faith as anti-join vs semi anti join and then the negation got removed. So, 1st example should have been ‘x left join y on c where y.x_id is null’ and second query should be an anti semi join, either with exist clause or as the difference set operator using the keywords minus or except. Again the terms are pretty fuzzy and most of the time query engines would come up with same execution plan. If they do not, in my experience, it is often the exists clauses that come on top. – Paddy Aug 26 '23 at 16:11
  • @Paddy It is clear from all 3 paragraphs that both queries are semi-join--per the question. Maybe the answerer used LEFT because there are left & right anti-joins & "anti-join" without left/right means left anti-join. But left & right anti-joins are projections of INNER JOIN & one can interchange s & g to get right from left--just as one would with the 2nd query, since it has no joins. So it's just distracting/misleading to use LEFT/RIGHT JOIN just because they have explicit LEFT/RIGHT. PS "anti semi join" is a recent unnecessary misleading MS term for anti join. PS Your comment is unclear. – philipxy Aug 28 '23 at 04:19
21

As far as I know SQL dialects that support SEMIJOIN/ANTISEMI are U-SQL/Cloudera Impala.

SEMIJOIN:

Semijoins are U-SQL’s way filter a rowset based on the inclusion of its rows in another rowset. Other SQL dialects express this with the SELECT * FROM A WHERE A.key IN (SELECT B.key FROM B) pattern.

More info Semi Join and Anti Join Should Have Their Own Syntax in SQL:

“Semi” means that we don’t really join the right hand side, we only check if a join would yield results for any given tuple.

-- IN
SELECT *
FROM Employee
WHERE DeptName IN (
  SELECT DeptName
  FROM Dept
)

-- EXISTS
SELECT *
FROM Employee
WHERE EXISTS (
  SELECT 1
  FROM Dept
  WHERE Employee.DeptName = Dept.DeptName
)

EDIT:

Another dialect that supports SEMI/ANTISEMI join is KQL:

kind=leftsemi (or kind=rightsemi)

Returns all the records from the left side that have matches from the right. The result table contains columns from the left side only.

let t1 = datatable(key:long, value:string)  
[1, "a",  
2, "b",
3, "c"];
let t2 = datatable(key:long)
[1,3];
t1 | join kind=leftsemi (t2) on key

demo

Output:

key  value
1    a
3    c
Peter Vandivier
  • 606
  • 1
  • 9
  • 31
Lukasz Szozda
  • 162,964
  • 23
  • 234
  • 275
-7

As I understand, a semi join is a left join or right join:

What's the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN?

So the difference between a left (semi) join and a "conventional" join is that you only retrieve the data of the left table (where you have a match on your join condition). Whereas with a full (outer) join (I think thats what you mean by conventional join), you retrieve the data of both tables where your condition matches.

Community
  • 1
  • 1
  • 12
    A left or right join is called an _outer join_. A _semi-join_ is not the same thing at all: it returns a set of rows in one table that is constrained by the existence of data in some other table, without actually drawing any data from that other table. It's implemented by `EXISTS` as shown in Iurii Ant's answer. [See Relational Algebra (Semijoin)](https://en.wikipedia.org/wiki/Relational_algebra#Semijoin_.28.E2.8B.89.29.28.E2.8B.8A.29) for the theory or [this excellent article](https://blogs.msdn.microsoft.com/craigfr/2006/07/19/introduction-to-joins/) for a SQL Server-specific introduction. – Joe Farrell Oct 23 '17 at 21:26
  • Why this wrong answer it still selected as correct? – Manoel Vilela Oct 03 '18 at 10:48
  • Nowhere near correct. Semi Join is Either implemented by `EXISTS` or by `WHERE IN`. In both cases, it behaves as a filter where the elements to be a check against can come from another table (or the same table as the upper select). – Daniel Heilper Oct 05 '20 at 10:37