0

In Difference between EXISTS and IN in SQL? it's clear that lots of people think "exists is faster if the subquery returns many rows, in is faster if it returns a few rows" but I'd like to know a bit more detail about the actual implementation of each of these keywords and how it could give rise to a difference in performance based on result set size

Lots of people were claiming that EXISTS is simply looking for a true or false (as though IN isn't??) and gives up as soon as a true is found, whereas IN will "scan the entire set" - really? If I were writing a dbms and using a naive loop to establish the truth of 3 IN (1,2,3,4,5) I'm pretty sure I'd code it up such that if I found 3 after a checking 1 and 2, I wouldn't check 4 and 5 to see if they were also 3

I've a strong suspicion that, for the most part, IN, EXISTS and even JOIN are implemented identically/the query optimisers of modern DBs rewrite/approach these different keywords in the same way anyway - is there a modern database out there that does still exhibit an appreciable performance difference with EXISTS vs IN and what is the actual implementation detail that causes the difference?

Side note; in that question, some references to Oracle 8/SQL Server 2000 were made and the strategies they adopted, and I can see how a performance difference would arise with them, but I don't think these can be classed as modern databases unless the latest iterations of these products still implement these keywords the same as they did 20 years ago...

Caius Jard
  • 72,509
  • 5
  • 49
  • 80
  • This question might be too broad, because even within a given RDBMS the performance of `IN` over `EXISTS` can vary. For example, see [this SO question](https://stackoverflow.com/questions/14190788/subqueries-with-exists-vs-in-mysql) which discusses `IN` versus `EXISTS` in MySQL for subqueries. Either might be the fastest, depending on the situation. – Tim Biegeleisen Jan 17 '19 at 05:37
  • This is RDBMS dependent but it's not about static lists, it's about putting a `SELECT` inside the `IN`. Older versions of SQL Server might run the actual SELECT and use it as a static list rather than joining directly on a table index. I believe newer versions are smarter. I have no tests to prove it, I can only say I have seen this happen in the wild - replacing an `IN (SELECT somestuff)` with an `EXISTS (SELECT *)` changed from a IN list search to a join on a clustered index. – Nick.Mc Jan 17 '19 at 05:41
  • 1
    I just compared query plans on SQL 2016 for a trivial comparison between the IN case and the EXISTS case and the query plans were identical. So there's probably no difference anymore in SQL Server anyway. It might be different for a much more complex query. I guess the answer is..... "Data Scientist" ;) – Nick.Mc Jan 17 '19 at 05:50
  • 1
    Assuming you're comparing `WHERE outer.id IN (SELECT inner.id ...)` vs `WHERE EXISTS (SELECT ... WHERE inner.id = outer.id)` then a smarter optimizer will treat both queries as identical. SQL Server does that. – Salman A Jan 17 '19 at 05:58
  • @Tim thanks for that link; I was interested to note that even there people seemed to be claiming that exists was some sort of magical thing that didn't compare data like IN has to, though there was a reasonable assertion that IN might use Count, and thus a performance difference might be observed for "count scans the whole list" reasons – Caius Jard Jan 17 '19 at 06:00
  • Incidentally @Tim, your link eventually led me to https://dev.mysql.com/doc/refman/8.0/en/subquery-optimization-with-exists.html which was interesting reading around optimization challenges, particularly with null handling and how the developer can use advance knowledge of nulls in the data to choose a query keyword that enhances performance by changing the null handling strategy – Caius Jard Jan 17 '19 at 06:20
  • 1
    You can almost always convert one into the other, so why bother trying to learn a hard and fast rule? *If* your query is performing poorly, *then* examine the plans and *if* you're curious whether it'll make a difference, re-write the query into the other form and *measure* it. What's almost universally true is that there are no universal rules you can follow that will always produce the best performance. – Damien_The_Unbeliever Jan 17 '19 at 07:56
  • Hi @Damien_The_Unbeliever, it's not so much about learning a hard and fast rule, it was more about questioning the validity of the assertions on some answers of the other (highly noted, search engine visible) question. I suspect it's of historical relevance only and perhaps some gardening is required on the other question – Caius Jard Jan 17 '19 at 08:07

1 Answers1

2

Although this is too broad, there are multiple semantic differences among the three approaches:

  • JOIN can result in multiple rows when there are multiple matches.
  • NOT IN filters out all rows if any value is NULL.
  • EXISTS works in all databases regardless of conditions, including conditions on multiple columns. IN (historically and in many databases) only works on one column.

Although there are cases where the three overlap, they are not equivalent.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786