138

I'm curious which of the following below would be more efficient?

I've always been a bit cautious about using IN because I believe SQL Server turns the result set into a big IF statement. For a large result set, this could result in poor performance. For small result sets, I'm not sure either is preferable. For large result sets, wouldn't EXISTS be more efficient?

WHERE EXISTS (SELECT * FROM Base WHERE bx.BoxID = Base.BoxID AND [Rank] = 2)

vs.

WHERE bx.BoxID IN (SELECT BoxID FROM Base WHERE [Rank = 2])
MarredCheese
  • 17,541
  • 8
  • 92
  • 91
Randy Minder
  • 47,200
  • 49
  • 204
  • 358
  • 9
    The best way to find out is to try it out and do some meassurements. – Klaus Byskov Pedersen Jan 14 '10 at 15:47
  • 11
    there's **got** to be a gazillion duplicates for this...... – marc_s Jan 14 '10 at 15:52
  • @marc_s: yes, but I think it really is case dependent. I guess the canonical answer is klausbyskov's (or whoever can claim prior art). – D'Arcy Rittich Jan 14 '10 at 15:54
  • 7
    FYI if you're wanting the *most* performant way, you can `select 1 from Base...` in your `where exists` since you don't actually care about the results, just that a row actually exists. – brad Aug 22 '11 at 15:22
  • @brad Not really. The compiler will disregard anything between the `select` and the `from` (except if you put another subquery in there or something similarly...strange). IMO, the syntax should have been `semi join table2 on ....`. – George Menoutis Nov 23 '21 at 18:12

10 Answers10

164

EXISTS will be faster because once the engine has found a hit, it will quit looking as the condition has proved true.

With IN, it will collect all the results from the sub-query before further processing.

MarredCheese
  • 17,541
  • 8
  • 92
  • 91
keithwarren7
  • 14,094
  • 8
  • 53
  • 74
  • 85
    This used to be true but in current versions (at least 2008) the optimizer is much smarter... it actually treats IN () just like an EXISTS (). – Aaron Bertrand Jan 14 '10 at 16:51
  • 14
    @Aaron - yes, typically the optimzer will internally produce a better plan. However, relying on internal shortcuts could be detrimental in more complex scenarios. – Scott Coates May 26 '11 at 05:43
  • Strange. While the query plan is exactly the same for both for one of my queries, the explain shows a rows number of 972 for IN and 959 for EXIST. All else is the same. This is on Mysql 5.1.42 though, so it may be just old. – techdude Dec 19 '14 at 20:55
  • 3
    This is just simply wrong. It was in 2010 and still is. – Magnus May 23 '16 at 19:54
  • 4
    IN and EXISTS have the exact same query plan, and IO. There is no reason for thinking they are different in performance. check your time statistics and comprove yourself – Nelssen Oct 02 '17 at 09:25
  • I've seen IN get treated as a full sub query even in SQL 2019 and had to switch it to EXISTS. I think stats out of date might cause this to happen, but I am just guessing. Most of the time they are the same. – Brain2000 Oct 06 '22 at 21:05
45

The accepted answer is shortsighted and the question a bit loose in that:

1) Neither explicitly mention whether a covering index is present in the left, right, or both sides.

2) Neither takes into account the size of input left side set and input right side set.
(The question just mentions an overall large result set).

I believe the optimizer is smart enough to convert between "in" vs "exists" when there is a significant cost difference due to (1) and (2), otherwise it may just be used as a hint (e.g. exists to encourage use of an a seekable index on the right side).

Both forms can be converted to join forms internally, have the join order reversed, and run as loop, hash or merge--based on the estimated row counts (left and right) and index existence in left, right, or both sides.

crokusek
  • 5,345
  • 3
  • 43
  • 61
  • 1
    The optimiser **always** gives the same plan for `IN` and `EXISTS`. Try and come up with any case where they don't get the same plan (though this does not apply to `NOT IN` and `NOT EXISTS`) – Martin Smith Apr 30 '19 at 12:17
  • @MartinSmith I assume you know what you are talking about, but do you have any proof that the plans are always the same? If so, it'd clear up the decade-long disagreement here. – MarredCheese Jun 06 '19 at 05:16
  • 1
    @MarredCheese - the onus is on the people that claim that it is different to produce a single example of this – Martin Smith Jun 06 '19 at 07:46
39

I've done some testing on SQL Server 2005 and 2008, and on both the EXISTS and the IN come back with the exact same actual execution plan, as other have stated. The Optimizer is optimal. :)

Something to be aware of though, EXISTS, IN, and JOIN can sometimes return different results if you don't phrase your query just right: http://weblogs.sqlteam.com/mladenp/archive/2007/05/18/60210.aspx

Adam Nofsinger
  • 4,004
  • 3
  • 34
  • 42
9

I'd go with EXISTS over IN, see below link:

SQL Server: JOIN vs IN vs EXISTS - the logical difference

There is a common misconception that IN behaves equally to EXISTS or JOIN in terms of returned results. This is simply not true.

IN: Returns true if a specified value matches any value in a subquery or a list.

Exists: Returns true if a subquery contains any rows.

Join: Joins 2 resultsets on the joining column.

Blog credit: https://stackoverflow.com/users/31345/mladen-prajdic

Community
  • 1
  • 1
Tanner
  • 22,205
  • 9
  • 65
  • 83
6

There are many misleading answers answers here, including the highly upvoted one (although I don't believe their ops meant harm). The short answer is: These are the same.

There are many keywords in the (T-)SQL language, but in the end, the only thing that really happens on the hardware is the operations as seen in the execution query plan.

The relational (maths theory) operation we do when we invoke [NOT] IN and [NOT] EXISTS is the semi join (anti-join when using NOT). It is not a coincidence that the corresponding sql-server operations have the same name. There is no operation that mentions IN or EXISTS anywhere - only (anti-)semi joins. Thus, there is no way that a logically-equivalent IN vs EXISTS choice could affect performance because there is one and only way, the (anti)semi join execution operation, to get their results.

An example:

Query 1 ( plan )

select * from dt where dt.customer in (select c.code from customer c where c.active=0)

Query 2 ( plan )

select * from dt where exists (select 1 from customer c where c.code=dt.customer and c.active=0)
George Menoutis
  • 6,894
  • 3
  • 19
  • 43
  • Have you tested it? If so, can you share your SQL and your results? – UnhandledExcepSean Dec 04 '18 at 15:17
  • Tested it multiple times. I can create another test case, and I will, but a test case does not mean that the optimizer will do the exact same plan on tables with different statistics. This might lead someone to think the answer is partial - but the nonexistence of multiple semijoin operators is fact. Maybe I'll find a list somewhere and link it. – George Menoutis Dec 04 '18 at 15:20
4

The execution plans are typically going to be identical in these cases, but until you see how the optimizer factors in all the other aspects of indexes etc., you really will never know.

Cade Roux
  • 88,164
  • 40
  • 182
  • 265
  • this, when I read how good is the optimizer, I'm always thinking that in exactly my use case it will fail ¯\_(ツ)_/¯ – akostadinov Mar 25 '22 at 22:06
3

So, IN is not the same as EXISTS nor it will produce the same execution plan.

Usually EXISTS is used in a correlated subquery, that means you will JOIN the EXISTS inner query with your outer query. That will add more steps to produce a result as you need to solve the outer query joins and the inner query joins then match their where clauses to join both.

Usually IN is used without correlating the inner query with the outer query, and that can be solved in only one step (in the best case scenario).

Consider this:

  1. If you use IN and the inner query result is millions of rows of distinct values, it will probably perform SLOWER than EXISTS given that the EXISTS query is performant (has the right indexes to join with the outer query).

  2. If you use EXISTS and the join with your outer query is complex (takes more time to perform, no suitable indexes) it will slow the query by the number of rows in the outer table, sometimes the estimated time to complete can be in days. If the number of rows is acceptable for your given hardware, or the cardinality of data is correct (for example fewer DISTINCT values in a large data set) IN can perform faster than EXISTS.

  3. All of the above will be noted when you have a fair amount of rows on each table (by fair I mean something that exceeds your CPU processing and/or ram thresholds for caching).

So the ANSWER is it DEPENDS. You can write a complex query inside IN or EXISTS, but as a rule of thumb, you should try to use IN with a limited set of distinct values and EXISTS when you have a lot of rows with a lot of distinct values.

The trick is to limit the number of rows to be scanned.

Regards,

MarianoC

MarianoC
  • 69
  • 3
1

To optimize the EXISTS, be very literal; something just has to be there, but you don't actually need any data returned from the correlated sub-query. You're just evaluating a Boolean condition.

So:

WHERE EXISTS (SELECT TOP 1 1 FROM Base WHERE bx.BoxID = Base.BoxID AND [Rank] = 2)

Because the correlated sub-query is RBAR, the first result hit makes the condition true, and it is processed no further.

Kohlbrr
  • 3,861
  • 1
  • 21
  • 24
Josh Lewis
  • 19
  • 1
  • I'd always be extremely cautious in using LEFT JOIN + NULL coding, because it is very easy to get missed or skewed results if you aren't very careful in your NULL handling. I've very rarely found a situation where EXISTS or a CTE ( for finding duplication, or synthetic insertion for missing data) , doesn't both meet the same requirements and outperform the LEFT JOIN + NULL – Josh Lewis Apr 05 '14 at 20:21
  • 3
    TOP 1 should be complete extraneous (or event redundant) when used with EXISTS. EXISTS always returns as soon it find any matching row. – Karl Kieninger May 22 '15 at 16:22
  • I did not see any performance benefit with this approach so far. Please show some screenshots of the Execution Plans – DaFi4 May 20 '16 at 08:55
1

I know that this is a very old question but I think my answer would add some tips.

I just came across a blog on mssqltips sql exists vs in vs join and it turns out that it is generally the same performance wise.

But the downside of one vs the other are as follows:

  1. The in statement has a downside that it can only compare the two tables on one column.

  2. The join statement will run on duplicate values, while in and exists will ignore duplicates.

But when you look at the execution time there is no big difference.

The interesting thing is when you create an index on the table, the execution from the join is better.

And I think that join has another upside that it's easier to write and understand especially for newcomers.

Shmiel
  • 1,201
  • 10
  • 25
-1

Off the top of my head and not guaranteed to be correct: I believe the second will be faster in this case.

  1. In the first, the correlated subquery will likely cause the subquery to be run for each row.
  2. In the second example, the subquery should only run once, since not correlated.
  3. In the second example, the IN will short-circuit as soon as it finds a match.
D'Arcy Rittich
  • 167,292
  • 40
  • 290
  • 283