Mysql Exists vs IN -- correlated subquery vs subquery?

Question

I'm curious about how the execution of EXISTS() is supposed to be faster than IN().

I was answering a question when Bill Karwin brought up a good point. when you use EXISTS() it is using a correlated subquery (dependent subquery) and IN() is only using a subquery.

EXPLAIN shows that EXISTS and NOT EXISTS both use a dependent subquery and IN / NOT IN both use just a subquery.. so I'm curious how a correlated subquery is faster than a subquery??

I've used EXISTS before and it does execute faster than IN which is why I'm confused.

Here is a SQLFIDDLE with the explains

EXPLAIN SELECT COUNT(t1.table1_id) 
FROM table1 t1 
WHERE EXISTS
(   SELECT 1 
    FROM table2 t2
    WHERE t2.table1_id <=> t1.table1_id
);

+-------+-----------------------+-----------+-------+---------------+-----------+--------+--------------------------+--------+------------------------------+
| ID    |   SELECT_TYPE         |   TABLE   | TYPE  | POSSIBLE_KEYS |   KEY     |KEY_LEN |  REF                     |   ROWS |  EXTRA                       |
+-------+-----------------------+-----------+-------+---------------+-----------+--------+--------------------------+--------+------------------------------+
|  1    |   PRIMARY             |   t1      | index | (null)        |   PRIMARY |   4    | (null)                   |   4    |  Using where; Using index    |
|  2    |   DEPENDENT SUBQUERY  |   t2      | REF   | table1_id     |  table1_id|   4    | db_9_15987.t1.table1_id  |   1    |  Using where; Using index    |
+-------+-----------------------+-----------+-------+---------------+-----------+--------+--------------------------+--------+------------------------------+

EXPLAIN SELECT COUNT(t1.table1_id) 
FROM table1 t1 
WHERE NOT EXISTS
(   SELECT 1 
    FROM table2 t2
    WHERE t2.table1_id = t1.table1_id
);
+-------+-----------------------+-----------+-------+---------------+-----------+--------+--------------------------+--------+------------------------------+
| ID    |   SELECT_TYPE         |   TABLE   | TYPE  | POSSIBLE_KEYS |   KEY     |KEY_LEN |  REF                     |   ROWS |  EXTRA                       |
+-------+-----------------------+-----------+-------+---------------+-----------+--------+--------------------------+--------+------------------------------+
|  1    |   PRIMARY             |   t1      | index | (null)        |   PRIMARY |   4    | (null)                   |   4    |  Using where; Using index    |
|  2    |   DEPENDENT SUBQUERY  |   t2      | ref   | table1_id     |  table1_id|   4    | db_9_15987.t1.table1_id  |   1    |  Using index                 |
+-------+-----------------------+-----------+-------+---------------+-----------+--------+--------------------------+--------+------------------------------+

EXPLAIN SELECT COUNT(t1.table1_id) 
FROM table1 t1 
WHERE t1.table1_id NOT IN 
(   SELECT t2.table1_id 
    FROM table2 t2
);
+-------+-------------------+-----------+-------+---------------+-----------+--------+----------+--------+------------------------------+
| ID    |   SELECT_TYPE     |   TABLE   | TYPE  | POSSIBLE_KEYS |   KEY     |KEY_LEN |  REF     |   ROWS |  EXTRA                       |
+-------+-------------------+-----------+-------+---------------+-----------+--------+----------+--------+------------------------------+
|  1    |   PRIMARY         |   t1      | index | (null)        |   PRIMARY |   4    | (null)   |   4    |  Using where; Using index    |
|  2    |   SUBQUERY        |   t2      | index | (null)        |  table1_id|   4    | (null)   |   2    |  Using index                 |
+-------+-------------------+-----------+-------+---------------+-----------+--------+----------+--------+------------------------------+

FEW questions

In the explains above, how does EXISTS have using where and using index in extras but NOT EXISTS does not have using where in extras?

How is a correlated subquery faster than a subquery?

So do you have a repro of `exists` executing faster? Also what version did you experience this on? `in` also [used to have the same problem](http://stackoverflow.com/q/3416076/73226) — Martin Smith, Sep 14 '14 at 21:08
@MartinSmith well I switched my queries from IN to EXISTS about a year ago because they were executed faster with EXISTS (not by a whole lot something like half a second to a second faster).. but I just got a new computer and downloaded the latest version of MySQL.. I just ran a query and IN ran faster by .004 seconds... has there been a fix for the execution plan/optimizer recently? — John Ruddell, Sep 14 '14 at 23:10
I don't know much about the MySql optimiser but I believe 5.6 introduced some changes. https://dev.mysql.com/doc/refman/5.6/en/subquery-optimization.html — Martin Smith, Sep 14 '14 at 23:17
@MartinSmith after reading through that doc it seems that if there is no group by or any aggregates IN will perform faster... if there is exists will perform faster... I have proven it with a test on my local. you should consider writing an answer with this and the other link you posted for bill karwin — John Ruddell, Sep 15 '14 at 04:37

score 11 · Accepted Answer · answered Sep 21 '14 at 13:37

This is a RDBMS-agnostic answer, but may help nonetheless. In my understanding, the correlated (aka, dependent) subquery is perhaps the most often falsely accused culprit for bad performance.

The problem (as it is most often described) is that it processes the inner query for every row of the outer query. Therefore, if the outer query returns 1,000 rows, and the inner query returns 10,000, then your query has to slog through 10,000,000 rows (outer×inner) to produce a result. Compared to the 11,000 rows (outer+inner) from a non-correlated query over the same result sets, that ain't good.

However, this is just the worst case scenario. In many cases, the DBMS will be able to exploit indexes to drastically reduce the rowcount. Even if only the inner query can use an index, the 10,000 rows becomes ~13 seeks, which drops the total down to 13,000.

The exists operator can stop processing rows after the first, cutting down the query cost further, especially when most outer rows match at least one inner row.

In some rare cases, I have seen SQL Server 2008R2 optimise correlated subqueries to a merge join (which traverses both sets only once - best possible scenario) where a suitable index can be found in both inner and outer queries.

The real culprit for bad performance is not necessarily correlated subqueries, but nested scans.

I think this. While RDBMS engines can do any kind of transformations internally so we never know, but if we follow question exactly, a subquery will need to be performed entirely while a correlated subquery - only for matching rows. In case `IN` subquery would produce a big result set, and we are interested in only a limitted number of rows, the correlated subquery may skip some work. MrTux answer also makes some sense though. — akostadinov, Mar 25 '22 at 21:52

MrTux · Answer 2 · 2014-09-14T21:16:10.443

This depends on the MySQL version - there is a bug in the MySQL query optimizer in versions up to 6.0.

Subqueries with "IN" were not optimized correctly (but executed again and again like dependant ones). This bug does not affect exists queries or joins.

The problem is that, for a statement that uses an IN subquery, the optimizer rewrites it as a correlated subquery. Consider the following statement that uses an uncorrelated subquery:

SELECT ... FROM t1 WHERE t1.a IN (SELECT b FROM t2);

The optimizer rewrites the statement to a correlated subquery:

SELECT ... FROM t1 WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.b = t1.a);

If the inner and outer queries return M and N rows, respectively, the execution time becomes on the order of O(M×N), rather than O(M+N) as it would be for an uncorrelated subquery.

Refs.

The `O(MxN) case doesn't look as though it should apply for the OP in principle as the correlated sub query can use an index. In MySQL when it evaluates the inner un correlated sub query presumably it would still have to store it somewhere and end up replaying the materialised result in much the same way on the inner side of a nested loops join? In which case is the materialised result indexed? Or will it use a hash join or something here? — Martin Smith, Sep 14 '14 at 21:29

score 1 · Answer 3 · answered Sep 21 '14 at 19:22

As you already know, a subquery does not use values from the outer query, thus it is executed only once. The correlated subquery is synchronized, thus it is executed for every row processed in the outer query.

The advantage of using EXISTS is that, if it considered to be met, the execution of the subquery stops after returning at least one row. So, it can be quicker than a simple subquery. But it's not a general rule! All depend on the query your are executing, the query optimizer, and the SQL execution engine version.

EXISTS is recommended to be used when you have e.g an if conditional statement, because it is certainly quicker than count.

You can't really compare both subqueries using a simple benchmark of 4 or 3 queries.

Hope it's useful!

Mysql Exists vs IN -- correlated subquery vs subquery?

3 Answers3

Linked