
I attended an interesting job interview recently, where I was asked about optimizing a query with a WHERE..IN clause containing a long list of scalars (thousands of values, that is). This question is NOT about subqueries in the IN clause, but about a plain list of scalars.

I answered right away that this can be optimized using an INNER JOIN with another table (possibly a temporary one) that contains only those scalars. My answer was accepted, and the reviewer noted that "no database engine currently can optimize long WHERE..IN conditions to be performant enough". I nodded.
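For reference, a minimal sketch of what I had in mind (table and column names are made up):

```sql
-- Load the scalars into a (temporary) table and join against it,
-- instead of passing thousands of literals in an IN list.
-- "items" and "tmp_ids" are hypothetical names.
CREATE TEMPORARY TABLE tmp_ids (id INTEGER PRIMARY KEY);
INSERT INTO tmp_ids (id) VALUES (1), (2), (3);  -- ...thousands of values

SELECT i.*
FROM items AS i
INNER JOIN tmp_ids AS t ON t.id = i.id;
```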

But when I walked out, I started to have some doubts. The condition seemed too trivial and too widely used for a modern RDBMS not to be able to optimize it. So I started digging.

PostgreSQL:

It seems that PostgreSQL parses a scalar IN() construction into a ScalarArrayOpExpr structure, whose values are sorted. This structure is later used during an index scan to locate matching rows. EXPLAIN ANALYZE for such queries shows only one loop, and no joins are done. So I would expect such a query to be even faster than an INNER JOIN. I tried some queries on my existing database and my tests supported that position. But I didn't take much care about test purity, and Postgres was running under Vagrant, so I might be wrong.
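For illustration, this is roughly how such a check looks (table and column names are made up; the exact plan output will vary):

```sql
-- With an index on id, the plan typically shows a single index scan whose
-- condition looks like: Index Cond: (id = ANY ('{1,2,3,...}'::integer[]))
-- i.e. the constants end up in one ScalarArrayOpExpr and there is no join node.
EXPLAIN ANALYZE
SELECT *
FROM items               -- hypothetical table
WHERE id IN (1, 2, 3 /* ...thousands of constants... */);
```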

MSSQL Server:

MSSQL Server builds a hash structure from the list of constant expressions and then does a hash join with the source table. Even though no sorting seems to be done, I think that is a performance match. I didn't run any tests, since I have no experience with this RDBMS.
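If someone wants to verify this, one rough way to inspect the plan without executing the query might look like the following (untested sketch; table and column names are made up):

```sql
-- SHOWPLAN_TEXT returns the estimated plan instead of running the query,
-- which should reveal how the constant list is handled.
SET SHOWPLAN_TEXT ON;
GO
SELECT o.*
FROM dbo.orders AS o     -- hypothetical table
WHERE o.customer_id IN (1, 2, 3 /* ...thousands of constants... */);
GO
SET SHOWPLAN_TEXT OFF;
GO
```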

MySQL Server:

Slide 13 of these slides says that before 5.0 this problem indeed existed in MySQL in some cases. Other than that, I didn't find any problems related to poor IN () handling. Unfortunately, I didn't find any proof of the opposite either. If you did, please correct me.

SQLite:

The documentation page hints at some problems, but I tend to believe the things described there are really at the conceptual level. I found no other information.

So, I'm starting to think I misunderstood my interviewer or misused Google ;) Or maybe it's because we didn't set any constraints and our talk became a little vague (we didn't specify any concrete RDBMS or other conditions; it was just an abstract discussion).

It looks like the days when databases rewrote IN() as a chain of OR conditions (which, by the way, can sometimes cause problems with NULL values in the list) are long gone. Or not?

Of course, in cases where the list of scalars is longer than the maximum database protocol packet allows, an INNER JOIN might be the only available solution.
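For example, in MySQL the relevant limit is max_allowed_packet, so a check and workaround might look like this (MySQL-specific; names are made up):

```sql
-- Check the current limit on a single statement/packet in MySQL.
SHOW VARIABLES LIKE 'max_allowed_packet';
-- If the literal IN list would not fit, fall back to the temporary-table
-- join sketched earlier, inserting the values in several smaller batches.
```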

I think that in some cases the query parsing time alone (if the query was not prepared) can kill performance.

Also, a database may be unable to prepare an IN(?) query with a variable number of placeholders, which leads to reparsing it again and again (and that may kill performance). Actually, I never tried, but I think that even in such cases parsing and planning time is not huge compared to query execution.
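To make the point concrete: each list length produces a different statement text, so each one is parsed and planned separately (illustrative placeholder syntax; the exact marker depends on the driver, and "items" is a made-up table):

```sql
-- Every distinct number of values yields a new, unique statement:
SELECT * FROM items WHERE id IN (?);
SELECT * FROM items WHERE id IN (?, ?);
SELECT * FROM items WHERE id IN (?, ?, ?);
-- ...one more variant per possible list length, so plan caching helps little.
```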

But other than that, I don't see any other problems. Well, apart from the problem of just HAVING this problem: if you have queries that contain thousands of IDs inside, something is wrong with your architecture.

Do you?

Vladislav Rastrusny
  • From my experience, SQL Server gets query planner timeouts on large amounts of IN parameters. – Jakub Kania Dec 02 '15 at 16:23
  • Although interesting, it does not fit this site. You know that... I'm voting to close. – Clodoaldo Neto Dec 02 '15 at 16:30
  • I wrote [this thing](http://stackoverflow.com/a/34015333), which had more to do with random numbers. I did work it end-to-end. What I was calling Appendix C uses large IN lists: one thousand elements, and the result comes back in two tenths of a second. – Drew Dec 03 '15 at 04:56
  • I saw a case of a 1TB MySQL table with 70K values in the `IN()`. It worked. It was "reasonably" fast considering how much work it had to do. It also uncovered some memory problems in MySQL (4.1(?)). – Rick James Dec 09 '15 at 01:05
  • @RickJames What was the EXPLAIN for that query and what were the timings? – Vladislav Rastrusny Dec 09 '15 at 08:01
  • @VladislavRastrusny - Sorry, that was 3 years ago in a previous job. – Rick James Dec 09 '15 at 17:49

2 Answers


Your answer is only correct if you build an index (preferably a primary key index) on the list, unless the list is really small.

Any description of optimization is definitely database-specific. However, the MySQL documentation is quite specific about how it optimizes IN:

Returns 1 if expr is equal to any of the values in the IN list, else returns 0. If all values are constants, they are evaluated according to the type of expr and sorted. The search for the item then is done using a binary search. This means IN is very quick if the IN value list consists entirely of constants.

This would definitely be a case where using IN would be faster than using another table -- and probably faster than another table using a primary key index.
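For illustration, a quick way to see how MySQL treats the constant list (hypothetical table `t` with an index on `id`; the reported access type and row estimates will vary):

```sql
-- With an index on id, EXPLAIN typically reports a "range" access over
-- the sorted constant list rather than a join or a full table scan.
EXPLAIN SELECT * FROM t WHERE id IN (3, 1, 7, 5);
```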

I think that SQL Server replaces the IN with a list of ORs. These would then be implemented as sequential comparisons. Note that sequential comparisons can be faster than a binary search, if some elements are much more common than others and those appear first in the list.

Gordon Linoff

I think it is bad application design. The values used with the IN operator are most probably not hardcoded but dynamic. In that case we should always use prepared statements, the only reliable mechanism to prevent SQL injection. Here that means formatting the prepared statement dynamically (since the number of placeholders is dynamic too), and it also results in excessive hard parsing (as many unique statements as there are distinct numbers of IN values: IN (?), IN (?,?), ...).

I would either load these values into a table and use a join, as you mentioned (unless the loading is too much overhead), or use an Oracle pipelined function, IN foo(params), where the params argument can be a complex structure (an array) coming from memory (PL/SQL, Java, etc.). If the number of values is larger, I would consider using EXISTS (SELECT 1 FROM mytable m WHERE m.key = x.key) or EXISTS (SELECT x FROM foo(params)) instead of IN. In such cases EXISTS provides better performance than IN.
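A rough, untested sketch of the pipelined-function approach mentioned above (Oracle; all names are made up):

```sql
-- A collection type and a pipelined function that streams its elements,
-- so the bound values can be queried like a table and used with EXISTS/IN.
-- num_tab, foo, mytable and key are hypothetical names.
CREATE OR REPLACE TYPE num_tab AS TABLE OF NUMBER;
/
CREATE OR REPLACE FUNCTION foo(params IN num_tab)
  RETURN num_tab PIPELINED
IS
BEGIN
  FOR i IN 1 .. params.COUNT LOOP
    PIPE ROW (params(i));
  END LOOP;
  RETURN;
END;
/
-- The bound collection (:params) comes from application memory.
SELECT x.*
FROM mytable x
WHERE EXISTS (
  SELECT 1
  FROM TABLE(foo(:params)) p
  WHERE p.COLUMN_VALUE = x.key
);
```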

rolish
  • _I think it is bad application design_: your whole answer is tangential to the question. – Clodoaldo Neto Dec 02 '15 at 16:29
  • My answer would probably have been better placed as a comment on the original question, as it is in fact not an answer. I fully agree with Vladislav's sentence "If you have queries that contain thousands of IDs inside, something is wrong with your architecture.", which means there is no point in answering the question, as it becomes a useless academic discussion about optimizing the wrong use of the SQL language. – rolish Dec 04 '15 at 18:17
  • I am not sure using IN with a large ID list is ALWAYS bad architecture. I think it depends on the task, and in some cases it may be necessary. Though in MOST cases, the architecture should be revised carefully to check whether such cases can be avoided. – Vladislav Rastrusny Dec 09 '15 at 08:00