
Imagine you have a table "users" which contains 100,000 records and you need to find 3000 items by id.

Would it be faster to make this query via

Select * from users where id IN (2,5,30,89,...) # 3000 items

or would it be faster to store these 3000 items in another table and do a subquery, like:

Select * from users where id IN (select distinct id from lookuptable)
# lookuptable contains the 3000 records

Or is this completely the same? Thank you!

BvuRVKyUVlViVIc7
  • IN (hardcoded values) has a constant number of values, where a lookup from a table could vary, so the query optimizer may treat them differently. Second thing: `id IN (select distinct id from lookuptable)`, at least in Oracle, could be rewritten by the query optimizer as a `JOIN` (nested loops). I suggest comparing the actual execution plans. – Lukasz Szozda Sep 29 '18 at 09:12
  • The best way to handle this would be use JOIN between `users` and `lookuptable` – Chetan Sep 29 '18 at 09:12
  • Hi. (As we can expect:) This is a faq. Please always google many clear, concise & specific versions/phrasings of your question/problem/goal with & without your particular strings/names & read many answers. Add relevant keywords you discover to your searches. If you don't find an answer then post, using 1 variant search as title & keywords for tags. See the downvote arrow mouseover text. When you do have a non-duplicate code question to post please read & act on [mcve]. – philipxy Sep 29 '18 at 10:38
  • Possible duplicate of [Postgresql IN operator Performance: List vs Subquery](https://stackoverflow.com/q/40443409/3404097) – philipxy Sep 29 '18 at 10:50
  • My current generic comment re "better"/"best" etc: There's no such thing as "better"/"best" in engineering unless *you* define it. Also unfortunately all reasonable practical definitions require a ridiculous amount of experience with a ridiculous number of factors that interact with chaotic sensitivity to details. Make straightforward designs. When you demonstrate via measurement that a design and all alternatives you can think of have problems (whatever that means at the time), then ask a very specific question. Which should also define "better"/"best". https://meta.stackexchange.com/q/204461 – philipxy Sep 29 '18 at 10:51
  • See also [Optimizing a Postgres query with a large IN](https://dba.stackexchange.com/q/91247/43932). (Besides your own search results.) – philipxy Sep 29 '18 at 10:57
  • I'm still working on my answer, but [here](https://stackoverflow.com/q/52712022/6464308) is an interesting reference. – Laurenz Albe Oct 09 '18 at 06:03

3 Answers


The best way to find out is to run EXPLAIN ANALYZE on a working dataset. It will show you the query execution times and the chosen plan.

The query optimizer may use different techniques depending on the table size, database settings, memory settings, etc.

If the lookup table has only 3000 records, you don't need DISTINCT on it; if it is really big, has many more records, and DISTINCT yields 3000 unique rows, then the first solution might be faster.
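As an illustration of comparing plans (not from the original answer — the schema and data here are assumptions, using SQLite's EXPLAIN QUERY PLAN; in PostgreSQL you would use EXPLAIN ANALYZE the same way):

```python
import sqlite3

# Hypothetical in-memory schema just to demonstrate plan comparison.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE lookuptable (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)",
                 [(i, f"user{i}") for i in range(1, 1001)])
conn.executemany("INSERT INTO lookuptable (id) VALUES (?)",
                 [(i,) for i in range(1, 31)])

# Plan for the hardcoded IN list.
ids = ",".join(str(i) for i in range(1, 31))
plan1 = conn.execute(
    f"EXPLAIN QUERY PLAN SELECT * FROM users WHERE id IN ({ids})"
).fetchall()

# Plan for the subquery against the lookup table.
plan2 = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE id IN (SELECT id FROM lookuptable)"
).fetchall()

for row in plan1 + plan2:
    print(row)
```

Whether the two plans differ (and by how much) depends on the engine, the table sizes, and the statistics, which is exactly why measuring on your own dataset matters.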


In PostgreSQL, the fastest way is to create a lookup table and query like this:

SELECT * FROM users AS u
WHERE EXISTS (SELECT 1 FROM lookuptable AS l
              WHERE u.id = l.id);
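A minimal sketch of that pattern (not from the original answer; it uses Python's sqlite3 for illustration, and the table names are assumptions — in PostgreSQL you would typically use a temporary table for the ids):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, f"user{i}") for i in range(1, 101)])

# Load the wanted ids into a (temporary) lookup table...
conn.execute("CREATE TEMP TABLE lookuptable (id INTEGER PRIMARY KEY)")
wanted = [2, 5, 30, 89]
conn.executemany("INSERT INTO lookuptable (id) VALUES (?)",
                 [(i,) for i in wanted])

# ...and use EXISTS so the planner can execute a semi-join.
rows = conn.execute("""
    SELECT * FROM users AS u
    WHERE EXISTS (SELECT 1 FROM lookuptable AS l WHERE u.id = l.id)
""").fetchall()
print(rows)  # the four matching users
```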
Laurenz Albe
  • You are saying that the process of creating the temporary table, inserting the data into it, and running the query with the temporary table is *always* faster than using `in` with a fixed list. I don't believe that. – Gordon Linoff Sep 29 '18 at 11:35
  • I agree that it should be tested. But that's my guess, yes. – Laurenz Albe Sep 29 '18 at 20:22

I've created a database matching the requirements and tested it. From the "timing" point of view there is really no difference, but maybe that's because of my testing sandbox environment.

Anyway, I've "explained" these three queries:

1- select * from users where id in (1,2,3,4,5,6,7,8,9,10,..3000)

cost: "Index Scan using users_pkey on users (cost=4.04..1274.75 rows=3000 width=11)"

"  Index Cond: (id = ANY ('{1,2,3,4,5,6,7,8,9,10 (...)"

2- SELECT * FROM users AS u WHERE EXISTS (SELECT 1 FROM lookuptable AS l WHERE u.id = l.id); <- Note that I've removed the 'distinct', it's useless.

cost: "Merge Semi Join (cost=103.22..364.35 rows=3000 width=11)"

" Merge Cond: (u.id = l.id)"

" -> Index Scan using users_pkey on users u (cost=0.29..952.68 rows=30026 width=11)"

" -> Index Only Scan using lookuptable_pkey on lookuptable (cost=0.28..121.28 rows=3000 width=4)"

3- Select * from users where id IN (select id from lookuptable)

"Merge Semi Join (cost=103.22..364.35 rows=3000 width=11)"

" Merge Cond: (users.id = lookuptable.id)"

" -> Index Scan using users_pkey on users (cost=0.29..952.68 rows=30026 width=11)"

" -> Index Only Scan using lookuptable_pkey on lookuptable (cost=0.28..121.28 rows=3000 width=4)"

The explain graphic of the last two queries:



Anyway, as some comments above point out, you also have to add the cost of populating the lookuptable to the cost of the query, as well as the fact that you have to split the "querying" into different executions, which may cause "transactional problems". I'll use the first query.
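To reproduce the comparison end to end, including the cost of loading the lookup table, here is a rough benchmark sketch (not from the original answer; it uses Python with SQLite, so the absolute numbers will differ from PostgreSQL, and the schema is assumed):

```python
import random
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, f"user{i}") for i in range(1, 100_001)])
wanted = random.sample(range(1, 100_001), 3000)

# Variant 1: hardcoded IN list. The ids are trusted integers here;
# real code should parameterize user-supplied values.
id_list = ",".join(map(str, wanted))
t0 = time.perf_counter()
r1 = conn.execute(f"SELECT * FROM users WHERE id IN ({id_list})").fetchall()
t1 = time.perf_counter()

# Variant 2: populate a lookup table first, then use a subquery,
# so the load time is counted against this variant.
conn.execute("CREATE TEMP TABLE lookuptable (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO lookuptable (id) VALUES (?)",
                 [(i,) for i in wanted])
r2 = conn.execute(
    "SELECT * FROM users WHERE id IN (SELECT id FROM lookuptable)"
).fetchall()
t2 = time.perf_counter()

print(f"IN list: {t1 - t0:.4f}s, lookup table (incl. load): {t2 - t1:.4f}s")
```

Both variants should return the same 3000 rows; which one wins on time is exactly the kind of thing that varies by engine and dataset.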

Gaetano Piazzolla