
Imagine you have a table "users" which contains 100,000 records and you need to find 3000 items by id.

Would it be faster to make this query via

Select * from users where id IN (2,5,30,89,...) # 3000 items

or would it be faster to store these 3000 items in another table and do a subquery, like:

Select * from users where id IN (select distinct id from lookuptable)
# lookuptable contains the 3000 records

Or is this completely the same? Thank you!

BvuRVKyUVlViVIc7
  • IN (hardcoded values) has a constant number of values, where a lookup from a table could vary, so the query optimizer may treat them differently. Second thing: `id IN (select distinct id from lookuptable)`, at least in Oracle, could be rewritten by the query optimizer as a `JOIN` (nested loops). I suggest comparing the actual execution plans. – Lukasz Szozda Sep 29 '18 at 09:12
  • The best way to handle this would be use JOIN between `users` and `lookuptable` – Chetan Sep 29 '18 at 09:12
  • Hi. (As we can expect:) This is a faq. Please always google many clear, concise & specific versions/phrasings of your question/problem/goal with & without your particular strings/names & read many answers. Add relevant keywords you discover to your searches. If you don't find an answer then post, using 1 variant search as title & keywords for tags. See the downvote arrow mouseover text. When you do have a non-duplicate code question to post please read & act on [mcve]. – philipxy Sep 29 '18 at 10:38
  • Possible duplicate of [Postgresql IN operator Performance: List vs Subquery](https://stackoverflow.com/q/40443409/3404097) – philipxy Sep 29 '18 at 10:50
  • My current generic comment re "better"/"best" etc: There's no such thing as "better"/"best" in engineering unless *you* define it. Also unfortunately all reasonable practical definitions require a ridiculous amount of experience with a ridiculous number of factors that interact with chaotic sensitivity to details. Make straightforward designs. When you demonstrate via measurement that a design and all alternatives you can think of have problems (whatever that means at the time), then ask a very specific question. Which should also define "better"/"best". https://meta.stackexchange.com/q/204461 – philipxy Sep 29 '18 at 10:51
  • See also [Optimizing a Postgres query with a large IN](https://dba.stackexchange.com/q/91247/43932). (Besides your own search results.) – philipxy Sep 29 '18 at 10:57
  • I'm still working on my answer, but [here](https://stackoverflow.com/q/52712022/6464308) is an interesting reference. – Laurenz Albe Oct 09 '18 at 06:03

3 Answers


The best way to find out is to run EXPLAIN ANALYZE on a working dataset. It will show you the query execution times and the chosen plan.

The query optimizer may use different techniques depending on the table size, database settings, memory settings, etc.

If the lookup table has only 3000 records, you don't need DISTINCT on it; if it is really big, has many more records, and DISTINCT yields 3000 unique rows, then the first solution might be faster.
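As an illustration of comparing plans (not from the original answer — the schema and data here are assumptions, using SQLite's EXPLAIN QUERY PLAN; in PostgreSQL you would use EXPLAIN ANALYZE the same way):

```python
import sqlite3

# Hypothetical in-memory schema just to demonstrate plan comparison.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE lookuptable (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)",
                 [(i, f"user{i}") for i in range(1, 1001)])
conn.executemany("INSERT INTO lookuptable (id) VALUES (?)",
                 [(i,) for i in range(1, 31)])

# Plan for the hardcoded IN list.
ids = ",".join(str(i) for i in range(1, 31))
plan1 = conn.execute(
    f"EXPLAIN QUERY PLAN SELECT * FROM users WHERE id IN ({ids})"
).fetchall()

# Plan for the subquery against the lookup table.
plan2 = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE id IN (SELECT id FROM lookuptable)"
).fetchall()

for row in plan1 + plan2:
    print(row)
```

Whether the two plans differ (and by how much) depends on the engine, the table sizes, and the statistics, which is exactly why measuring on your own dataset matters.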


In PostgreSQL, the fastest way is to create a lookup table and query like this:

SELECT * FROM users AS u
WHERE EXISTS (SELECT 1 FROM lookuptable AS l
              WHERE u.id = l.id);
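A minimal sketch of that pattern (not from the original answer; it uses Python's sqlite3 for illustration, and the table names are assumptions — in PostgreSQL you would typically use a temporary table for the ids):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, f"user{i}") for i in range(1, 101)])

# Load the wanted ids into a (temporary) lookup table...
conn.execute("CREATE TEMP TABLE lookuptable (id INTEGER PRIMARY KEY)")
wanted = [2, 5, 30, 89]
conn.executemany("INSERT INTO lookuptable (id) VALUES (?)",
                 [(i,) for i in wanted])

# ...and use EXISTS so the planner can execute a semi-join.
rows = conn.execute("""
    SELECT * FROM users AS u
    WHERE EXISTS (SELECT 1 FROM lookuptable AS l WHERE u.id = l.id)
""").fetchall()
print(rows)  # the four matching users
```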
Laurenz Albe
  • You are saying that the process of creating the temporary table, inserting the data into it, and running the query with the temporary table is *always* faster than using `in` with a fixed list. I don't believe that. – Gordon Linoff Sep 29 '18 at 11:35
  • I agree that it should be tested. But that's my guess, yes. – Laurenz Albe Sep 29 '18 at 20:22

I've created a database matching the requirements and tested it. From the "timing" point of view there is really no difference, but maybe that's because of my testing sandbox environment.

Anyway, I've "explained" these three queries:

1- select * from users where id in (1,2,3,4,5,6,7,8,9,10,..3000)

cost: "Index Scan using users_pkey on users (cost=4.04..1274.75 rows=3000 width=11)"

"  Index Cond: (id = ANY ('{1,2,3,4,5,6,7,8,9,10 (...)"

2- SELECT * FROM users AS u WHERE EXISTS (SELECT 1 FROM lookuptable AS l WHERE u.id = l.id); <- Note that I've removed the 'distinct', it's useless.

cost: "Merge Semi Join (cost=103.22..364.35 rows=3000 width=11)"

" Merge Cond: (u.id = l.id)"

" -> Index Scan using users_pkey on users u (cost=0.29..952.68 rows=30026 width=11)"

" -> Index Only Scan using lookuptable_pkey on lookuptable (cost=0.28..121.28 rows=3000 width=4)"

3- Select * from users where id IN (select id from lookuptable)

"Merge Semi Join (cost=103.22..364.35 rows=3000 width=11)"

" Merge Cond: (users.id = lookuptable.id)"

" -> Index Scan using users_pkey on users (cost=0.29..952.68 rows=30026 width=11)"

" -> Index Only Scan using lookuptable_pkey on lookuptable (cost=0.28..121.28 rows=3000 width=4)"

The explain graphic of the last two queries:



Anyway, as some comments above point out, you also have to add the cost of populating the lookuptable to the cost of the query, as well as the fact that you have to split the "querying" into different executions, which may cause "transactional problems". I'll use the first query.
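To reproduce the comparison end to end, including the cost of loading the lookup table, here is a rough benchmark sketch (not from the original answer; it uses Python with SQLite, so the absolute numbers will differ from PostgreSQL, and the schema is assumed):

```python
import random
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, f"user{i}") for i in range(1, 100_001)])
wanted = random.sample(range(1, 100_001), 3000)

# Variant 1: hardcoded IN list. The ids are trusted integers here;
# real code should parameterize user-supplied values.
id_list = ",".join(map(str, wanted))
t0 = time.perf_counter()
r1 = conn.execute(f"SELECT * FROM users WHERE id IN ({id_list})").fetchall()
t1 = time.perf_counter()

# Variant 2: populate a lookup table first, then use a subquery,
# so the load time is counted against this variant.
conn.execute("CREATE TEMP TABLE lookuptable (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO lookuptable (id) VALUES (?)",
                 [(i,) for i in wanted])
r2 = conn.execute(
    "SELECT * FROM users WHERE id IN (SELECT id FROM lookuptable)"
).fetchall()
t2 = time.perf_counter()

print(f"IN list: {t1 - t0:.4f}s, lookup table (incl. load): {t2 - t1:.4f}s")
```

Both variants should return the same 3000 rows; which one wins on time is exactly the kind of thing that varies by engine and dataset.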

Gaetano Piazzolla