
I need to retrieve 1,000–50,000 records, identified by a unique numeric ID, from a ~150M-record table. The database is hosted on AWS RDS. The table has several integer columns, one character varying(500) column, and a bigint id column. Every column has a btree index.
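
For context, a minimal sketch of what such a table definition might look like; apart from the bigint id column, the column names below are placeholders, not the real schema.

-- Hypothetical DDL matching the description above: a bigint id, several
-- integer columns, one character varying(500), and a btree index on every column.
CREATE TABLE mytable (
    id     bigint PRIMARY KEY,       -- btree index comes with the primary key
    col_a  integer,                  -- placeholder integer columns
    col_b  integer,
    col_c  integer,
    phrase character varying(500)
);

CREATE INDEX ON mytable (col_a);
CREATE INDEX ON mytable (col_b);
CREATE INDEX ON mytable (col_c);
CREATE INDEX ON mytable (phrase);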

The current production query is

SELECT *
FROM mytable t
WHERE id IN (N1, N2, .. Nm)

This returns in under 1 second for m < 1,000, which is acceptable. The problem is that the time increases linearly with m: the query takes 20+ seconds for m = 30,000.

We tried creating an indexed temporary table and using an INNER JOIN, with no noticeable performance improvement. (https://stackoverflow.com/a/24647700/226960)

Here is the abbreviated dump for m > 70k.

CREATE TEMPORARY TABLE temp_phrases (phrase_id integer) ON COMMIT DROP
CREATE TABLE temp_phrases: OK, time: 0.01 seconds.
CREATE INDEX temp_phrases_phrase_id_idx ON temp_phrases(phrase_id)
CREATE INDEX '.temp_phrases.'_phrase_id_idx: OK, time: 0 seconds.
INSERT INTO TABLE temp_phrases: 70544, time: 0.3 seconds.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
EXPLAIN SELECT * FROM temp_phrases LEFT JOIN thesaurus ON thesaurus.id = temp_phrases.phrase_id
Nested Loop Left Join  (cost=0.57..665368.61 rows=79815 width=34)
->  Seq Scan on temp_phrases (cost=0.00..1111.15 rows=79815 width=4)
->  Index Scan using thesaurus_pkey on thesaurus  (cost=0.57..8.31 rows=1 width=42)
    Index Cond: (id = temp_phrases.phrase_id)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
SELECT * FROM temp_phrases LEFT JOIN thesaurus ON thesaurus.id = temp_phrases.phrase_id
TEMP TABLE AND JOIN: 70544 results, 52.2seconds

If we repeat the query, the results come back in under one second, which would indicate a hardware-related bottleneck (https://stackoverflow.com/a/24254825/226960).

Is it possible to improve the original id IN (list) query? Would provisioning additional RDS IOPS help?

EDIT
EXPLAIN (ANALYZE, BUFFERS) output

INSERT INTO TABLE temp_phrases: 41504, time: 0.17 seconds.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM temp_phrases LEFT JOIN thesaurus ON thesaurus.id = temp_phrases.phrase_id
Nested Loop Left Join  (cost=0.57..396319.90 rows=46920 width=34) (actual time=0.708..23874.200 rows=41504 loops=1)
  Buffers: shared hit=167593 read=39458 dirtied=138, local hit=184
  ->  Seq Scan on temp_phrases  (cost=0.00..653.20 rows=46920 width=4) (actual time=0.012..21.138 rows=41504 loops=1)
        Buffers: local hit=184
  ->  Index Scan using thesaurus_pkey on thesaurus  (cost=0.57..8.42 rows=1 width=42) (actual time=0.569..0.572 rows=1 loops=41504)
        Index Cond: (id = temp_phrases.phrase_id)
        Buffers: shared hit=167593 read=39458 dirtied=138
Planning time: 1.493 ms
Execution time: 23887.493 ms
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
SELECT * FROM temp_phrases LEFT JOIN thesaurus ON thesaurus.id = temp_phrases.phrase_id
TEMP TABLE AND JOIN: 41504 results, 24.2seconds
BojanG
  • Try with `EXISTS` instead of `IN`. I believe you will get better performance. – Jacob H Dec 14 '17 at 21:27
  • 1
    Please **[EDIT]** your question and add the execution plans generated using **`explain (analyze, buffers)`**. [**Formatted text**](http://stackoverflow.com/help/formatting) please, [no screen shots](http://meta.stackoverflow.com/questions/285551/why-may-i-not-upload-images-of-code-on-so-when-asking-a-question/285557#285557) –  Dec 14 '17 at 22:12
  • Do a (VACUUM) ANALYZE after the INSERT INTO temp_phrases. This will update the statistics, and maybe result in an indexed plan. – joop Dec 15 '17 at 13:00
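
For reference, a sketch of the rewrites suggested in the two comments above (EXISTS instead of IN, plus an ANALYZE on the temp table after loading it); identifiers follow the question and the dump, and this is untested here.

-- Run ANALYZE after bulk-loading the IDs into temp_phrases (as in the dump
-- above) so the planner has fresh statistics for the temp table.
ANALYZE temp_phrases;

-- EXISTS instead of IN, driven by the temp table of IDs.
SELECT t.*
FROM thesaurus t
WHERE EXISTS (
    SELECT 1
    FROM temp_phrases p
    WHERE p.phrase_id = t.id
);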

1 Answer


Using a btree index, or no index at all, just makes the join try to match every single combination of rows.

If, however, you make it stop after the first match, you get roughly twice the speed.

Put a unique key or primary key on the temporary table. When you join on it, the planner knows that once one match has been found, no other match can follow.

In my experience, when selecting from a table A and joining with a table B, it helps a lot to join on a unique key. The boost is usually about 50% less time compared to a plain index.

If a unique key is not possible, a hash index will do much the same, though it takes a while to build.

But as you can see in the plan, it does a sequential scan, not an index scan, of the temporary table. Using any index there will probably help, since you have quite a lot of tuples.
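
A sketch of the two variants described above, using the identifiers from the question (a suggestion to try, not a tested fix):

-- Variant 1: make phrase_id the primary key of the temp table, so the planner
-- knows each outer row can have at most one match.
CREATE TEMPORARY TABLE temp_phrases (
    phrase_id integer PRIMARY KEY
) ON COMMIT DROP;

-- Variant 2: if a unique/primary key is not an option, a hash index instead:
-- CREATE INDEX temp_phrases_phrase_id_hash ON temp_phrases USING hash (phrase_id);

-- Load the IDs as before, then join.
SELECT *
FROM temp_phrases
LEFT JOIN thesaurus ON thesaurus.id = temp_phrases.phrase_id;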

Alex
  • We do have an index on the temp table, since Craig Ringer recommends that here: https://stackoverflow.com/a/24647700/226960, but we are not forcing its use. According to a_horse_with_no_name, Postgres (and other engines) will use a sequential scan if more than 5-10% of the rows are returned: https://stackoverflow.com/a/5203827/226960 – BojanG Dec 15 '17 at 00:31
  • I'm not that strong on the theoretical side, but I know from the experience of joining a 10M-row table with a 3M-row table B, where close to 100% of the rows of table B were selected: adding a unique key for the join condition consistently made things faster, in multiple cases. Looking at the plan more carefully, however, it looks like the `thesaurus` table is the one that would benefit from a unique key in this query with this plan. Or, maybe, a hash index. – Alex Dec 15 '17 at 07:56