I need to retrieve 1,000 - 50,000 records identified by unique numeric ID from a ~150M record table. We are hosting the database in AWS RDS. The table has several integer
columns, one character varying(500)
and one bigint
for id column. Every column has btree
index.
The current production query is
SELECT *
FROM mytable t
WHERE id IN (N1, N2, .. Nm)
This returns in under 1 second for m < 1,000, which is acceptable. The problem is the time increases linearly with m. The query takes 20+ seconds for m = 30,000.
We tried creating indexed temporary table and using INNER JOIN with no noticeable performance improvement. (https://stackoverflow.com/a/24647700/226960)
Here is the abbreviated dump for m > 70k.
CREATE TEMPORARY TABLE temp_phrases (phrase_id integer) ON COMMIT DROP
CREATE TABLE temp_phrases: OK, time: 0.01 seconds.
CREATE INDEX temp_phrases_phrase_id_idx ON temp_phrases(phrase_id)
CREATE INDEX '.temp_phrases.'_phrase_id_idx: OK, time: 0 seconds.
INSERT INTO TABLE temp_phrases: 70544, time: 0.3 seconds.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
EXPLAIN SELECT * FROM temp_phrases LEFT JOIN thesaurus ON thesaurus.id = temp_phrases.phrase_id
Nested Loop Left Join (cost=0.57..665368.61 rows=79815 width=34)
-> Seq Scan on temp_phrases (cost=0.00..1111.15 rows=79815 width=4)
-> Index Scan using thesaurus_pkey on thesaurus (cost=0.57..8.31 rows=1 width=42)
Index Cond: (id = temp_phrases.phrase_id)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
SELECT * FROM temp_phrases LEFT JOIN thesaurus ON thesaurus.id = temp_phrases.phrase_id
TEMP TABLE AND JOIN: 70544 results, 52.2seconds
It takes less than one second to get results if we repeat the query, which would indicate hardware-related bottleneck
https://stackoverflow.com/a/24254825/226960
Is it possible to improve the original id IN(_list_)
query? Would getting additional RDS IOPS help?
EDIT
EXPLAIN (ANALYZE, BUFFERS)
output
INSERT INTO TABLE temp_phrases: 41504, time: 0.17 seconds.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM temp_phrases LEFT JOIN thesaurus ON thesaurus.id = temp_phrases.phrase_id
Nested Loop Left Join (cost=0.57..396319.90 rows=46920 width=34) (actual time=0.708..23874.200 rows=41504 loops=1)
Buffers: shared hit=167593 read=39458 dirtied=138, local hit=184
-> Seq Scan on temp_phrases (cost=0.00..653.20 rows=46920 width=4) (actual time=0.012..21.138 rows=41504 loops=1)
Buffers: local hit=184
-> Index Scan using thesaurus_pkey on thesaurus (cost=0.57..8.42 rows=1 width=42) (actual time=0.569..0.572 rows=1 loops=41504)
Index Cond: (id = temp_phrases.phrase_id)
Buffers: shared hit=167593 read=39458 dirtied=138
Planning time: 1.493 ms
Execution time: 23887.493 ms
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
SELECT * FROM temp_phrases LEFT JOIN thesaurus ON thesaurus.id = temp_phrases.phrase_id
TEMP TABLE AND JOIN: 41504 results, 24.2seconds