
I have a Postgres table which has almost 18 million rows, and I am trying to run this query:

select * from answer where submission_id = 5 and deleted_at is NULL;

There is a partial index on the table on the submission_id column. This is the command used to create the index:

CREATE INDEX answer_submission_id ON answer USING btree (submission_id) WHERE (deleted_at IS NULL)

This is the EXPLAIN ANALYZE output of the above select query:

Gather  (cost=1000.00..3130124.70 rows=834 width=377) (actual time=7607.568..7610.130 rows=2 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  Buffers: shared hit=2144966 read=3
  I/O Timings: read=6.169
  ->  Parallel Seq Scan on answer  (cost=0.00..3129041.30 rows=348 width=377) (actual time=6501.951..7604.623 rows=1 loops=3)
        Filter: ((deleted_at IS NULL) AND (submission_id = 5))
        Rows Removed by Filter: 62213625
        Buffers: shared hit=2144966 read=3
        I/O Timings: read=6.169
Planning Time: 0.117 ms
Execution Time: 7610.154 ms

Ideally it should pick the answer_submission_id index, but Postgres is going for a sequential scan.

Any help would be greatly appreciated.

Nikhil Ponduri
  • Could you try to run `vacuum ANALYZE answer` before you query again? – D-Shih May 03 '22 at 12:28
  • Can you add the output of the following to the question, run with `psql`: 1) `\d answer` 2) `\dt+ answer` 3) `\di+ answer_submission_id` – Laurenz Albe May 03 '22 at 12:57
  • Even with the estimation error, it should be using the index. Are you sure it exists and is not marked as invalid? – jjanes May 03 '22 at 16:26

1 Answer


The execution plan shows a large deviation between the estimated row count and the actual row count.

The PostgreSQL optimizer is a cost-based optimizer (CBO): it runs the query using whichever candidate execution plan has the lowest estimated cost.

So wrong statistics can lead it to choose a bad execution plan.

Here is a link that shows how wrong statistics can cause a slow query: Why are bad row estimates slow in Postgres?

First, I would use this query to check when the table was last analyzed and vacuumed:

SELECT
  schemaname, relname,
  last_vacuum, last_autovacuum,
  vacuum_count, autovacuum_count,
  last_analyze,last_autoanalyze
FROM pg_stat_user_tables
where relname = 'tablename';

If your statistics are out of date, you can run ANALYZE "tablename" to collect new statistics from the table; how long the ANALYZE scan takes depends on the table size.
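
For example, against the table from the question (assuming it is simply named answer), that would be:

-- collect fresh planner statistics so the row estimates improve
ANALYZE answer;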

For large tables, ANALYZE takes a random sample of the table contents, rather than examining every row. This allows even very large tables to be analyzed in a small amount of time. Note, however, that the statistics are only approximate, and will change slightly each time ANALYZE is run, even if the actual table contents did not change. This might result in small changes in the planner's estimated costs shown by EXPLAIN. In rare situations, this non-determinism will cause the planner's choices of query plans to change after ANALYZE is run. To avoid this, raise the amount of statistics collected by ANALYZE, as described below.
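
If you want to raise the amount of statistics collected for a single column, here is a minimal sketch using the table and column from the question (the target value 500 is only an example; the default comes from the default_statistics_target setting, normally 100):

-- example only: raise the per-column statistics target, then re-analyze
ALTER TABLE answer ALTER COLUMN submission_id SET STATISTICS 500;
ANALYZE answer;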

When we UPDATE or DELETE data, dead tuples are created; they can still exist in the heap or in indexes even though they can no longer be queried. VACUUM helps reclaim the storage occupied by dead tuples.
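
A minimal sketch for that, again assuming the table name answer (VERBOSE reports how many dead tuples were removed):

-- reclaim storage occupied by dead tuples left behind by UPDATE/DELETE
VACUUM (VERBOSE) answer;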

D-Shih