
I have a table with ~4.5 million rows covering 3 months of data. I have indexed the "accessedAt" column, which stores time as an epoch value and has datatype BIGINT. When I ran the query, Postgres did a bitmap index scan over ~700K rows in ~48.2 s. But when I dropped the index, it did a sequential scan over the same ~700K rows in ~4 s.
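For context, the table and index look roughly like this (the index name is taken from the plan below; the other details are assumptions):

    -- Sketch of the schema as described; id is assumed to be bigint
    -- (the plan's width=8 suggests an 8-byte column), and "accessedAt"
    -- stores epoch seconds as BIGINT.
    CREATE TABLE access_histories (
        id           bigint PRIMARY KEY,
        "accessedAt" bigint NOT NULL
    );

    CREATE INDEX "access_histories_accessedAt_idx"
        ON access_histories ("accessedAt");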

The query, covering 17 Sept 2021 through 01 Oct 2021:

    explain analyze select id from access_histories where "accessedAt" >= 1631903400 and "accessedAt" <= 1633112999;

Bitmap Heap Scan on access_histories  (cost=14655.35..144416.85 rows=715992 width=8) (actual time=238.837..21154.597 rows=715535 loops=1)
   Recheck Cond: (("accessedAt" >= 1631903400) AND ("accessedAt" <= 1633112999))
   Rows Removed by Index Recheck: 1716759
   Heap Blocks: exact=48015 lossy=33133
   Buffers: shared hit=3 read=82972 written=8
   I/O Timings: read=19860.378 write=4.389
   ->  Bitmap Index Scan on "access_histories_accessedAt_idx"  (cost=0.00..14476.35 rows=715992 width=0) (actual time=229.267..229.272 rows=715535 loops=1)
         Index Cond: (("accessedAt" >= 1631903400) AND ("accessedAt" <= 1633112999))
         Buffers: shared hit=3 read=1824
         I/O Timings: read=62.423
 Planning:
   Buffers: shared hit=20 read=1
   I/O Timings: read=0.012
 Planning Time: 1.168 ms
 Execution Time: 21198.731 ms

After dropping the index:

 Seq Scan on access_histories  (cost=0.00..155405.02 rows=715992 width=8) (actual time=0.035..10278.345 rows=715535 loops=1)
   Filter: (("accessedAt" >= 1631903400) AND ("accessedAt" <= 1633112999))
   Rows Removed by Filter: 4234466
   Buffers: shared hit=23814 read=57341
   I/O Timings: read=2083.283
 Planning:
   Buffers: shared hit=76 read=17
   I/O Timings: read=7.851
 Planning Time: 8.286 ms
 Execution Time: 10316.716 ms
(10 rows)

So my question is: why is Postgres opting for the bitmap index scan although the sequential scan is less expensive?

P.S. I know that Postgres tends to go for a seq scan once it has to read more than roughly 5–10% of the total data.

  • What percentage of the rows is your query selecting (on average)? – The Impaler Dec 12 '21 at 05:49
  • Try the index `access_histories (accessedAt, id)` (see the first sketch after these comments). – The Impaler Dec 12 '21 at 05:51
  • If you look at your 4 plans (pre-edit and current), we see that repeat executions of the same plan vary a lot in timing between them. Given such incoherent timing with one plan, I don't think it makes much sense to try to explain differences between plans. Who knows if the difference is even real and not just random variation? – jjanes Dec 12 '21 at 06:23
  • On average it varies with the time range selected. @The Impaler – Rishu Srivastav Dec 12 '21 at 08:03
  • @jjanes So my question is: why is Postgres opting for the index scan although sequential is less expensive? – Rishu Srivastav Dec 12 '21 at 08:10
  • There is no rule that says PostgreSQL will go for a sequential scan when it has to query 5 to 10 % of the data. This depends completely on your database configuration and your queries. – Frank Heikens Dec 12 '21 at 11:03
  • @FrankHeikens Check the link I mentioned in the question. – Rishu Srivastav Dec 13 '21 at 04:33
  • @RishuSrivastav: That answer doesn't mention the configuration at all and is not complete. Turn `enable_seqscan` off and you'll see how the query plans change. Change the random I/O cost, and you'll see again how the query plans change (see the second sketch after these comments). – Frank Heikens Dec 13 '21 at 07:23
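For reference, a sketch of the index The Impaler suggests (the index name here is made up). Because both "accessedAt" and id are in the index, Postgres can answer this query with an index-only scan and skip the heap entirely, as long as the visibility map is reasonably current:

    -- Hypothetical covering index: the query needs only "accessedAt" and id,
    -- so an index-only scan becomes possible.
    CREATE INDEX access_histories_accessedat_id_idx
        ON access_histories ("accessedAt", id);

    -- Keep the visibility map fresh so the index-only scan avoids heap fetches.
    VACUUM (ANALYZE) access_histories;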
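And a sketch of the experiments Frank Heikens describes; the values are illustrative, not recommendations, and both settings are session-local:

    -- Discourage seq scans for this session (the planner will avoid them
    -- whenever any alternative exists), then compare the plan.
    SET enable_seqscan = off;
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT id FROM access_histories
    WHERE "accessedAt" >= 1631903400 AND "accessedAt" <= 1633112999;
    RESET enable_seqscan;

    -- Lower the random I/O cost (the default of 4.0 models spinning disks;
    -- values around 1.1 are common for SSDs) and compare again.
    SET random_page_cost = 1.1;
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT id FROM access_histories
    WHERE "accessedAt" >= 1631903400 AND "accessedAt" <= 1633112999;
    RESET random_page_cost;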

1 Answer


Several things can work together here:

  1. work_mem is too low for the bitmap heap scan to be as effective as PostgreSQL thinks it should be (Heap Blocks: lossy=33133); see the sketch after this list.

    If work_mem is too small to contain a bitmap with one bit per table row, PostgreSQL will at some point degrade the bitmap to one bit per 8kB block. For such “lossy” blocks, all rows in the block have to be rechecked against the condition, which means extra work.

  2. During the second query, more data are cached (use EXPLAIN (ANALYZE, BUFFERS) to keep track of that)
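A quick way to test point 1 (this sketch is not from the answer; the 64MB value is a guess, to be sized against your data):

    -- Raise work_mem for this session so the bitmap can stay exact
    -- (one bit per row instead of one bit per 8kB block).
    SET work_mem = '64MB';  -- illustrative value

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT id FROM access_histories
    WHERE "accessedAt" >= 1631903400 AND "accessedAt" <= 1633112999;
    -- With enough work_mem, "Heap Blocks" should report only "exact" blocks
    -- and "Rows Removed by Index Recheck" should disappear. The BUFFERS
    -- output (shared hit vs. read) also shows how much point 2, caching,
    -- contributes across repeat runs.

    RESET work_mem;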

Laurenz Albe