Bitmap scan in Postgres 8.3 2x faster than index scan Postgres 9.4?

Question

Upgrading Postgres from 8.3.8 to 9.4.1 on new hardware. A representative set of queries shows that new system's performance ranges from 1x to 3x faster. However, one of our high-load areas is always slower.

`EXPLAIN` output

8.3.8:

Nested Loop  (cost=25.78..709859.61 rows=1 width=4) (actual time=14.972..190.591 rows=32 loops=1)
  ->  Bitmap Heap Scan on prime p  (cost=25.78..1626.92 rows=1066 width=4) (actual time=1.567..9.597 rows=10742 loops=1)
        Recheck Cond: ((pid = ANY ('{28226,53915,83421,82118397,95513866}'::integer[])) AND (tid = ANY ('{1,2,3}'::integer[])))
        Filter: (NOT deleted)
        ->  Bitmap Index Scan on FOO_IDX1  (cost=0.00..25.73 rows=1066 width=0) (actual time=1.144..1.144 rows=10742 loops=1)
              Index Cond: ((pid = ANY ('{28226,53915,83421,82118397,95513866}'::integer[])) AND (deleted = false) AND (tid = ANY ('{1,2,3}'::integer[])))
  ->  Index Scan using FOO_IDX2 on data d  (cost=0.00..663.88 rows=1 width=4) (actual time=0.017..0.017 rows=0 loops=10742)
        Index Cond: (d.pid = p.pid)
        Filter: (lower("substring"(d.value, 1, 1000)) ~~ '%something%'::text)
Total runtime: 190.639 ms

9.4.1:

Nested Loop  (cost=1.15..335959.94 rows=1 width=4) (actual time=24.712..365.057 rows=32 loops=1)
  ->  Index Scan using FOO_IDX1 on prime p  (cost=0.57..953.17 rows=1033 width=4) (actual time=0.048..13.884 rows=10741 loops=1)
        Index Cond: ((pid = ANY ('{28226,53915,83421,82118397,95513866}'::integer[])) AND (deleted = false) AND (tid = ANY ('{1,2,3}'::integer[])))
        Filter: (NOT deleted)
  ->  Index Scan using FOO_IDX2 on data d  (cost=0.57..324.29 rows=1 width=4) (actual time=0.032..0.032 rows=0 loops=10741)
        Index Cond: (pid = p.pid)
        Filter: (lower("substring"(value, 1, 1000)) ~~ '%something%'::text)
        Rows Removed by Filter: 11
Planning time: 0.940 ms
Execution time: 365.156 ms

Indexes

…btree (pid);
…btree (lower("substring"(value, 1, 1000)) text_pattern_ops, fid);
…btree (lower("substring"(value, 1, 1000)), fid);

Settings

Varying the ranges of the following did not improve this case…

checkpoint_completion_target = 0.5
checkpoint_segments = 32
checkpoint_timeout = 30min
cpu_index_tuple_cost = 0.005
cpu_operator_cost = 0.0025
cpu_tuple_cost = 0.01
default_statistics_target = 500 (evaluated 100 to 10000 analyse after each)
effective_cache_size = 288GB
enable_seqscan = off
from_collapse_limit = 8
geqo = off
join_collapse_limit = 8
random_page_cost = 1.0
seq_page_cost = 1.0
shared_buffers = 96GB
work_mem = 64MB

We see a similar result for something% too.

Before we drop this in for a few years I'd like to know if there is anything more I could do to optimise for these important cases.

Statement

SELECT p.pid
FROM prime p
    INNER JOIN data d ON p.pid = d.pid
WHERE LOWER(substring(d.value,1,1000)) LIKE '%something%'
    AND p.tid IN (1,2,3)
    AND p.deleted = FALSE
    AND p.ppid IN (28226, 53915, 83421, 82118397, 95513866)

Table definitions

Simplified and sanitized.

\d prime

    Column     |            Type             |                    Modifiers
---------------+-----------------------------+-------------------------------------------------
 pid           | integer                     | not null default nextval('prime_seq'::regclass)
 deleted       | boolean                     |
 ppid          | integer                     |
 tid           | integer                     |

\d data

     Column     |  Type   |                      Modifiers
----------------+---------+------------------------------------------------------
 pdid           | integer | not null default nextval('data_seq'::regclass)
 pid            | integer |
 value          | text    |

New test results

I've tried a range of default_statistics_target.

default_statistics_target = 100  @ 381 ms
default_statistics_target = 500  @ 387 ms
default_statistics_target = 1000 @ 384 ms
default_statistics_target = 5000 @ 369 ms

(analyze and warm-up between test cycles)

This value can make a substantial difference in other areas of our application. 500 seems ideal, 5000+ caused other areas to slow down 3x to 10x.

Our kit is designed such that the entire database should always be in memory.

random_page_cost =  1.0 @ 372 ms
random_page_cost =  1.1 @ 372 ms 
random_page_cost =  4.0 @ 370 ms 
random_page_cost = 10.0 @ 369 ms

With enable_bitmapscan = off @ 362 ms (results in the same plan as expected)

Earlier I also tried enable_indexscan = off @ 491 ms (triggered different plan of course)

Yes, pg 8.3's plan uses an index and a bitmap index scan - which I think is the "nut" of this issue.

Thanks for the link to the related article.

The advice on column order is very interesting.

At our scale and growing, what would the optimal field order be for the following schemas?
What is the most effective way to reorganize the column order on a loaded table to realize the benefit?

prime having:

integer
text
boolean
boolean
integer
integer
smallint
integer
timestamp without time zone
timestamp without time zone
timestamp without time zone
text

data having:

integer
integer
integer
text

SELECT pid
FROM data d
  JOIN prime p USING (pid)
WHERE LOWER(substring(d.value,1,1000)) LIKE '%something%'
  AND p.ppid IN (28226, 53915, 83421, 82118397, 95513866) 
  AND p.tid IN (1, 2, 3)
  AND p.deleted = FALSE;

No measurable difference observed in this approach, same plan (+/-5msec)
We generally try to reduce the scope of records searched in data first by using prime to check acl, status, etc. (prime is 1/10 the size)

lower(substring(d.value,1,1000)) = 355 ms
lower(left(d.value,1000))        = 343 ms (~3% faster over multiple tests, I'll take that!)

To handle the unanchored case we have a second index using the operator class "text_pattern_ops".

We've evaluated multi-column GIN indexes before, but didn't realize the expected benefits. Complex because A) multiple criteria to meet on acl, status, and similar, B) need to hit "exact phrases" which require a result phrase recheck. I'm optimistic on using a full-text approach long term, the recipes we've tried so far are not faster or more stable than the old-school BTREE approach; yet.

GIN trial 1

CREATE EXTENSION btree_gin
CREATE INDEX FOO_IDX3 ON data USING GIN (to_tsvector('simple', lower(left(value, 1000))), pid)
ANALYSE data

SELECT p.pid
FROM prime p
  INNER JOIN data d ON p.pid = d.pid
WHERE to_tsvector('simple', lower(left(d.value, 1000))) @@ to_tsquery('simple', 'something')
  AND p.tid IN (1,2,3)
  AND p.deleted = FALSE
  AND p.ppid IN (28226, 53915, 83421, 82118397, 95513866)

Execution time: 1034.866 ms (without phrase recheck)

GIN trial 2

CREATE EXTENSION pg_trgm 
CREATE INDEX FOO_IDX4 ON data USING gin (left(value,1000) gin_trgm_ops, pid);
ANALYSE data

SELECT p.pid
FROM prime p
  INNER JOIN data d ON p.pid = d.pid
WHERE left(d.value,1000) LIKE '%Something%'
  AND p.tid IN (1,2,3)
  AND p.deleted = FALSE
  AND p.ppid IN (28226, 53915, 83421, 82118397, 95513866)


Hash Join  (cost=2870.42..29050.89 rows=1 width=4) (actual time=668.333..2262.101 rows=32 loops=1)
  Hash Cond: (d.pid = p.pid)
  ->  Bitmap Heap Scan on data d  (cost=230.30..26250.04 rows=25716 width=4) (actual time=653.130..2234.736 rows=38659 loops=1)
        Recheck Cond: ("left"(value, 1000) ~~ '%Something%'::text)
        Rows Removed by Index Recheck: 146677
        Heap Blocks: exact=161810
        ->  Bitmap Index Scan on FOO_IDX4  (cost=0.00..223.87 rows=25716 width=0) (actual time=575.442..575.442 rows=185336 loops=1)
              Index Cond: ("left"(value, 1000) ~~ '%Something%'::text)
  ->  Hash  (cost=2604.33..2604.33 rows=2863 width=4) (actual time=15.158..15.158 rows=10741 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 378kB
        ->  Index Scan using FOO_IDX4 on prime p  (cost=0.57..2604.33 rows=2863 width=4) (actual time=0.064..11.737 rows=10741 loops=1)
              Index Cond: ((ppid = ANY ('{28226,53915,83421,82118397,95513866}'::integer[])) AND (deleted = false) AND (tid = ANY ('{1,2,3}'::integer[])))
              Filter: (NOT deleted)
Planning time: 1.861 ms
Execution time: 2262.210 ms

We already have an index on prime w/ "ppid, deleted, tid", sorry this was not clear originally.

I don't see your actual query in the question. (Relevant parts of the) table definitions would be helpful, too. Check the tag info of `[postgresql-performance]` for instructions. — Erwin Brandstetter, Mar 20 '15 at 05:10
Added. I'm not sure more index/constraint info be helpful in this case, so left it aside for now. — Nothrock, Mar 20 '15 at 20:29
@Erwin Thank you... I'm reviewing all your comments carefully and although we've done similar tests, I'll retest the specific cases and suggestions you described and report the results. Currently we have a pretty interesting case in-hand - a large stable system with the ability to systemically run A/B tests until this new kit gets released. — Nothrock, Mar 20 '15 at 22:53
The later changes `ppid` <-> `pid` make a difference. It may pay to append `ppid` to the index in my answer as last item to allow index-only scans. And adapt the query, of course. — Erwin Brandstetter, Mar 20 '15 at 23:21
@Erwin In response to your suggestions and my retests I've added results observed and comments. — Nothrock, Mar 22 '15 at 04:38

Erwin Brandstetter · Answer 1 · 2023-08-01T22:01:48.657

The most common reasons for bad query plans are column statistics or cost settings not representing reality well:

Keep PostgreSQL from sometimes choosing a bad query plan

A setting of random_page_cost = 1.0 only makes sense if random access is actually as fast as sequential access, which is only true if your DB resides in RAM entirely. A database with tables of 80M and 750M rows is probably too big for that. If my assumption is right, raising that cost setting a bit might fix the problem. Try at least 1.1, probably more. Run tests to find the sweet spot in your setup.

Normally I would first run:

SET enable_bitmapscan = off;

in the current session in pg 9.4 and then test again. The tricky part is that your query may need both: index- and a bitmap index scan. I would need to see the query.

Extremely low settings for random_page_cost favors index scans over bitmap index scans. If that cost setting is misleading, you get inferior query plans.

Answers to this related question on dba.SE have more explanation:

Pagination with PostgreSQL 9.3: counting number of pages

Table design

Your table design is simplified, but generally it's better not to place a boolen column *between integer columns, because you waste disk space to padding this way. Better:

pid           | integer  | not null default nextval('prime_seq'::regclass)
tid           | integer  | 
deleted       | boolean  |

Calculating and saving space in PostgreSQL

This is just a minor improvement, but it has no downside.

Query

Can be improved in multiple ways:

SELECT pid
FROM   data  d
JOIN   prime p USING (pid)
WHERE  left(d.value,1000) LIKE '%something%'
AND    p.pid IN (28226, 53915, 83421, 82118397, 95513866) 
AND    p.tid IN (1, 2, 3)
AND    p.deleted = false;

left(d.value,1000) is shorter and faster than substring(d.value,1,1000) (needs pg 9.1+).
A text_pattern_ops index is only useful for left-anchored pattern matching with LIKE. Your expression is not anchored. (I have seen you are using an anchored pattern, too.) For this, use a trigram GIN index, provided by the additional module pg_trgm that's hugely faster with big tables, especially in pg 9.4 (improved GIN indexes).
- PostgreSQL LIKE query performance variations
- Pattern matching with LIKE, SIMILAR TO or regular expressions

Indices

To include the integer column pid in the GIN index below, first install the additional module btree_gin, which provides the necessary GIN operator classes. Run once per database:

CREATE EXTENSION btree_gin;

Making some assumptions, this would be perfect for your query. A multicolumn trigram GIN index on data:

CREATE INDEX data_value_gin_trgm_idx ON data
USING gin (left(value,1000) gin_trgm_ops, pid);

And a partial multicolumn index on prime:

CREATE INDEX prime_pid_tip_idx ON prime (pid, tip)
WHERE  deleted = false;

Talking order of magnitude here.

Nothrock · Answer 2 · 2015-03-24T09:06:53.427

A slightly different approach to first produce a maximum possible result set from the "data table" via a Common Table Expression (CTE), then back onto prime for refinement by acl, status etc, reduced time from 365 msec to 142 msec (saving 223 msec). This technique appears to be faster than the 8.3 baseline.

WITH d as (SELECT pid
FROM data
WHERE LOWER(left(value,1000)) LIKE '%something%'
AND fid IN (nnn,nnn,...))
SELECT p.pid FROM d INNER JOIN prime p on p.pid = d.pid
WHERE p.tid IN (1,2,3)
AND p.deleted = FALSE
AND p.ppid IN (28226,53915,83421,82118397,95513866)

Planning time: 1.417 ms
Execution time: 141.508 ms

I'll further evaluate CTE for unexpected implications.