
I have a table playground with a column val, and val is indexed.

I have a list of ranges [(min1, max1), (min2, max2), ... , (minN, maxN)] and I want to select all rows with val that fit in any of those ranges.

E.g. my ranges look like this: [(1,5), (20,25), (200,400)]. Here is the simple query that extracts the corresponding rows:

select p.*
from playground p
where (val between 1 AND 5) or (val between 20 and 25) or
    (val between 200 and 400);

The problem here is that this list of ranges is dynamic, my application generates it and sends it along with the query to postgres.

I tried to rewrite the query to accept a dynamic list of ranges:

select p.*
from playground p,
    unnest(ARRAY [(1, 5),(20, 25),(200, 400)]) as r(min_val INT, max_val INT)
where p.val between r.min_val and r.max_val;
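For context, this is roughly how my application builds that query. A minimal sketch in Python (function name hypothetical; psycopg2-style %s placeholders assumed), keeping the range values as bind parameters instead of interpolating them into the SQL text:

```python
def build_range_query(ranges):
    """Build the unnest-based query plus its flat parameter list.

    Each (min, max) pair becomes one ROW constructor inside the ARRAY,
    with both values passed as bind parameters.
    """
    if not ranges:
        raise ValueError("need at least one range")
    # One "(%s::int, %s::int)" ROW constructor per range.
    rows = ", ".join("(%s::int, %s::int)" for _ in ranges)
    sql = (
        "SELECT p.* FROM playground p, "
        f"unnest(ARRAY[{rows}]) AS r(min_val int, max_val int) "
        "WHERE p.val BETWEEN r.min_val AND r.max_val"
    )
    # Flatten [(1, 5), (20, 25)] into [1, 5, 20, 25] for the driver.
    params = [v for rng in ranges for v in rng]
    return sql, params

sql, params = build_range_query([(1, 5), (20, 25), (200, 400)])
# params == [1, 5, 20, 25, 200, 400]
```

The SQL text then stays structurally the same no matter how many ranges the application generates.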

It extracts the same rows, but I don't know whether it is an effective query or not.

This is how the explain output looks for the first query:

Bitmap Heap Scan on playground p  (cost=12.43..16.45 rows=1 width=36) (actual time=0.017..0.018 rows=4 loops=1)
  Recheck Cond: (((val >= 1) AND (val <= 5)) OR ((val >= 20) AND (val <= 25)) OR ((val >= 200) AND (val <= 400)))
  Heap Blocks: exact=1
  ->  BitmapOr  (cost=12.43..12.43 rows=1 width=0) (actual time=0.012..0.012 rows=0 loops=1)
        ->  Bitmap Index Scan on playground_val_index  (cost=0.00..4.14 rows=1 width=0) (actual time=0.010..0.010 rows=3 loops=1)
              Index Cond: ((val >= 1) AND (val <= 5))
        ->  Bitmap Index Scan on playground_val_index  (cost=0.00..4.14 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=1)
              Index Cond: ((val >= 20) AND (val <= 25))
        ->  Bitmap Index Scan on playground_val_index  (cost=0.00..4.14 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
              Index Cond: ((val >= 200) AND (val <= 400))
Planning Time: 0.071 ms
Execution Time: 0.057 ms

And here is the explain for the second:

Nested Loop  (cost=0.14..12.52 rows=2 width=36) (actual time=0.033..0.065 rows=4 loops=1)
  ->  Function Scan on unnest r  (cost=0.00..0.03 rows=3 width=8) (actual time=0.011..0.012 rows=3 loops=1)
  ->  Index Scan using playground_val_index on playground p  (cost=0.13..4.15 rows=1 width=36) (actual time=0.008..0.015 rows=1 loops=3)
        Index Cond: ((val >= r.min_val) AND (val <= r.max_val))
Planning Time: 0.148 ms
Execution Time: 0.714 ms

NOTE: In both cases I did set enable_seqscan = false; to make the index work.

I am worried about the "Nested Loop" stage. Is it okay? Or are there more effective ways to pass a dynamic list of ranges into a query? My Postgres version is 12.1.

  • The nested loop is to go through the values in the `VALUES()`. I don't think there is an alternative. – Gordon Linoff Mar 23 '20 at 18:12
  • There *may* be potential to optimize. Depends on undisclosed details. Consider instructions here: https://stackoverflow.com/tags/postgresql-performance/info. Also, you lead with `I want to select all rows ... `, but that's not what your query does: it selects a *single column* allowing the best case scenario of an index-only scan. You need to be precise when trying to optimize ... Also important: will ranges in your passed array ever overlap? If so, I assume you want each match only *once*? If not, more ways to optimize ... – Erwin Brandstetter Mar 23 '20 at 18:26
  • @ErwinBrandstetter Thanks for pointing that out, I updated the post. You are right, I need the whole row, not just my indexed column, so I updated my query to `select p.*` and added the new analyzer outputs. Also, ranges won't overlap; my application will ensure it. – Georgy Savva Mar 24 '20 at 10:22

1 Answer


You added more information, but much more would still be relevant: exact table and index definitions, cardinality, data distribution, row size statistics, number of ranges in the predicate, purpose of the table, write patterns, ... Performance optimization needs all the input it can get.

Shot in the dark: with non-overlapping ranges, a UNION ALL query may deliver best performance:

SELECT * FROM playground WHERE val BETWEEN   1 AND   5
UNION ALL
SELECT * FROM playground WHERE val BETWEEN  20 AND  25
UNION ALL
SELECT * FROM playground WHERE val BETWEEN 200 AND 400;

We know that the ranges don't overlap, but Postgres doesn't, so it has to do extra work in your attempts. This query should avoid both the BitmapOr of the first plan and the Nested Loop of the second: it just fetches each range and appends it to the output. It should result in a plan like:

Append  (cost=0.13..24.50 rows=3 width=40)
  ->  Index Scan using playground_val_idx on playground  (cost=0.13..8.15 rows=1 width=40)
        Index Cond: ((val >= 1) AND (val <= 5))
  ->  Index Scan using playground_val_idx on playground playground_1  (cost=0.13..8.15 rows=1 width=40)
        Index Cond: ((val >= 20) AND (val <= 25))
  ->  Index Scan using playground_val_idx on playground playground_2  (cost=0.13..8.15 rows=1 width=40)
        Index Cond: ((val >= 200) AND (val <= 400))

Plus, each sub-SELECT will be based on actual statistics for the given range, not generic estimates, even for a longer list of ranges.

You can generate the query in your client or write a server-side function to generate and execute dynamic SQL (applicable as the result type is known).
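For the client-side option, a minimal sketch (function name hypothetical; %s placeholders assumed, as used by e.g. psycopg2) that generates one UNION ALL branch per range with the bounds as bind parameters:

```python
def build_union_all(ranges):
    """Generate one UNION ALL branch per (min, max) range.

    Only the number of branches varies with the input; the values
    themselves are passed as bind parameters.
    """
    if not ranges:
        raise ValueError("need at least one range")
    branch = "SELECT * FROM playground WHERE val BETWEEN %s AND %s"
    sql = "\nUNION ALL\n".join(branch for _ in ranges)
    params = [v for rng in ranges for v in rng]
    return sql, params

sql, params = build_union_all([(1, 5), (20, 25), (200, 400)])
# params == [1, 5, 20, 25, 200, 400]
```

Even for hundreds of ranges, the generated statement stays simple and repetitive, which Postgres handles without trouble.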

You might even test a server-side function using a LOOP (which is often less efficient, but this may be an exception):

CREATE OR REPLACE FUNCTION foo(_ranges int[])
  RETURNS SETOF playground LANGUAGE plpgsql PARALLEL SAFE STABLE AS
$func$
DECLARE
   _range   int[];
BEGIN
   FOREACH _range SLICE 1 IN ARRAY _ranges
   LOOP
      RETURN QUERY
      SELECT * FROM playground WHERE val BETWEEN _range[1] AND _range[2];
   END LOOP;
END
$func$;

The overhead may not pay off for only a few ranges per call. But it is very convenient to call, if nothing else:

SELECT * FROM foo('{{1,5},{20,25},{200,400}}');
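If the application builds that argument, rendering the list of pairs as a Postgres 2-D array literal is a one-liner. A sketch (function name hypothetical):

```python
def array_literal(ranges):
    """Render [(min, max), ...] as a Postgres 2-D int array literal."""
    return "{" + ",".join("{%d,%d}" % (lo, hi) for lo, hi in ranges) + "}"

array_literal([(1, 5), (20, 25), (200, 400)])
# → '{{1,5},{20,25},{200,400}}'
```

Most drivers can also pass a nested list directly as an int[][] parameter, which avoids string rendering altogether.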

db<>fiddle here

Physical order of rows may help a lot. If matching rows are stored close together, (much) fewer data pages need to be processed. That depends on undisclosed details. The built-in CLUSTER, or the extensions pg_repack or pg_squeeze, may help with that.

And it's recommended to use the latest available minor release for whatever major version is in use. That would be 12.2 at the time of writing (released 2020-02-13).

  • Thanks for such a complete answer. The number of rows in the table is 1M+; the number of rows that fit into the ranges is 1K; result rows are not sequential. I can't provide any more details about my case, because it's a hypothetical question about a better query structure that I might be unaware of. UNION ALL looks good to me, but I wonder: what if I need to query 300+ ranges? My app would generate a 300+ branch SELECT statement; the same concern applies to generated "between" clauses. Or is a query of such length totally fine for Postgres? – Georgy Savva Mar 25 '20 at 07:27
  • 1
  • @GeorgySavva: 1K rows per range will likely result in bitmap index scans. Physically sorted rows would make a *huge* difference. A query of such length is no problem for Postgres. Please test my suggestions (incl. the function) and report results. – Erwin Brandstetter Mar 25 '20 at 11:40
  • Thanks for your help. I will go with auto-generated `between` statements. – Georgy Savva Apr 02 '20 at 08:06
  • @GeorgySavva: any test results? – Erwin Brandstetter Apr 02 '20 at 14:10
  • No, I decided not to run any benchmarks for now. I will measure it on production when it is ready. Thanks. – Georgy Savva Apr 02 '20 at 14:14