
The following query takes more than 7 minutes in PostgreSQL 11.1:

SELECT 
    '2019-01-19' as date, 
    '2019-01-19'::date - generate_series(first_observed, last_observed, interval '1 day')::date as days_to_date, 
    ROUND(AVG(price)) as price,
    area_id
FROM 
    table_example
GROUP BY 
    days_to_date, area_id;

table_example has around 15 million rows.
Are there any ways to optimize it? I have already added the following indexes:

CREATE INDEX ON table_example (first_observed, last_observed);
CREATE INDEX ON table_example (area_id);

This is the output of EXPLAIN (ANALYZE, BUFFERS):

GroupAggregate  (cost=3235559683.68..3377398628.68 rows=1418000 width=72) (actual time=334933.966..440096.869 rows=21688 loops=1)
  Group Key: (('2019-01-19'::date - ((generate_series((first_observed)::timestamp with time zone, (last_observed)::timestamp with time zone, '1 day'::interval)))::date)), area_id
  Buffers: local read=118167 dirtied=118167 written=117143, temp read=1634631 written=1635058
  ->  Sort  (cost=3235559683.68..3271009671.18 rows=14179995000 width=40) (actual time=334923.933..391690.184 rows=380203171 loops=1)
        Sort Key: (('2019-01-19'::date - ((generate_series((first_observed)::timestamp with time zone, (last_observed)::timestamp with time zone, '1 day'::interval)))::date)), area_id
        Sort Method: external merge  Disk: 9187584kB
        Buffers: local read=118167 dirtied=118167 written=117143, temp read=1634631 written=1635058
        ->  Result  (cost=0.00..390387079.39 rows=14179995000 width=40) (actual time=214.798..171717.941 rows=380203171 loops=1)
              Buffers: local read=118167 dirtied=118167 written=117143
              ->  ProjectSet  (cost=0.00..71337191.89 rows=14179995000 width=44) (actual time=214.796..102823.749 rows=380203171 loops=1)
                    Buffers: local read=118167 dirtied=118167 written=117143
                    ->  Seq Scan on table_example  (cost=0.00..259966.95 rows=14179995 width=44) (actual time=0.031..2449.511 rows=14179995 loops=1)
                          Buffers: local read=118167 dirtied=118167 written=117143
Planning Time: 0.409 ms
JIT:
  Functions: 18
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 5.034 ms, Inlining 13.010 ms, Optimization 121.440 ms, Emission 79.996 ms, Total 219.480 ms
Execution Time: 441133.410 ms

This is what table_example looks like:

column name        data type
house_pk           integer
date_in            date
first_observed     date
last_observed      date
price              numeric
area_id            integer

There are 60 distinct area_ids.

The query is being run on a 24-core machine with 128 GB of memory. It is possible that the settings are not optimal, however.

  • That's a lot of data -- and you have no filters in the query. I doubt you'll be able to get much improvement. – Gordon Linoff Jan 17 '19 at 18:46
  • Table definition would help - `CREATE TABLE` statement with data types and constraints. Plus output of `EXPLAIN (ANALYZE, BUFFERS)`. See: https://stackoverflow.com/tags/postgresql-performance/info Do you need results for all days and all areas at once? Do you have an `area` table with 1 row per relevant `area_id`? How many distinct `area_id`? How many days between `first_observed` & `last_observed` - min/max/avg? – Erwin Brandstetter Jan 18 '19 at 00:07
  • @ErwinBrandstetter 1) I have now added info on `EXPLAIN (ANALYZE, BUFFERS)`. 2) I do need results for all days at once. I could do multiple queries for areas but that would require doing 60 queries instead of one. 3) I'm not sure what you mean. I do have a table containing the area_ids with one row per area_id. Is that what you mean? 4) 60 area_ids, 5) min = 0, max=335, avg=25.8 – Wessi Jan 18 '19 at 09:05
  • A complete `CREATE TABLE` statement is the way to disclose information about your table. Example: https://stackoverflow.com/questions/9789736/how-to-implement-a-many-to-many-relationship-in-postgresql/9790225#9790225 Various clients (incl. pgAdmin3 & pgAdmin4) display the reverse engineered statement. – Erwin Brandstetter Jan 18 '19 at 17:33

1 Answer


While processing the whole table, indexes are typically useless (with the possible exception of an index-only scan if table rows are much wider than the index).
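
For illustration only (not part of the question): such an index-only scan would require a covering index carrying every referenced column, which with rows as narrow as these is most of the row anyway, so it is unlikely to pay off. A sketch of what it would have to look like on PostgreSQL 11:

CREATE INDEX ON table_example (area_id, first_observed, last_observed) INCLUDE (price);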

And while processing the whole table, I don't see much room for performance optimization of the query itself. One minor thing:

SELECT d.the_date
     , generate_series(d.the_date - last_observed
                     , d.the_date - first_observed) AS days_to_date
     , round(avg(price)) AS price
     , area_id
FROM   table_example
     , (SELECT date '2019-01-19') AS d(the_date)
GROUP  BY days_to_date, area_id;

Assuming first_observed & last_observed are date NOT NULL and always < date '2019-01-19'. Else you need to cast / do more.

This way, you have only two subtractions and then generate_series() works with integers (fastest).
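
A quick demonstration of what the rewrite relies on (just an illustration with made-up dates): subtracting one date from another yields an integer number of days, and the two-argument form of generate_series() then steps over plain integers:

SELECT date '2019-01-19' - date '2019-01-15' AS days;   -- 4 (integer)
SELECT generate_series(0, 4) AS days_to_date;            -- 0, 1, 2, 3, 4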

The added mini-subquery is just for convenience, to only provide the date once. In a prepared statement or function, you can use a parameter and don't need this:

     , (SELECT date '2019-01-19') AS d(the_date)
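
A sketch of what that could look like as a prepared statement (the statement name daily_avg is made up for this example):

PREPARE daily_avg(date) AS
SELECT $1 AS the_date
     , generate_series($1 - last_observed
                     , $1 - first_observed) AS days_to_date
     , round(avg(price)) AS price
     , area_id
FROM   table_example
GROUP  BY days_to_date, area_id;

EXECUTE daily_avg('2019-01-19');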

Other than that, if EXPLAIN (ANALYZE, BUFFERS) mentions "Disk" (example: Sort Method: external merge Disk: 3240kB), then a (temporary) higher setting for work_mem should help. Your plan shows the sort spilling roughly 9 GB to disk (Sort Method: external merge  Disk: 9187584kB).
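
A sketch of how to raise it - the 2GB value is only a placeholder, the right number depends on available RAM and how many sessions run such queries concurrently:

-- session level, reverts when the session ends
SET work_mem = '2GB';

-- or scoped to a single transaction
BEGIN;
SET LOCAL work_mem = '2GB';
-- ... run the reporting query here ...
COMMIT;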

If you can't afford more RAM and the aggregate and/or sort steps still spill to disk, it might help to divide & conquer with a LATERAL join that processes one area at a time:

SELECT d.the_date, f.*, a.area_id
FROM   area a
     , (SELECT date '2019-01-19') AS d(the_date)
     , LATERAL (
   SELECT generate_series(d.the_date - last_observed
                        , d.the_date - first_observed) AS days_to_date
        , round(avg(price)) AS price
   FROM   table_example
   WHERE  area_id = a.area_id
   GROUP  BY 1
   ) f;

Assuming a table area, obviously.
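
A minimal sketch of that assumed area table (per the comments above there is one row per area_id; any further columns are up to your schema):

CREATE TABLE area (
    area_id integer PRIMARY KEY
    -- other per-area attributes ...
);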

  • Amazing difference. From 6 minutes to 8 seconds. For others reading this answer, please note that the results of the two ways of using generate_series are not identical. The fast way only creates a row where there is a change in price. For my use case that is actually better, but there might be cases where you want a row for every day in the series. – Wessi Jan 18 '19 at 09:49
  • On a second look I realize that `first_observed` & `last_observed` need to be switched in my two queries if `last_observed` is expected to be *later* than `first_observed`. So I am confused why this happens to work for you. There are *3* queries here (2 in my answer, 1 in the question), which one takes 6 min and which one 8 sec? (I see 7.4 min in your question.) And: `where there is a change in price` - what's the meaning of this? – Erwin Brandstetter Jan 18 '19 at 17:43
  • You are right. It worked in one particular edge case, but does not generally work. They need to be flipped and most of the performance gain disappears. My comments were not about the query using LATERAL. – Wessi Jan 20 '19 at 14:54
  • So more like 7.4 to 4.9 minutes, which is still a big gain. – Wessi Jan 20 '19 at 15:09
  • @Wessi: I flipped the dates to fix the query accordingly. And yes, still an amazing gain for a little optimization magic. To make it faster yet, consider what I wrote about `work_mem` - it may not change much (depending on your hardware), but it's typically considerably faster to avoid spilling to disk. – Erwin Brandstetter Jan 21 '19 at 01:51
  • yes, it's quite amazing. Thanks a lot. I will try to see if it can be even faster with higher `work_mem`. – Wessi Jan 21 '19 at 08:19