This is related to two other questions I posted (it sounds like I should post this as a new question) - the feedback helped, but I think the same issue will come back the next time I need to insert data. Things were still running slowly, which forced me to temporarily remove some of the older data so that only two months' worth remained in the table I'm querying.

Indexing strategy for different combinations of WHERE clauses incl. text patterns

How to get date_part query to hit index?

Giving further detail this time - hopefully it will help pinpoint the issue:

  • PG version 10.7 (running on Heroku)
  • Total DB size: 18.4 GB (this is 2 months' worth of data, and it will grow at approximately the same rate each month)
  • 15 GB RAM
  • Total available storage: 512 GB
  • The largest table (the one the slowest query acts on) is 9.6 GB - the largest chunk of the total DB - about 10 million records

Schema of the largest table:

-- Table Definition ----------------------------------------------

CREATE TABLE reportimpression (
    datelocal timestamp without time zone,
    devicename text,
    network text,
    sitecode text,
    advertisername text,
    mediafilename text,
    gender text,
    agegroup text,
    views integer,
    impressions integer,
    dwelltime numeric
);

-- Indices -------------------------------------------------------

CREATE INDEX reportimpression_feb2019_index ON reportimpression(datelocal timestamp_ops) WHERE datelocal >= '2019-02-01 00:00:00'::timestamp without time zone AND datelocal < '2019-03-01 00:00:00'::timestamp without time zone;
CREATE INDEX reportimpression_mar2019_index ON reportimpression(datelocal timestamp_ops) WHERE datelocal >= '2019-03-01 00:00:00'::timestamp without time zone AND datelocal < '2019-04-01 00:00:00'::timestamp without time zone;
CREATE INDEX reportimpression_jan2019_index ON reportimpression(datelocal timestamp_ops) WHERE datelocal >= '2019-01-01 00:00:00'::timestamp without time zone AND datelocal < '2019-02-01 00:00:00'::timestamp without time zone;

Slow query:

SELECT
    date_part('hour', datelocal) AS hour,
    SUM(CASE WHEN gender = 'male' THEN views ELSE 0 END) AS male,
    SUM(CASE WHEN gender = 'female' THEN views ELSE 0 END) AS female
FROM reportimpression
WHERE
    datelocal >= '2019-03-01' AND
    datelocal < '2019-04-01'
GROUP BY date_part('hour', datelocal)
ORDER BY date_part('hour', datelocal);

The date range in this query will generally be a whole month (it accepts user input from a web-based report) - as you can see, I tried creating an index for each month's worth of data. That helped, but as far as I can tell, unless the query has been run recently (putting the results into the cache), it can still take up to a minute to run.

Explain analyze results:

Finalize GroupAggregate  (cost=1035890.38..1035897.86 rows=1361 width=24) (actual time=3536.089..3536.108 rows=24 loops=1)
  Group Key: (date_part('hour'::text, datelocal))
  ->  Sort  (cost=1035890.38..1035891.06 rows=1361 width=24) (actual time=3536.083..3536.087 rows=48 loops=1)
        Sort Key: (date_part('hour'::text, datelocal))
        Sort Method: quicksort  Memory: 28kB
        ->  Gather  (cost=1035735.34..1035876.21 rows=1361 width=24) (actual time=3535.926..3579.818 rows=48 loops=1)
              Workers Planned: 1
              Workers Launched: 1
              ->  Partial HashAggregate  (cost=1034735.34..1034740.11 rows=1361 width=24) (actual time=3532.917..3532.933 rows=24 loops=2)
                    Group Key: date_part('hour'::text, datelocal)
                    ->  Parallel Index Scan using reportimpression_mar2019_index on reportimpression  (cost=0.09..1026482.42 rows=3301168 width=17) (actual time=0.045..2132.174 rows=2801158 loops=2)
Planning time: 0.517 ms
Execution time: 3579.965 ms

I wouldn't think 10 million records would be too much to handle, especially given that I recently bumped up the PG plan I'm on to throw more resources at it, so I assume the issue is still either my indexes or my queries not being very efficient.

dgwebb
  • Like I said, not the main issue. Still better in every respect. The main issue here is `rows=2801158`. Aggregating close to 3M rows won't be fast. But it can be substantially faster yet. To optimize indexing strategies, we need to know the *range* of possible queries, not just an example query, which may well misguide. Do you always aggregate by the hour? Always sum `views`? Always divided by gender? – Erwin Brandstetter Apr 05 '19 at 19:03
  • For this query, it needs to be by hour. For example, the result set that I need in the report should spit out 24 records (one for each hour), with SUM total of views for each gender in each hour. – dgwebb Apr 05 '19 at 21:27
  • So you want to optimize your DB for ***this query***? Did I get that right? Then your best course of action is bullet point 1 in Laurenz's answer. A materialized view should cover that perfectly. You may still want to optimize the underlying query, but that's less important then, and a tailored index for the purpose may not pay off. – Erwin Brandstetter Apr 05 '19 at 21:40

2 Answers

A materialized view is the way to go for what you outlined. Querying past months of read-only data works without refreshing it. You may want to special-case the current month if you need to cover that, too.
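A minimal sketch of what that could look like - pre-aggregating per month, hour, and gender. The view name and exact column list are just illustrative assumptions; shape it to the reports you actually run:

CREATE MATERIALIZED VIEW reportimpression_hourly AS
SELECT date_trunc('month', datelocal) AS month
     , date_part('hour', datelocal)   AS hour
     , SUM(views) FILTER (WHERE gender = 'male')   AS male
     , SUM(views) FILTER (WHERE gender = 'female') AS female
FROM   reportimpression
GROUP  BY 1, 2;

-- after loading new rows (e.g. for the current month):
REFRESH MATERIALIZED VIEW reportimpression_hourly;

The report then reads 24 pre-aggregated rows per month instead of aggregating ~3M raw rows on every request.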

The underlying query can still benefit from an index, and there are two directions you might take:

First off, partial indexes like the ones you have now won't buy you much in this scenario; they're not worth it. If you collect many more months of data and mostly query (and add / drop rows) by month, table partitioning might be an idea - then your indexes are partitioned automatically, too. I'd consider Postgres 11 or even the upcoming Postgres 12 for this, though.
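For illustration, a minimal sketch of declarative range partitioning by month, assuming Postgres 11 or later (all object names here are hypothetical):

CREATE TABLE reportimpression_part (
    LIKE reportimpression INCLUDING DEFAULTS
) PARTITION BY RANGE (datelocal);

CREATE TABLE reportimpression_2019_03 PARTITION OF reportimpression_part
    FOR VALUES FROM ('2019-03-01') TO ('2019-04-01');

-- in Postgres 11+, an index on the partitioned table cascades to every partition:
CREATE INDEX ON reportimpression_part (datelocal);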

If your rows are wide, create an index that allows index-only scans. Like:

CREATE INDEX reportimpression_covering_idx ON reportimpression(datelocal, views, gender);

Or INCLUDE additional columns in Postgres 11 or later:

CREATE INDEX reportimpression_covering_idx ON reportimpression(datelocal) INCLUDE (views, gender);

Else, if your rows are physically sorted by datelocal, consider a BRIN index. It's extremely small and probably about as fast as a B-tree index for your case. (But being so small, it will stay cached much more easily and not push other data out as much.)

CREATE INDEX reportimpression_brin_idx ON reportimpression USING BRIN (datelocal);

You may be interested in CLUSTER or pg_repack to physically sort table rows. pg_repack can do it without an exclusive lock on the table, and even without a b-tree index (which CLUSTER requires). But it's an additional module, not shipped with the standard distribution of Postgres.
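For example, using the b-tree index suggested above (CLUSTER cannot use a BRIN index):

CLUSTER reportimpression USING reportimpression_covering_idx;
ANALYZE reportimpression;

-- pg_repack can instead rewrite the table ordered by a column, without the long exclusive lock:
-- pg_repack --table=reportimpression --order-by=datelocal your_db_name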

Erwin Brandstetter
  • "physically sorted" - meaning that as long as I'm inserting all of the records in order by date, then a BRIN index should be much more performant? Also - after I make any of these changes, is there a way to ensure that the index is up to date so that when I run analyze again, that I'm seeing real results? Do I need to run vacuum analyze after creating those indexes? – dgwebb Apr 05 '19 at 22:04
  • @dgwebb: Yes, BRIN (block range) indexes operate on whole ranges of blocks (data pages) instead of individual rows. Should work marvellously for your setup as long as rows are physically sorted. If you insert sorted by date and then don't update, that should work flawlessly. Else, consider `CLUSTER` after upsetting physical order (followed by `VACUUM ANALYZE`). – Erwin Brandstetter Apr 05 '19 at 22:11
  • Thanks! Will try out all of the above to see what works out the best. Cheers. – dgwebb Apr 05 '19 at 22:12
  • I made some optimizations based on your feedback, and am now testing out index performance on the datelocal column. I've gotten very good results even with the plain btree after using CLUSTER, but it looks like a BRIN would be much more appropriate. I just noticed that CLUSTER can't be used on a BRIN (not sure why that is), but what's the best solution to re-sort a table if you can't use CLUSTER? – dgwebb Apr 19 '19 at 22:16
  • `CLUSTER` is the best option in factory Postgres. [`pg_repack`](https://github.com/reorg/pg_repack) is possibly the better alternative, but it has to be installed as an additional module; it is not shipped with the standard distribution. It can order by specified columns instead of an index. Else, consider: https://dba.stackexchange.com/a/62970/3684. Please ask new questions as new *questions*. Comments are not the place. – Erwin Brandstetter Apr 19 '19 at 22:52

Your execution plan seems to be doing the right thing.

Things you can do to improve, in descending order of effectiveness:

  • Use a materialized view that pre-aggregates the data

  • Don't use a hosted database; use your own iron with good local storage and lots of RAM.

  • Use only one index instead of several partial ones. This is not primarily performance advice (the query will probably not be measurably slower unless you have a lot of indexes), but it will ease the management burden (see the sketch below).
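A sketch of that consolidation, using the index names from the question (the name of the replacement index is just an assumption):

DROP INDEX reportimpression_jan2019_index;
DROP INDEX reportimpression_feb2019_index;
DROP INDEX reportimpression_mar2019_index;

CREATE INDEX reportimpression_datelocal_idx ON reportimpression (datelocal);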

Laurenz Albe
  • I'm not super knowledgeable about databases, but it was my understanding that views still execute the underlying queries (and thus don't offer a performance bump)? Is that incorrect? Also, in my case, hosting my own isn't really an option (I don't have the expertise or time to do it). For the index - I had initially had only one index on this table, on the datelocal column. Should I switch back to that? – dgwebb Apr 05 '19 at 21:23
  • 1
    @dgwebb: A **[materialized view](https://www.postgresql.org/docs/current/rules-materializedviews.html)** does .. well .. *materialize* results - as opposed to plain views. – Erwin Brandstetter Apr 05 '19 at 21:43
  • Interesting! That's a completely new one on me. Will try it out. Thanks to you both. – dgwebb Apr 05 '19 at 21:45
  • If I'm reading the docs right, it looks like REFRESH MATERIALIZED VIEW would need to be run any time an INSERT or UPDATE is performed? – dgwebb Apr 05 '19 at 21:48
  • @dgwebb: only if you want to update it. Querying past months of read-only data would work without update. You may want to special-case the current month if you need to cover that, too. – Erwin Brandstetter Apr 05 '19 at 21:49