3

I have yet to be able to get this query to hit an index instead of performing a full scan - I have another query that uses date_part('day', datelocal) against an almost identical table (that table just has a bit less data but same structure) and that one will hit the index I created on the datelocal column (which is a timestamp without timezone). Query (this one performs a parallel seq scan on the table and does a memory quicksort):

SELECT
    date_part('hour', datelocal) AS hour,
    SUM(CASE WHEN gender LIKE 'male' THEN views ELSE 0 END) AS male,
    SUM(CASE WHEN gender LIKE 'female' THEN views ELSE 0 END) AS female
FROM reportimpression
WHERE datelocal >= '2-1-2019' AND datelocal < '2-28-2019'
GROUP BY date_part('hour', datelocal)
ORDER BY date_part('hour', datelocal)

Here is the other one that does hit my datelocal index:

SELECT
    date_part('day', datelocal) AS day,
    SUM(CASE WHEN gender LIKE 'male' THEN views ELSE 0 END) AS male,
    SUM(CASE WHEN gender LIKE 'female' THEN views ELSE 0 END) AS female
FROM reportimpressionday
WHERE datelocal >= '2-1-2019' AND datelocal < '2-28-2019'
GROUP BY date_trunc('day', datelocal), date_part('day', datelocal)
ORDER BY date_trunc('day', datelocal)

Banging my head about this! Any ideas as to how I can speed up the first one or at least make it hit an index? I've tried creating an index on the datelocal field, a compound index on datelocal, gender, and views, and an expression index on date_part('hour', datelocal) but none of that has worked.

Schemas:

-- Table Definition ----------------------------------------------

CREATE TABLE reportimpression (
    datelocal timestamp without time zone,
    devicename text,
    network text,
    sitecode text,
    advertisername text,
    mediafilename text,
    gender text,
    agegroup text,
    views integer,
    impressions integer,
    dwelltime numeric
);

-- Indices -------------------------------------------------------

CREATE INDEX reportimpression_datelocal_index ON reportimpression(datelocal timestamp_ops);
CREATE INDEX reportimpression_viewership_index ON reportimpression(datelocal timestamp_ops,views int4_ops,impressions int4_ops,gender text_ops,agegroup text_ops);
CREATE INDEX reportimpression_test_index ON reportimpression(datelocal timestamp_ops,(date_part('hour'::text, datelocal)) float8_ops);
-- Table Definition ----------------------------------------------

CREATE TABLE reportimpressionday (
    datelocal timestamp without time zone,
    devicename text,
    network text,
    sitecode text,
    advertisername text,
    mediafilename text,
    gender text,
    agegroup text,
    views integer,
    impressions integer,
    dwelltime numeric
);

-- Indices -------------------------------------------------------

CREATE INDEX reportimpressionday_datelocal_index ON reportimpressionday(datelocal timestamp_ops);
CREATE INDEX reportimpressionday_detail_index ON reportimpressionday(datelocal timestamp_ops,views int4_ops,impressions int4_ops,gender text_ops,agegroup text_ops);

Explain (analyze, buffers) output:

Finalize GroupAggregate  (cost=999842.42..999859.67 rows=3137 width=24) (actual time=43754.700..43754.714 rows=24 loops=1)
  Group Key: (date_part('hour'::text, datelocal))
  Buffers: shared hit=123912 read=823290
  I/O Timings: read=81228.280
  ->  Sort  (cost=999842.42..999843.99 rows=3137 width=24) (actual time=43754.695..43754.698 rows=48 loops=1)
        Sort Key: (date_part('hour'::text, datelocal))
        Sort Method: quicksort  Memory: 28kB
        Buffers: shared hit=123912 read=823290
        I/O Timings: read=81228.280
        ->  Gather  (cost=999481.30..999805.98 rows=3137 width=24) (actual time=43754.520..43777.558 rows=48 loops=1)
              Workers Planned: 1
              Workers Launched: 1
              Buffers: shared hit=123912 read=823290
              I/O Timings: read=81228.280
              ->  Partial HashAggregate  (cost=998481.30..998492.28 rows=3137 width=24) (actual time=43751.649..43751.672 rows=24 loops=2)
                    Group Key: date_part('hour'::text, datelocal)
                    Buffers: shared hit=123912 read=823290
                    I/O Timings: read=81228.280
                    ->  Parallel Seq Scan on reportimpression  (cost=0.00..991555.98 rows=2770129 width=17) (actual time=13.097..42974.126 rows=2338145 loops=2)
                          Filter: ((datelocal >= '2019-02-01 00:00:00'::timestamp without time zone) AND (datelocal < '2019-02-28 00:00:00'::timestamp without time zone))
                          Rows Removed by Filter: 6792750
                          Buffers: shared hit=123912 read=823290
                          I/O Timings: read=81228.280
Planning time: 0.185 ms
Execution time: 43777.701 ms
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
dgwebb
  • 321
  • 3
  • 16
  • 1
    Why are you grouping by `date_trunc('day', datelocal)` twice? – sticky bit Mar 29 '19 at 18:30
  • The schema as `CREATE TABLE` and `CREATE INDEX` statements along with the plans could help for that question. – sticky bit Mar 29 '19 at 18:31
  • Edited to add the schemas. It's being grouped twice because of how I'm using the output (think of a calendar that could potentially be starting at the end of one month and ending in the next - I don't want, for example, to group both 13th's from each month together). However, that's not the slow query (the first one is). – dgwebb Mar 29 '19 at 18:47
  • Please **[EDIT]** your question and add the execution plan generated using **`explain (analyze, buffers)`** - not just a "simple" explain (formatted text please, not as a screenshot) –  Mar 29 '19 at 18:50
  • Updated with the execution plan output. – dgwebb Mar 29 '19 at 18:53
  • Looks like it's now hitting the "viewership" index - I'm guessing it just took a really long time to be populated before the planner would actually hit it? Any other optimizations/improvements would be greatly appreciated (I'm no DB pro - just muddling along but trying to learn). – dgwebb Mar 29 '19 at 19:14
  • Please remember to always disclose your version of Postgres. `SELECT version();` helps. – Erwin Brandstetter Mar 29 '19 at 20:10

1 Answers1

3

Well, both your queries are on different tables (reportimpression vs. reportimpressionday), so the comparison of the two queries really isn't a comparison. Did you ANALYZE both? Various column statistics also may play a role. Index or table bloat may be different. Does a larger part of all rows qualify for Feb 2019? Etc.

One shot in the dark, compare the percentages for both tables:

SELECT tbl, round(share * 100 / total, 2) As percentage
FROM  (
   SELECT text 'reportimpression' AS tbl
        , count(*)::numeric AS total
        , count(*) FILTER (WHERE datelocal >= '2019-02-01' AND datelocal < '2019-03-01')::numeric AS share
   FROM  reportimpression

   UNION ALL
   SELECT 'reportimpressionday'
        , count(*)
        , count(*) FILTER (WHERE datelocal >= '2019-02-01' AND datelocal < '2019-03-01')
   FROM  reportimpressionday
  ) sub;

Is the one for reportimpression bigger? Then it might just exceed the number for which an index is expected to help.

Generally, your index reportimpression_datelocal_index on (datelocal) looks good for it, and reportimpression_viewership_index even allows index-only scans if autovacuum beats the write load on the table. (Though impressions & agegroup are just dead freight for this and it would work even better without).

Answer

You got 26.6 percent, and day is 26.4 percent for my query. For such a large percentage, indexes are typically not useful at all. A sequential scan is typically the fastest way. Only index-only scans may still make sense if the underlying table is much bigger. (Or you have severe table bloat, and less bloated indexes, which makes indexes more attractive again.)

Your first query may just be across the tipping point. Try narrowing the time frame until you see index-only scans. You won't see (bitmap) index scans with more then roughly 5 % of all rows qualifying (depends on many factors).

Queries

Be that as it may, consider these modified queries:

SELECT date_part('hour', datelocal)                AS hour
     , SUM(views) FILTER (WHERE gender = 'male')   AS male
     , SUM(views) FILTER (WHERE gender = 'female') AS female
FROM   reportimpression
WHERE  datelocal >= '2019-02-01'
AND    datelocal <  '2019-03-01' -- '2019-02-28'  -- ?
GROUP  BY 1
ORDER  BY 1;

SELECT date_trunc('day', datelocal)                AS day
     , SUM(views) FILTER (WHERE gender = 'male')   AS male
     , SUM(views) FILTER (WHERE gender = 'female') AS female
FROM   reportimpressionday
WHERE  datelocal >= '2019-02-01'
AND    datelocal <  '2019-03-01'
GROUP  BY 1
ORDER  BY 1;

Major points

  • When using localized date format like '2-1-2019', go through to_timestamp() with explicit format specifiers. Else this depends on locale settings and might break (silently) when called from a session with different settings. Rather use ISO date / time formats as demonstrated which do not depend on locale settings.

  • Looks like you want to include the whole month of February. But your query misses out on the upper bound. For one, February may have 29 days. An datelocal < '2-28-2019' excludes all of Feb 28 as well. Use datelocal < '2019-03-01' instead.

  • It's cheaper to group & sort by the same expression as you have in the SELECT list if you can. So use date_trunc() there, too. Don't use different expressions without need. If you need the datepart in the result, apply it on the grouped expression, like:

    SELECT date_part('day', date_trunc('day', datelocal)) AS day
    ...
    GROUP  BY date_trunc('day', datelocal)
    ORDER  BY date_trunc('day', datelocal);
    

    A bit more noisy code, but faster (and possibly easier to optimize for the query planner, too).

  • Use the aggregate FILTER clause in Postgres 9.4 or later. It's cleaner and a bit faster. See:

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Normally, the upper date would be exactly as you described (it would be less than the first day of the next month), but since it's based on user input (via a web-based report), it could be anything - I was just using this as an example. Thanks for your other suggestions. I'll try them out now. – dgwebb Mar 29 '19 at 19:37
  • Also just ran the percentage query you posted - impression is 26.6 percent and day is 26.4 percent. Very close. – dgwebb Mar 29 '19 at 19:57
  • 1
    @user1007918: Aha! We still did look in the right place. See added answer. – Erwin Brandstetter Mar 29 '19 at 20:07
  • Now to throw a wrench into everything - if I take any of the above queries and add something to the WHERE (such as further filtering by one of the other fields in the table - 'network' for example, they are MUCH slower). Since these fields are not needed in the result (I'd like the same fields in the output no matter what), is there a way to approach this to be able to incorporate additional fields in the WHERE? – dgwebb Mar 29 '19 at 20:16
  • @user1007918: Please make that a *new question* (with all relevant details). Comments are not the place ... – Erwin Brandstetter Mar 29 '19 at 20:18
  • Thanks - continued here: https://stackoverflow.com/questions/55425031/how-do-i-get-a-date-part-query-with-potentially-multiple-different-where-clauses – dgwebb Mar 29 '19 at 20:26