
Continuation of other question here:

How to get date_part query to hit index?

The following query hits a compound index I created on the datelocal, views, impressions, gender, and agegroup fields:

SELECT date_part('hour', datelocal)                AS hour
     , SUM(views) FILTER (WHERE gender = 'male')   AS male
     , SUM(views) FILTER (WHERE gender = 'female') AS female
FROM   reportimpression
WHERE  datelocal >= '2019-02-01' AND datelocal <  '2019-03-01'
GROUP  BY 1
ORDER  BY 1;

However, I'd like to be able to also filter this query down based on additional clauses in the WHERE, for example:

SELECT date_part('hour', datelocal)                AS hour
     , SUM(views) FILTER (WHERE gender = 'male')   AS male
     , SUM(views) FILTER (WHERE gender = 'female') AS female
FROM   reportimpression
WHERE  datelocal >= '2019-02-01' AND datelocal <  '2019-03-01'
AND network LIKE '%'
GROUP  BY 1
ORDER  BY 1;

This second query is MUCH slower than the first, even though it should be operating on far fewer records, and it doesn't hit my index at all.

Table schema:

CREATE TABLE reportimpression (
    datelocal timestamp without time zone,
    devicename text,
    network text,
    sitecode text,
    advertisername text,
    mediafilename text,
    gender text,
    agegroup text,
    views integer,
    impressions integer,
    dwelltime numeric
);

-- Indices -------------------------------------------------------

CREATE INDEX reportimpression_datelocal_index ON reportimpression(datelocal timestamp_ops);
CREATE INDEX reportimpression_viewership_index ON reportimpression(datelocal timestamp_ops,views int4_ops,impressions int4_ops,gender text_ops,agegroup text_ops);
CREATE INDEX reportimpression_test_index ON reportimpression(datelocal timestamp_ops,(date_part('hour'::text, datelocal)) float8_ops);

EXPLAIN ANALYZE output:

Finalize GroupAggregate  (cost=1005368.37..1005385.70 rows=3151 width=24) (actual time=70615.636..70615.649 rows=24 loops=1)
  Group Key: (date_part('hour'::text, datelocal))
  ->  Sort  (cost=1005368.37..1005369.94 rows=3151 width=24) (actual time=70615.631..70615.634 rows=48 loops=1)
        Sort Key: (date_part('hour'::text, datelocal))
        Sort Method: quicksort  Memory: 28kB
        ->  Gather  (cost=1005005.62..1005331.75 rows=3151 width=24) (actual time=70615.456..70641.208 rows=48 loops=1)
              Workers Planned: 1
              Workers Launched: 1
              ->  Partial HashAggregate  (cost=1004005.62..1004016.65 rows=3151 width=24) (actual time=70613.132..70613.152 rows=24 loops=2)
                    Group Key: date_part('hour'::text, datelocal)
                    ->  Parallel Seq Scan on reportimpression  (cost=0.00..996952.63 rows=2821195 width=17) (actual time=0.803..69876.914 rows=2429159 loops=2)
                          Filter: ((datelocal >= '2019-02-01 00:00:00'::timestamp without time zone) AND (datelocal < '2019-03-01 00:00:00'::timestamp without time zone) AND (network ~~ '%'::text))
                          Rows Removed by Filter: 6701736
Planning time: 0.195 ms
Execution time: 70641.349 ms

Do I need to create additional indexes, tweak my SELECT, or something else entirely?

1 Answer


Your added predicate uses the LIKE operator:

AND network LIKE '%'

The actual query plan depends on what you pass instead of '%'. But, generally, plain btree indexes are useless for this. You'll need a trigram index or use the text search infrastructure or similar, depending on what patterns you might be looking for.
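For illustration, a minimal sketch of a trigram index on network, assuming the pg_trgm contrib extension is available and that you actually need pattern matching (the index name is made up):

-- trigram indexes require the pg_trgm extension (ships with Postgres contrib)
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- GIN trigram index: can support LIKE / ILIKE patterns such as '%Network Name%'
CREATE INDEX reportimpression_network_trgm_idx
    ON reportimpression USING gin (network gin_trgm_ops);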

You might even combine multiple indexing strategies. Example:
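As a hedged sketch of how that could look here: keep the existing btree index on datelocal for the range predicate and add the trigram index above for the pattern. With a selective pattern (the value '%Network Name%' below is an assumption taken from the comments, not a bare '%'), the planner can combine both indexes with bitmap scans:

-- assuming both reportimpression_datelocal_index (btree) and the trigram
-- index above exist, a selective pattern lets the planner use bitmap
-- index scans on each and join them with a BitmapAnd
SELECT date_part('hour', datelocal)                AS hour
     , SUM(views) FILTER (WHERE gender = 'male')   AS male
     , SUM(views) FILTER (WHERE gender = 'female') AS female
FROM   reportimpression
WHERE  datelocal >= '2019-02-01' AND datelocal <  '2019-03-01'
AND    network LIKE '%Network Name%'  -- selective pattern, unlike the bare '%'
GROUP  BY 1
ORDER  BY 1;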


If that's supposed to be:

AND network = '<input_string>'

then, by all means, actually use the = operator, not LIKE. Reasons in ascending order of importance:

  1. shorter
  2. less confusing
  3. makes the job for the Postgres planner simpler (very slightly cheaper)
  4. correct

If you inadvertently pass a string containing special characters, you might get incorrect results.
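A quick illustration with made-up strings of why LIKE can silently match more rows than intended:

-- '_' and '%' are wildcards in LIKE, so an unescaped user-supplied value
-- can match rows the user never meant to select
SELECT 'net_01' LIKE 'net_01';  -- true
SELECT 'netX01' LIKE 'net_01';  -- also true: '_' matches any single character
SELECT 'netX01' =    'net_01';  -- false: = compares the strings literally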

  • I used % as an example, but assume that it would be an exact value from that field in the table. It could use "=" if that would help things. – dgwebb Mar 29 '19 at 20:33
  • Possible patterns would be specific string values from that field. For example, "Network Name", and this could be set to look for an exact match. – dgwebb Mar 29 '19 at 20:38
  • I tried adding a multi-column index that includes all of the potential columns that the WHERE might include (in order of how they would be included) and that seemed to do it. Query is now hitting the index and is much faster. – dgwebb Mar 29 '19 at 21:18
  • Ah, ok. Guess I read your first comment too hastily. It clarifies quite a bit. I suggest you add the index you arrived at in the question. Might be tweaked, yet. – Erwin Brandstetter Mar 29 '19 at 21:34
  • Thanks! I suspect my queries are still quite inefficient (just ignorance and inexperience on my part), but at this point, I'm not quite sure how to get the same result with something more performant. – dgwebb Mar 29 '19 at 22:02
  • That additional condition `network LIKE '%'` is obviously identical to `network IS NOT NULL`. The question that remains is: why doesn't PostgreSQL choose an index scan on `reportimpression_datelocal_index` with the `LIKE` condition as filter? Could it be that most rows have NULL in `network` and the condition on `datelocal` isn't selective? In that case a (partial?) index on `network` might help. – Laurenz Albe Mar 30 '19 at 04:12
  • All comparisons in the WHERE were changed to "=". Minimal impact on performance. The bigger issue seems to be with the base query (not including any of the additional WHERE clauses) or the indexes. Think I can work through the WHERE clauses once I get the root of the issue corrected. Continued here: https://stackoverflow.com/questions/55540605/how-do-i-improve-date-based-query-performance-on-a-large-table – dgwebb Apr 05 '19 at 19:00