
I want to build a daily time series starting from a certain date and calculate a few statistics for each day. However, this query is very slow... Is there any way to speed it up? (For example, scan the table once in a subquery and compute the various stats on it for each day.)

In code this would look like

for i, day in enumerate(series):
    previous_days = series[0 : i + 1]   # all days up to and including this one
    some_calculation_a = some_operation_on(previous_days)
    some_calculation_b = some_other_operation_on(previous_days)

Here is an example for a time series counting, for each day, the users with fewer than 5 messages (and, separately, exactly 1 message) up to that date:

with days as (
   select date::timestamptz
   from   generate_series('2015-07-09', now(), '1 day'::interval) date
),
msgs as (
   select days.date,
          (select count(customer_id) from daily_messages
           where  sum < 5 and date_trunc('day'::text, created_at) <= days.date) as lt_5,
          (select count(customer_id) from daily_messages
           where  sum = 1 and date_trunc('day'::text, created_at) <= days.date) as eq_1
   from   days, daily_messages
   where  date_trunc('day'::text, created_at) = days.date
   group  by days.date
)

select * from msgs;

EXPLAIN output:

CTE Scan on msgs  (cost=815579.03..815583.03 rows=200 width=24)
  Output: msgs.date, msgs.lt_5, msgs.eq_1
  CTE days
    ->  Function Scan on pg_catalog.generate_series date  (cost=0.01..10.01 rows=1000 width=8)
          Output: date.date
          Function Call: generate_series('2015-07-09 00:00:00+00'::timestamp with time zone, now(), '1 day'::interval)
  CTE msgs
    ->  Group  (cost=6192.62..815569.02 rows=200 width=8)
          Output: days.date, (SubPlan 2), (SubPlan 3)
          Group Key: days.date
          ->  Merge Join  (cost=6192.62..11239.60 rows=287970 width=8)
                Output: days.date
                Merge Cond: (days.date = (date_trunc('day'::text, daily_messages_2.created_at)))
                ->  Sort  (cost=69.83..72.33 rows=1000 width=8)
                      Output: days.date
                      Sort Key: days.date
                      ->  CTE Scan on days  (cost=0.00..20.00 rows=1000 width=8)
                            Output: days.date
                ->  Sort  (cost=6122.79..6266.78 rows=57594 width=8)
                      Output: daily_messages_2.created_at, (date_trunc('day'::text, daily_messages_2.created_at))
                      Sort Key: (date_trunc('day'::text, daily_messages_2.created_at))
                      ->  Seq Scan on public.daily_messages daily_messages_2  (cost=0.00..1568.94 rows=57594 width=8)
                            Output: daily_messages_2.created_at, date_trunc('day'::text, daily_messages_2.created_at)
          SubPlan 2
            ->  Aggregate  (cost=2016.89..2016.90 rows=1 width=32)
                  Output: count(daily_messages.customer_id)
                  ->  Seq Scan on public.daily_messages  (cost=0.00..2000.89 rows=6399 width=32)
                        Output: daily_messages.created_at, daily_messages.customer_id, daily_messages.day_total, daily_messages.sum, daily_messages.elapsed
                        Filter: ((daily_messages.sum < '5'::numeric) AND (date_trunc('day'::text, daily_messages.created_at) <= days.date))
          SubPlan 3
            ->  Aggregate  (cost=2001.13..2001.14 rows=1 width=32)
                  Output: count(daily_messages_1.customer_id)
                  ->  Seq Scan on public.daily_messages daily_messages_1  (cost=0.00..2000.89 rows=96 width=32)
                        Output: daily_messages_1.created_at, daily_messages_1.customer_id, daily_messages_1.day_total, daily_messages_1.sum, daily_messages_1.elapsed
                        Filter: ((daily_messages_1.sum = '1'::numeric) AND (date_trunc('day'::text, daily_messages_1.created_at) <= days.date))

1 Answer


In addition to being very inefficient, I suspect the query is also incorrect. Assuming current Postgres 9.6, my educated guess:

SELECT created_at::date
     , sum(count(customer_id) FILTER (WHERE sum < 5)) OVER w AS lt_5
     , sum(count(customer_id) FILTER (WHERE sum = 1)) OVER w AS eq_1
FROM   daily_messages m
WHERE  created_at >= timestamptz '2015-07-09'  -- sargable!
AND    created_at <  now()                     -- probably redundant
GROUP  BY 1
WINDOW w AS (ORDER BY created_at::date);

All those correlated subqueries are probably not needed. I replaced them with window functions combined with aggregate FILTER clauses; you can run a window function over the result of an aggregate function.
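To see the pattern in isolation, here is a minimal self-contained sketch; the demo table and data are made up, only the FILTER-plus-window technique carries over:

CREATE TEMP TABLE demo (created_at date, sum numeric);
INSERT INTO demo VALUES
  ('2015-07-09', 3), ('2015-07-09', 7)
, ('2015-07-10', 1), ('2015-07-11', 4);

SELECT created_at
     , sum(count(*) FILTER (WHERE sum < 5)) OVER w AS lt_5
FROM   demo
GROUP  BY created_at
WINDOW w AS (ORDER BY created_at);  -- default frame: first day through current day

GROUP BY computes one filtered count per day; the outer sum() OVER w then accumulates those daily counts up to the current day, replacing one correlated subquery per day with a single pass over the table.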

The CTEs don't help either (unnecessary overhead). You would only need a single subquery, or not even that: just the result of the set-returning function generate_series(), which can deliver timestamptz directly. Be aware of the implications, though: the query then depends on the time zone setting of the session.
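For instance (the session time zone here is chosen only for demonstration):

SET timezone = 'America/New_York';
SELECT generate_series(timestamptz '2015-07-09'
                     , timestamptz '2015-07-11'
                     , interval '1 day');

Each generated timestamptz marks midnight in the session time zone; rerun with SET timezone = 'UTC' and every day boundary shifts by the UTC offset.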

On second thought, I removed generate_series() completely. As long as you have an [INNER] JOIN to daily_messages, only days with actual rows remain in the result anyway, so generate_series() adds nothing. It would make sense with a LEFT JOIN, but there is not enough information in the question to tell.

"Sargable" means a predicate can use a plain index: created_at >= timestamptz '2015-07-09' compares the bare column to a constant, so an index on created_at applies, while the original date_trunc('day'::text, created_at) <= days.date wraps the column in an expression, which only a matching expression index could serve.
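A sketch of the difference; the index is an assumption, since the question does not say which indexes exist:

CREATE INDEX ON daily_messages (created_at);

-- sargable: the planner can use the index above for the range
SELECT count(*) FROM daily_messages
WHERE  created_at >= timestamptz '2015-07-09';

-- not sargable: the expression hides the column from that index
SELECT count(*) FROM daily_messages
WHERE  date_trunc('day'::text, created_at) >= timestamptz '2015-07-09';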

You might replace count(customer_id) with count(*); the two are equivalent if customer_id is defined NOT NULL, and count(*) is slightly cheaper. Not enough information in the question.

Might be optimized further, but there is not enough information to be more specific in the answer.

Include days without new entries in result

SELECT day
     , sum(lt_5_day) OVER w AS lt_5
     , sum(eq_1_day) OVER w AS eq_1
FROM  (
   SELECT day::date
   FROM   generate_series(date '2015-07-09', current_date, interval '1 day') day
   ) d
LEFT   JOIN (
   SELECT created_at::date AS day
        , count(customer_id) FILTER (WHERE sum < 5) AS lt_5_day
        , count(customer_id) FILTER (WHERE sum = 1) AS eq_1_day
   FROM   daily_messages m
   WHERE  created_at >= timestamptz '2015-07-09'
   GROUP  BY 1
   ) m USING (day)
WINDOW w AS (ORDER BY day);
  1. Aggregate daily sums in subquery m.
  2. Generate series of all days in time range in subquery d.
  3. Use LEFT [OUTER] JOIN to retain all days in the result, even without new rows for the day.
Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Thanks! Need to unpack this to learn it better. I'm also getting an error 'm.customer_id must appear in the group by clause or be used in another aggregate fx' – William Falcon Feb 27 '17 at 19:23
  • also, daily_messages is a materialized view. Would the index apply to that created_at or do I need another index on top of that view? – William Falcon Feb 27 '17 at 19:25
  • My first draft had a bug. I replaced it with a working query. A materialized view (unlike a view) is just another table for the purpose of this query. You need an index on daily_messages. You might be able to optimize further with partial or multicolumn indexes (a sketch follows after these comments). There is not enough information in the question to tell. – Erwin Brandstetter Feb 27 '17 at 19:58
  • really fast changes. Thank you. The only downside to this query is that it skips days if there isn't an entry there – William Falcon Feb 28 '17 at 00:33
  • @WilliamFalcon: You might have said so to begin with. I added a solution. Please be more forthcoming next time. – Erwin Brandstetter Feb 28 '17 at 03:29
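The partial or multicolumn indexes mentioned in the comments might look like this; the column choices are guesses based on the query above, not details confirmed by the question:

CREATE INDEX ON daily_messages (created_at);                -- basic range support
CREATE INDEX ON daily_messages (created_at) WHERE sum < 5;  -- partial: only rows counted for lt_5
CREATE INDEX ON daily_messages (created_at, sum);           -- multicolumn alternative

Since daily_messages is a materialized view, the indexes are created on the view itself, just as on a regular table.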