3

Imagine an account table that looks like this:

   Column   |            Type             | Modifiers 
------------+-----------------------------+-----------
 id         | bigint                      | not null
 signupdate | timestamp without time zone | not null
 canceldate | timestamp without time zone | 

I want to get a report of the number of signups and cancellations by month.

It is pretty straight-forward to do it in two queries, one for the signups by month and then one for the cancellations by month. Is there an efficient way to do it in a single query? Some months may have zero signups and cancellations, and should show up with a zero in the results.

With source data like this:

id    signupDate     cancelDate
 1    2012-01-13     
 2    2012-01-15     2012-02-05
 3    2012-03-01     2012-03-20

we should get the following results:

Date      signups    cancellations    
2012-01         2                0
2012-02         0                1
2012-03         1                1

I'm using postgresql 9.0

Update after the first answer:

Craig Ringer provided a nice answer below. On my data set of approximately 75k records, the first and third examples performed similarly. The second example seems to have an error somewhere, it returned incorrect results.

Looking at the results from an explain analyze (and my table does have an index on signup_date), the first query returns:

Sort  (cost=2086062.39..2086062.89 rows=200 width=24) (actual time=863.831..863.833 rows=20 loops=1)
  Sort Key: m.m
  Sort Method:  quicksort  Memory: 26kB
  InitPlan 2 (returns $1)
    ->  Result  (cost=0.12..0.13 rows=1 width=0) (actual time=0.063..0.064 rows=1 loops=1)
          InitPlan 1 (returns $0)
            ->  Limit  (cost=0.00..0.12 rows=1 width=8) (actual time=0.040..0.040 rows=1 loops=1)
                  ->  Index Scan using account_created_idx on account  (cost=0.00..8986.92 rows=75759 width=8) (actual time=0.039..0.039 rows=1 loops=1)
                        Index Cond: (created IS NOT NULL)
  InitPlan 3 (returns $2)
    ->  Aggregate  (cost=2991.39..2991.40 rows=1 width=16) (actual time=37.108..37.108 rows=1 loops=1)
          ->  Seq Scan on account  (cost=0.00..2612.59 rows=75759 width=16) (actual time=0.008..14.102 rows=75759 loops=1)
  ->  HashAggregate  (cost=2083057.21..2083063.21 rows=200 width=24) (actual time=863.801..863.806 rows=20 loops=1)
        ->  Nested Loop  (cost=0.00..2077389.49 rows=755696 width=24) (actual time=37.238..805.333 rows=94685 loops=1)
              Join Filter: ((date_trunc('month'::text, a.created) = m.m) OR (date_trunc('month'::text, a.terminateddate) = m.m))
              ->  Function Scan on generate_series m  (cost=0.00..10.00 rows=1000 width=8) (actual time=37.193..37.197 rows=20 loops=1)
              ->  Materialize  (cost=0.00..3361.39 rows=75759 width=16) (actual time=0.004..11.916 rows=75759 loops=20)
                    ->  Seq Scan on account a  (cost=0.00..2612.59 rows=75759 width=16) (actual time=0.003..24.019 rows=75759 loops=1)
Total runtime: 872.183 ms

and the third query returns:

Sort  (cost=1199951.68..1199952.18 rows=200 width=8) (actual time=732.354..732.355 rows=20 loops=1)
  Sort Key: m.m
  Sort Method:  quicksort  Memory: 26kB
  InitPlan 4 (returns $2)
    ->  Result  (cost=0.12..0.13 rows=1 width=0) (actual time=0.030..0.030 rows=1 loops=1)
          InitPlan 3 (returns $1)
            ->  Limit  (cost=0.00..0.12 rows=1 width=8) (actual time=0.022..0.022 rows=1 loops=1)
                  ->  Index Scan using account_created_idx on account  (cost=0.00..8986.92 rows=75759 width=8) (actual time=0.022..0.022 rows=1 loops=1)
                        Index Cond: (created IS NOT NULL)
  InitPlan 5 (returns $3)
    ->  Aggregate  (cost=2991.39..2991.40 rows=1 width=16) (actual time=30.212..30.212 rows=1 loops=1)
          ->  Seq Scan on account  (cost=0.00..2612.59 rows=75759 width=16) (actual time=0.004..8.276 rows=75759 loops=1)
  ->  HashAggregate  (cost=12.50..1196952.50 rows=200 width=8) (actual time=65.226..732.321 rows=20 loops=1)
        ->  Function Scan on generate_series m  (cost=0.00..10.00 rows=1000 width=8) (actual time=30.262..30.264 rows=20 loops=1)
        SubPlan 1
          ->  Aggregate  (cost=2992.34..2992.35 rows=1 width=8) (actual time=21.098..21.098 rows=1 loops=20)
                ->  Seq Scan on account  (cost=0.00..2991.39 rows=379 width=8) (actual time=0.265..20.720 rows=3788 loops=20)
                      Filter: (date_trunc('month'::text, created) = $0)
        SubPlan 2
          ->  Aggregate  (cost=2992.34..2992.35 rows=1 width=8) (actual time=13.994..13.994 rows=1 loops=20)
                ->  Seq Scan on account  (cost=0.00..2991.39 rows=379 width=8) (actual time=2.363..13.887 rows=998 loops=20)
                      Filter: (date_trunc('month'::text, terminateddate) = $0)
Total runtime: 732.487 ms

This certainly makes it appear that the third query is faster, but when I run the queries from the command-line using the 'time' command, the first query is consistently faster, though only by a few milliseconds.

Surprisingly to me, running two separate queries (one to count signups and one to count cancellations) is significantly faster. It took less than half the time to run, ~300ms vs ~730ms. Of course that leaves more work to be done externally, but for my purposes it still might be the best solution. Here are the single queries:

select 
    m,
    count(a.id) as "signups"
from
    generate_series(
        (SELECT date_trunc('month',min(signup_date)) FROM accounts), 
        (SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts), 
        interval '1 month') as m
INNER JOIN accounts a ON (date_trunc('month',a.signup_date) = m)
group by m
order by m 
;

select 
    m,
    count(a.id) as "cancellations"
from
    generate_series(
        (SELECT date_trunc('month',min(signup_date)) FROM accounts), 
        (SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts), 
        interval '1 month') as m
INNER JOIN accounts a ON (date_trunc('month',a.cancel_date) = m)
group by m
order by m 
;

I have marked Craig's answer as correct, but if you can make it faster, I'd love to hear about it

Shad
  • 341
  • 3
  • 8

2 Answers2

3

Here are three different ways to do it. All depend on generating a time series then scanning it. One uses subqueries to aggregate data for each month. One joins the table twice against the series with different criteria. An alternate form does a single join on the time series, retaining rows that match either start or end date, then uses predicates in the counts to further filter the results.

EXPLAIN ANALYZE will help you pick which approach works best for your data.

http://sqlfiddle.com/#!12/99c2a/9

Test setup:

CREATE TABLE accounts
    ("id" int, "signup_date" timestamp, "cancel_date" timestamp);

INSERT INTO accounts
    ("id", "signup_date", "cancel_date")
VALUES
    (1, '2012-01-13 00:00:00', NULL),
    (2, '2012-01-15 00:00:00', '2012-02-05'),
    (3, '2012-03-01 00:00:00', '2012-03-20')
;

By single join and filter in count:

SELECT m, 
  count(nullif(date_trunc('month',a.signup_date) = m,'f')), 
  count(nullif(date_trunc('month',a.cancel_date) = m,'f'))
FROM generate_series(
  (SELECT date_trunc('month',min(signup_date)) FROM accounts),
  (SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts),
  INTERVAL '1' MONTH
) AS m
INNER JOIN accounts a ON (date_trunc('month',a.signup_date) = m OR date_trunc('month',a.cancel_date) = m)
GROUP BY m
ORDER BY m;

By joining the accounts table twice:

SELECT m, count(s.signup_date) AS n_signups, count(c.cancel_date) AS n_cancels 
FROM generate_series( 
  (SELECT date_trunc('month',min(signup_date)) FROM accounts),
  (SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts),
  INTERVAL '1' MONTH
) AS m LEFT OUTER JOIN accounts s ON (date_trunc('month',s.signup_date) = m) LEFT OUTER JOIN accounts c ON (date_trunc('month',c.cancel_date) = m)
GROUP BY m
ORDER BY m;

Alternately, using subqueries:

SELECT m, (
  SELECT count(signup_date) 
  FROM accounts 
  WHERE date_trunc('month',signup_date) = m
) AS n_signups, (
  SELECT count(signup_date)
  FROM accounts
  WHERE date_trunc('month',cancel_date) = m
)AS n_cancels 
FROM generate_series( 
  (SELECT date_trunc('month',min(signup_date)) FROM accounts),
  (SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts),
  INTERVAL '1' MONTH
) AS m
GROUP BY m
ORDER BY m;
Craig Ringer
  • 307,061
  • 76
  • 688
  • 778
1

New answer after update.

I'm not shocked that you get better results from two simpler queries; sometimes it's simply more efficient to do things that way. However, there was an issue with my original answer that will've signficicantly impacted performance.

Erwin accurately pointed out in another answer that Pg can't use a simple b-tree index on a date with date_trunc, so you're better off using ranges. It can use an index created on the expression date_trunc('month',colname) but you're better off avoiding the creation of another unnecessary index.

Rephrasing the single-scan-and-filter query to use ranges produces:

SELECT m, 
  count(nullif(date_trunc('month',a.signup_date) = m,'f')), 
  count(nullif(date_trunc('month',a.cancel_date) = m,'f'))
FROM generate_series(
  (SELECT date_trunc('month',min(signup_date)) FROM accounts),
  (SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts),
  INTERVAL '1' MONTH
) AS m
INNER JOIN accounts a ON (
  (a.signup_date >= m AND a.signup_date < m + INTERVAL '1' MONTH) 
  OR (a.cancel_date >= m AND a.cancel_date < m + INTERVAL '1' MONTH))
GROUP BY m
ORDER BY m;

There's no need to avoid date_trunc in non-indexable conditions, so I've only changed to use interval ranges in the join condition.

Where the original query used a seq scan and materialize, this now uses a bitmap index scan if there are indexes on signup_date and cancel_date.

In PostgreSQL 9.2 better performance may possibly be gained by adding:

CREATE INDEX account_signup_or_cancel ON accounts(signup_date,cancel_date);

and possibly:

CREATE INDEX account_signup_date_nonnull 
ON accounts(signup_date) WHERE (signup_date IS NOT NULL);

CREATE INDEX account_cancel_date_desc_nonnull 
ON accounts(cancel_date DESC) WHERE (cancel_date IS NOT NULL);

to allow index-only scans. It's hard to make solid index recommendations without the actual data to test with.

Alternately, the subquery based approach with improved indexable filter condition:

SELECT m, (
  SELECT count(signup_date) 
  FROM accounts 
  WHERE signup_date >= m AND signup_date < m + INTERVAL '1' MONTH
) AS n_signups, (
  SELECT count(cancel_date)
  FROM accounts
  WHERE cancel_date >= m AND cancel_date < m + INTERVAL '1' MONTH
) AS n_cancels 
FROM generate_series( 
  (SELECT date_trunc('month',min(signup_date)) FROM accounts),
  (SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts),
  INTERVAL '1' MONTH
) AS m
GROUP BY m
ORDER BY m;

will benefit from ordinary b-tree indexes on signup_date and cancel_date, or from:

CREATE INDEX account_signup_date_nonnull 
ON accounts(signup_date) WHERE (signup_date IS NOT NULL);

CREATE INDEX account_cancel_date_nonnull 
ON accounts(cancel_date) WHERE (cancel_date IS NOT NULL);

Remember that every index you create imposes a penalty on INSERT and UPDATE performance, and competes with other indexes and help data for cache space. Try to create only indexes that make a big difference and are useful for other queries.

Community
  • 1
  • 1
Craig Ringer
  • 307,061
  • 76
  • 688
  • 778