Imagine an account table that looks like this:
Column | Type | Modifiers
------------+-----------------------------+-----------
id | bigint | not null
signupdate | timestamp without time zone | not null
canceldate | timestamp without time zone |
I want to get a report of the number of signups and cancellations by month.
It is pretty straight-forward to do it in two queries, one for the signups by month and then one for the cancellations by month. Is there an efficient way to do it in a single query? Some months may have zero signups and cancellations, and should show up with a zero in the results.
With source data like this:
id signupDate cancelDate
1 2012-01-13
2 2012-01-15 2012-02-05
3 2012-03-01 2012-03-20
we should get the following results:
Date signups cancellations
2012-01 2 0
2012-02 0 1
2012-03 1 1
I'm using postgresql 9.0
Update after the first answer:
Craig Ringer provided a nice answer below. On my data set of approximately 75k records, the first and third examples performed similarly. The second example seems to have an error somewhere, it returned incorrect results.
Looking at the results from an explain analyze (and my table does have an index on signup_date), the first query returns:
Sort (cost=2086062.39..2086062.89 rows=200 width=24) (actual time=863.831..863.833 rows=20 loops=1)
Sort Key: m.m
Sort Method: quicksort Memory: 26kB
InitPlan 2 (returns $1)
-> Result (cost=0.12..0.13 rows=1 width=0) (actual time=0.063..0.064 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Limit (cost=0.00..0.12 rows=1 width=8) (actual time=0.040..0.040 rows=1 loops=1)
-> Index Scan using account_created_idx on account (cost=0.00..8986.92 rows=75759 width=8) (actual time=0.039..0.039 rows=1 loops=1)
Index Cond: (created IS NOT NULL)
InitPlan 3 (returns $2)
-> Aggregate (cost=2991.39..2991.40 rows=1 width=16) (actual time=37.108..37.108 rows=1 loops=1)
-> Seq Scan on account (cost=0.00..2612.59 rows=75759 width=16) (actual time=0.008..14.102 rows=75759 loops=1)
-> HashAggregate (cost=2083057.21..2083063.21 rows=200 width=24) (actual time=863.801..863.806 rows=20 loops=1)
-> Nested Loop (cost=0.00..2077389.49 rows=755696 width=24) (actual time=37.238..805.333 rows=94685 loops=1)
Join Filter: ((date_trunc('month'::text, a.created) = m.m) OR (date_trunc('month'::text, a.terminateddate) = m.m))
-> Function Scan on generate_series m (cost=0.00..10.00 rows=1000 width=8) (actual time=37.193..37.197 rows=20 loops=1)
-> Materialize (cost=0.00..3361.39 rows=75759 width=16) (actual time=0.004..11.916 rows=75759 loops=20)
-> Seq Scan on account a (cost=0.00..2612.59 rows=75759 width=16) (actual time=0.003..24.019 rows=75759 loops=1)
Total runtime: 872.183 ms
and the third query returns:
Sort (cost=1199951.68..1199952.18 rows=200 width=8) (actual time=732.354..732.355 rows=20 loops=1)
Sort Key: m.m
Sort Method: quicksort Memory: 26kB
InitPlan 4 (returns $2)
-> Result (cost=0.12..0.13 rows=1 width=0) (actual time=0.030..0.030 rows=1 loops=1)
InitPlan 3 (returns $1)
-> Limit (cost=0.00..0.12 rows=1 width=8) (actual time=0.022..0.022 rows=1 loops=1)
-> Index Scan using account_created_idx on account (cost=0.00..8986.92 rows=75759 width=8) (actual time=0.022..0.022 rows=1 loops=1)
Index Cond: (created IS NOT NULL)
InitPlan 5 (returns $3)
-> Aggregate (cost=2991.39..2991.40 rows=1 width=16) (actual time=30.212..30.212 rows=1 loops=1)
-> Seq Scan on account (cost=0.00..2612.59 rows=75759 width=16) (actual time=0.004..8.276 rows=75759 loops=1)
-> HashAggregate (cost=12.50..1196952.50 rows=200 width=8) (actual time=65.226..732.321 rows=20 loops=1)
-> Function Scan on generate_series m (cost=0.00..10.00 rows=1000 width=8) (actual time=30.262..30.264 rows=20 loops=1)
SubPlan 1
-> Aggregate (cost=2992.34..2992.35 rows=1 width=8) (actual time=21.098..21.098 rows=1 loops=20)
-> Seq Scan on account (cost=0.00..2991.39 rows=379 width=8) (actual time=0.265..20.720 rows=3788 loops=20)
Filter: (date_trunc('month'::text, created) = $0)
SubPlan 2
-> Aggregate (cost=2992.34..2992.35 rows=1 width=8) (actual time=13.994..13.994 rows=1 loops=20)
-> Seq Scan on account (cost=0.00..2991.39 rows=379 width=8) (actual time=2.363..13.887 rows=998 loops=20)
Filter: (date_trunc('month'::text, terminateddate) = $0)
Total runtime: 732.487 ms
This certainly makes it appear that the third query is faster, but when I run the queries from the command-line using the 'time' command, the first query is consistently faster, though only by a few milliseconds.
Surprisingly to me, running two separate queries (one to count signups and one to count cancellations) is significantly faster. It took less than half the time to run, ~300ms vs ~730ms. Of course that leaves more work to be done externally, but for my purposes it still might be the best solution. Here are the single queries:
select
m,
count(a.id) as "signups"
from
generate_series(
(SELECT date_trunc('month',min(signup_date)) FROM accounts),
(SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts),
interval '1 month') as m
INNER JOIN accounts a ON (date_trunc('month',a.signup_date) = m)
group by m
order by m
;
select
m,
count(a.id) as "cancellations"
from
generate_series(
(SELECT date_trunc('month',min(signup_date)) FROM accounts),
(SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts),
interval '1 month') as m
INNER JOIN accounts a ON (date_trunc('month',a.cancel_date) = m)
group by m
order by m
;
I have marked Craig's answer as correct, but if you can make it faster, I'd love to hear about it