76

I am using count and group by to get the number of subscribers registered each day:

  SELECT created_at, COUNT(email)  
    FROM subscriptions 
GROUP BY created at;

Result:

created_at  count
-----------------
04-04-2011  100
05-04-2011   50
06-04-2011   50
07-04-2011  300

I want to get the cumulative total of subscribers every day instead. How do I get this?

created_at  count
-----------------
04-04-2011  100
05-04-2011  150
06-04-2011  200
07-04-2011  500
OMG Ponies
  • 325,700
  • 82
  • 523
  • 502
khairul
  • 1,230
  • 2
  • 12
  • 19

6 Answers6

127

With larger datasets, window functions are the most efficient way to perform these kinds of queries -- the table will be scanned only once, instead of once for each date, like a self-join would do. It also looks a lot simpler. :) PostgreSQL 8.4 and up have support for window functions.

This is what it looks like:

SELECT created_at, sum(count(email)) OVER (ORDER BY created_at)
FROM subscriptions
GROUP BY created_at;

Here OVER creates the window; ORDER BY created_at means that it has to sum up the counts in created_at order.


Edit: If you want to remove duplicate emails within a single day, you can use sum(count(distinct email)). Unfortunately this won't remove duplicates that cross different dates.

If you want to remove all duplicates, I think the easiest is to use a subquery and DISTINCT ON. This will attribute emails to their earliest date (because I'm sorting by created_at in ascending order, it'll choose the earliest one):

SELECT created_at, sum(count(email)) OVER (ORDER BY created_at)
FROM (
    SELECT DISTINCT ON (email) created_at, email
    FROM subscriptions ORDER BY email, created_at
) AS subq
GROUP BY created_at;

If you create an index on (email, created_at), this query shouldn't be too slow either.


(If you want to test, this is how I created the sample dataset)

create table subscriptions as
   select date '2000-04-04' + (i/10000)::int as created_at,
          'foofoobar@foobar.com' || (i%700000)::text as email
   from generate_series(1,1000000) i;
create index on subscriptions (email, created_at);
intgr
  • 19,834
  • 5
  • 59
  • 69
  • This is great intgr, only that my subscriptions table contains a lot of duplicate email rows. So what `over` is doing is `sum`-ming the `count` numbers, but I still need to recalculate the unique emails on every subsequent date. – khairul Apr 18 '11 at 09:53
  • I updated my answer with a `DISTINCT ON` subquery. It's still a lot faster than Andriy's answer -- can process a million rows within a few seconds -- but perhaps more complicated. – intgr Apr 18 '11 at 13:55
  • Nice tip on the generate_series function! – Endy Tjahjono Apr 19 '11 at 06:33
  • 2
    Note that `DISTINCT ON` can also be turned into an equivalent query with `GROUP BY`; in this case, `SELECT email, MIN(created_at) as created_at FROM subscriptions GROUP BY email`. Which is more efficient will probably vary, although ready-sorted sub-query from the `DISTINCT ON` seems to give some advantage to the sort needed by the Window function. – IMSoP May 29 '13 at 11:51
  • I'd like to have this on a per month basis, how would I need to change this query? I'm having real problems with that. – herrherr Jan 22 '18 at 06:50
  • To anyone reading this answer, see the answer that uses `rollup` posted by @Pstr. In 2021+ `rollup` seems to be the way to go (more performant). – Little Code Sep 03 '21 at 14:06
9

Use:

SELECT a.created_at,
       (SELECT COUNT(b.email)
          FROM SUBSCRIPTIONS b
         WHERE b.created_at <= a.created_at) AS count
  FROM SUBSCRIPTIONS a
OMG Ponies
  • 325,700
  • 82
  • 523
  • 502
2
SELECT
  s1.created_at,
  COUNT(s2.email) AS cumul_count
FROM subscriptions s1
  INNER JOIN subscriptions s2 ON s1.created_at >= s2.created_at
GROUP BY s1.created_at
Andriy M
  • 76,112
  • 17
  • 94
  • 154
  • I've tried the `sum(s2.count)` and the console gives an error: 'aggregate function calls cannot be nested' – khairul Apr 18 '11 at 06:43
  • I meant that to be `COUNT(s2.email)`, sorry. Please see my edited solution. – Andriy M Apr 18 '11 at 06:49
  • Thanks buddy! I was working with a more complicated query, and your structure is easy to understand (and thus, to implement). – khairul Apr 18 '11 at 07:20
2

I assume you want only one row per day and you want to still show days without any subscriptions (suppose nobody subscribes for a certain date, do you want to show that date with the balance of the previous day?). If this is the case, you can use the 'with' feature:

with recursive serialdates(adate) as (
    select cast('2011-04-04' as date)
    union all
    select adate + 1 from serialdates where adate < cast('2011-04-07' as date)
)
select D.adate,
(
    select count(distinct email)
    from subscriptions
    where created_at between date_trunc('month', D.adate) and D.adate
)
from serialdates D
Endy Tjahjono
  • 24,120
  • 23
  • 83
  • 123
  • Thanks, that `with` function could be useful too. Learnt something new. – khairul Apr 18 '11 at 07:53
  • 2
    Instead of serialdates you can use the built-in function: `generate_series(timestamp '2011-04-04', timestamp '2011-04-07', interval '1 day')` – intgr Apr 18 '11 at 13:53
2

To anyone seeing this answer today (2021) You can use rollup

SELECT created_at, COUNT(email)  
    FROM subscriptions 
GROUP BY rollup(created_at);

this will give you a new row with the total

created_at  count
-----------------
04-04-2011  100
05-04-2011   50
06-04-2011   50
07-04-2011  300
NULL        500

You can also use rollup for partial results if you have more than one parameter to show in your group by. If you have a created_by for instance:

SELECT created_at, created_by COUNT(email)  
    FROM subscriptions 
GROUP BY rollup(created_at, created_by);

this will give you a new row with the total

created_at  created_by  count
-----------------------------
04-04-2011     1        80
04-04-2011     2        20
04-04-2021    NULL      100
05-04-2011     1        20
05-04-2011     2        30
05-04-2011    NULL      50
NULL          NULL      150

I only took the numbers of the first two days, but that's the idea. it will show grouped by date, then total of that day, then the total of totals.

Order matters in the rollup() here, as to how the partial totals will be displayed

Pstr
  • 777
  • 1
  • 8
  • 17
-3

The best way is to have a calendar table: calendar ( date date, month int, quarter int, half int, week int, year int )

Then, you can join this table to make summary for the field you need.

mentat
  • 1