Trending sum over time

Question

I have a table (in Postgres 9.1) that looks something like this:

CREATE TABLE actions (
  user_id: INTEGER,
  date:    DATE,
  action:  VARCHAR(255),
  count:   INTEGER
)

For example:

    user_id    |    date    |     action   | count
---------------+------------+--------------+-------
             1 | 2013-01-01 | Email        |     1
             1 | 2013-01-02 | Call         |     3
             1 | 2013-01-03 | Email        |     3
             1 | 2013-01-04 | Call         |     2
             1 | 2013-01-04 | Voicemail    |     2
             1 | 2013-01-04 | Email        |     2
             2 | 2013-01-04 | Email        |     2

I would like to be able to view a user's total actions over time for a specific set of actions; for example, Calls + Emails:

  user_id  | date        |  count  
-----------+-------------+---------
         1 | 2013-01-01  |       1
         1 | 2013-01-02  |       4
         1 | 2013-01-03  |       7
         1 | 2013-01-04  |      11
         2 | 2013-01-04  |       2

The monstrosity that I've created so far looks like this:

SELECT
  date, user_id, SUM(count) OVER (PARTITION BY user_id ORDER BY date) AS count
FROM
  actions
WHERE
  action IN ('Call', 'Email') 
GROUP BY
  user_id, date, count;

Which works for single actions, but seems to break for multiple actions when they happen on the same day, for example instead of the expected 11 on 2013-01-04, we get 9:

    date    |      user_id | count
------------+--------------+-------
 2013-01-01 | 1            |     1
 2013-01-02 | 1            |     4
 2013-01-03 | 1            |     7
 2013-01-04 | 1            |     9 <-- should be 11?
 2013-01-04 | 2            |     2

Is it possible to tweak my query to resolve this issue? I tried removing the grouping on count, but Postgres doesn't seem to like that:

column "actions.count" must appear in the GROUP BY clause
or be used in an aggregate function
LINE 2:      date, user_id, SUM(count) OVER (PARTITION BY user...
                                ^

As always: your version of Postgres please. – Erwin Brandstetter Jul 22 '13 at 16:51 — Erwin Brandstetter, Jul 22 '13 at 16:51

score 1 · Answer 1 · answered Jul 22 '13 at 17:02

The table has a column named "count", and the expresion in the SELECT clause is aliased as "count", it is ambiguous.

Read documentation: http://www.postgresql.org/docs/9.0/static/sql-select.html#SQL-GROUPBY

In case of ambiguity, a GROUP BY name will be interpreted as an input-column name rather than an output column name.

That means, that your query does not group by "count" evaluated in the SELECT clause, but rather it groups by "count" values taken from the table.

This query gives expected results, see SQL Fiddle

SELECT date, user_id, count
from (
   Select date, user_id, 
          SUM(count) OVER (PARTITION BY user_id ORDER BY date) AS count
  FROM actions
  WHERE
    action IN ('Call', 'Email') 
) alias
GROUP BY
  user_id, date, count;

Your query is correct, your quote is correct, but the explanation covers only half the problem. Window functions are applied *after* aggregate functions. Therefore, referencing expressions with window functions is not allowed at all in `GROUP BY`. Your query avoids the problem by putting the window function into a subquery, but that's not *necessary*. You can have a window function over the result of an aggregate function in a single query level (as demonstrated in my answer), which is generally faster (2.5x in my test). — Erwin Brandstetter, Jul 22 '13 at 18:36

score 1 · Answer 2 · edited Jun 20 '20 at 09:12

Asserts

It is unclear whether you want to sort by user_id or date

It is also unclear whether you want to include dates in the result list, for which there is no row in the base table. In this case, refer to this closely related answer:
PostgreSQL: running count of rows for a query 'by minute'

Repair names

First off, I am using this test table instead of your problematic table:

CREATE TEMP TABLE actions (
  user_id integer,
  thedate    date,
  action  text,
  ct   integer
);

Your use of reserved words and function names as identifiers (column names) is part of the problem.

Repair query

Combine aggregate and window functions

Since aggregate functions are applied first, your original query lumps the two rows found for user_id = 1 and thedate = '2013-01-04' into one. You have to multiply by count(*) to get the actual running count.

You can do this without subquery, since you can combine aggregate functions and window functions. Aggregate functions are applied first. You can even have a window functions over the result of aggregate functions.

SELECT thedate
     , user_id
     , sum(ct * count(*)) OVER (PARTITION BY user_id
                                ORDER BY thedate) AS running_ct
FROM   actions
WHERE  action IN ('Call', 'Email') 
GROUP  BY user_id, thedate, ct
ORDER  BY user_id, thedate;

Or simplify to:

...
 , sum(sum(ct)) OVER (PARTITION BY user_id
                      ORDER BY thedate) AS running_ct
...

This should also be the fastest of the solutions presented.

Here, the inner sum() is an aggregate function, while the outer sum() is a window function - over the result of the aggregate function.

Or use `DISTINCT`

Another way would to use DISTINCT or DISTINCT ON, since that is applied after window functions:

DISTINCT - this is possible, since running_ct is guaranteed to be the same in this case anyway, since all peers are summed at once for the default frame definition of window functions.

SELECT DISTINCT
       thedate
     , user_id
     , sum(ct) OVER (PARTITION BY user_id ORDER BY thedate) AS running_ct
FROM   actions
WHERE  action IN ('Call', 'Email')
ORDER  BY thedate, user_id;

Or simplify with DISTINCT ON:

SELECT DISTINCT ON (thedate, user_id)
...

->SQLfiddle demonstrating all variants.

How exactly was the the use of a reserved word causing a problem? — Mark Stosberg, Jul 22 '13 at 17:22
@MarkStosberg: `count`is a reserved word in the SQL standard. The OP seems to be confused by the unfortunate use of `count` as function name and as column name. You shouldn't be using `count` as identifier in your answer, either. — Erwin Brandstetter, Jul 22 '13 at 17:30
Appreciate the information. Unfortunately I didn't create the table but I can see how that might cause problems. Can this be fixed by quoting the identifier? — John Ledbetter, Jul 22 '13 at 17:39
@JohnLedbetter: It can, but best avoid reserved words as identifiers to begin with. Quotes can be forgotten ... — Erwin Brandstetter, Jul 22 '13 at 17:41
I understand it's a /best practice/ to avoid reserved words, I just haven't spotted the actual problem being caused the issue in this case. — Mark Stosberg, Jul 22 '13 at 17:53
@MarkStosberg: Well in the case at hand, double-quoting would *not* help, since *other* problems dictate the outcome. Visibility in `ORDER BY` prefers *output* column names (as opposed to `GROUP BY`!). The SQL standard is rather confusing in this respect. More details in [this related answer](http://stackoverflow.com/questions/10906206/postgresql-how-to-return-rows-with-respect-to-a-found-row-relative-results/10910924#10910924). So, best avoid confusion to begin with. BTW. `sum(sum(ct))` is **2-3x faster** than `DISTINCT` in my test with 10k rows. — Erwin Brandstetter, Jul 22 '13 at 18:41

score 1 · Accepted Answer · answered Jul 22 '13 at 17:03

This query produces the result you are looking for:

SELECT DISTINCT   
  date, user_id, SUM(count) OVER (PARTITION BY user_id ORDER BY date) AS count 
  FROM actions
WHERE
  action IN ('Call', 'Email');

The default window is already what you want, according to the official docs and the "DISTINCT" eliminates duplicate rows when both Emails and Calls happen on the same day.

See SQL Fiddle.