3

I have a table with the following columns:

 |date | user_id | week_beg | month_beg|

SQL to create table with test values:

CREATE TABLE uniques
(
  date DATE,
  user_id INT,
  week_beg DATE,
  month_beg DATE
);

INSERT INTO uniques VALUES ('2013-01-01', 1, '2012-12-30', '2013-01-01');
INSERT INTO uniques VALUES ('2013-01-03', 3, '2012-12-30', '2013-01-01');
INSERT INTO uniques VALUES ('2013-01-06', 4, '2013-01-06', '2013-01-01');
INSERT INTO uniques VALUES ('2013-01-07', 4, '2013-01-06', '2013-01-01');

INPUT TABLE:

 | date       | user_id     | week_beg   | month_beg  |    
 | 2013-01-01 | 1           | 2012-12-30 | 2013-01-01 |    
 | 2013-01-03 | 3           | 2012-12-30 | 2013-01-01 |    
 | 2013-01-06 | 4           | 2013-01-06 | 2013-01-01 |    
 | 2013-01-07 | 4           | 2013-01-06 | 2013-01-01 |  

OUTPUT TABLE:

 | date       | time_series | cnt        |                 
 | 2013-01-01 | D           | 1          |                 
 | 2013-01-01 | W           | 1          |                 
 | 2013-01-01 | M           | 1          |                 
 | 2013-01-03 | D           | 1          |                 
 | 2013-01-03 | W           | 2          |                 
 | 2013-01-03 | M           | 2          |                 
 | 2013-01-06 | D           | 1          |                 
 | 2013-01-06 | W           | 1          |                 
 | 2013-01-06 | M           | 3          |                 
 | 2013-01-07 | D           | 1          |                 
 | 2013-01-07 | W           | 1          |                 
 | 2013-01-07 | M           | 3          |

I want to calculate the number of distinct user_id values for a date:

  1. For that date

  2. For that week up to that date (Week to date)

  3. For the month up to that date (Month to date)

1 is easy to calculate. For 2 and 3 I am trying queries like these:

SELECT
  date,
  'W' AS "time_series",
  count(DISTINCT user_id) OVER (PARTITION BY week_beg) AS "cnt"
FROM uniques

SELECT
  date,
  'M' AS "time_series",
  count(DISTINCT user_id) OVER (PARTITION BY month_beg) AS "cnt"
FROM uniques

Postgres does not allow DISTINCT inside window functions, so this approach does not work.

I have also tried a GROUP BY approach, but it does not work: it gives me counts for whole weeks/months.

What's the best way to approach this problem?

Erwin Brandstetter
ishan

4 Answers

3

Count all rows

SELECT date, '1_D' AS time_series,  count(DISTINCT user_id) AS cnt
FROM   uniques
GROUP  BY 1

UNION  ALL
SELECT DISTINCT ON (1)
       date, '2_W', count(*) OVER (PARTITION BY week_beg ORDER BY date)
FROM   uniques

UNION  ALL
SELECT DISTINCT ON (1)
       date, '3_M', count(*) OVER (PARTITION BY month_beg ORDER BY date)
FROM   uniques
ORDER  BY 1, time_series
  • Your columns week_beg and month_beg are 100% redundant and can easily be replaced by date_trunc('week', date + 1)::date - 1 and date_trunc('month', date)::date respectively.

  • Your week seems to start on Sunday (offset by one from ISO weeks, which start on Monday), hence the + 1 .. - 1.

  • The default frame of a window function with ORDER BY in the OVER clause is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. That's exactly what you need.

  • Use UNION ALL, not UNION.

  • Your unfortunate choice of values for time_series (D, W, M) does not sort well. I renamed them to make the final ORDER BY easier.

  • This query can deal with multiple rows per day. Counts include all peers for a day.

  • More about DISTINCT ON: see the PostgreSQL documentation for SELECT.
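
A quick sanity check, assuming the test table from the question, confirms that the stored columns match the computed expressions:

-- Should return no rows if week_beg and month_beg are truly redundant
SELECT *
FROM   uniques
WHERE  week_beg  <> date_trunc('week', date + 1)::date - 1
OR     month_beg <> date_trunc('month', date)::date;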

DISTINCT users per day

To count every user only once per day, use a CTE with DISTINCT ON:

WITH x AS (SELECT DISTINCT ON (1,2) date, user_id FROM uniques)
SELECT date, '1_D' AS time_series,  count(user_id) AS cnt
FROM   x
GROUP  BY 1

UNION ALL
SELECT DISTINCT ON (1)
       date, '2_W'
      ,count(*) OVER (PARTITION BY (date_trunc('week', date + 1)::date - 1)
                      ORDER BY date)
FROM   x

UNION ALL
SELECT DISTINCT ON (1)
       date, '3_M'
      ,count(*) OVER (PARTITION BY date_trunc('month', date) ORDER BY date)
FROM   x
ORDER BY 1, 2

DISTINCT users over dynamic period of time

You can always resort to correlated subqueries. They tend to be slow with big tables!
Building on the previous queries:

WITH du AS (SELECT date, user_id FROM uniques GROUP BY 1,2)
    ,d  AS (
    SELECT date
          ,(date_trunc('week', date + 1)::date - 1) AS week_beg
          ,date_trunc('month', date)::date AS month_beg
    FROM   uniques
    GROUP  BY 1
    )
SELECT date, '1_D' AS time_series,  count(user_id) AS cnt
FROM   du
GROUP  BY 1

UNION ALL
SELECT date, '2_W', (SELECT count(DISTINCT user_id) FROM du
                     WHERE  du.date BETWEEN d.week_beg AND d.date )
FROM   d
GROUP  BY date, week_beg

UNION ALL
SELECT date, '3_M', (SELECT count(DISTINCT user_id) FROM du
                     WHERE  du.date BETWEEN d.month_beg AND d.date)
FROM   d
GROUP  BY date, month_beg
ORDER  BY 1,2;

SQL Fiddle for all three solutions.

Faster with dense_rank()

@Clodoaldo came up with a major improvement: use the window function dense_rank(). Here is another idea for an optimized version. It should be even faster to exclude daily duplicates right away. The performance gain grows with the number of rows per day.

Building on a simplified and sanitized data model: without the redundant columns, and with day as the column name instead of date.

date is a reserved word in standard SQL and a basic type name in PostgreSQL, and shouldn't be used as an identifier.

CREATE TABLE uniques(
   day date     -- instead of "date"
  ,user_id int
);

Improved query:

WITH du AS (
   SELECT DISTINCT ON (1, 2)
          day, user_id 
         ,date_trunc('week',  day + 1)::date - 1 AS week_beg
         ,date_trunc('month', day)::date         AS month_beg
   FROM   uniques
   )
SELECT day, count(user_id) AS d, max(w) AS w, max(m) AS m
FROM  (
    SELECT user_id, day
          ,dense_rank() OVER(PARTITION BY week_beg  ORDER BY user_id) AS w
          ,dense_rank() OVER(PARTITION BY month_beg ORDER BY user_id) AS m
    FROM   du
    ) s
GROUP  BY day
ORDER  BY day;
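
Against the four sample rows from the question, this should return one row per day, matching the daily, week-to-date and month-to-date counts in the OUTPUT TABLE above:

 day        | d | w | m
 2013-01-01 | 1 | 1 | 1
 2013-01-03 | 1 | 2 | 2
 2013-01-06 | 1 | 1 | 3
 2013-01-07 | 1 | 1 | 3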

SQL Fiddle demonstrating the performance of 4 faster variants. It depends on your data distribution which is fastest for you.
All of them are about 10x as fast as the correlated subqueries version (which isn't bad for correlated subqueries).

Erwin Brandstetter
  • Thanks @Erwin for your answer. It solves the problem partially as it does not count distinct user_id for M and W. I have updated the test data in my question to catch that. – ishan Apr 17 '13 at 05:32
  • Yes, I add the week_beg and month_beg columns from a previous table (the same way as you mentioned) to make the GROUP BY easier. – ishan Apr 17 '13 at 05:36
  • As it says in the documentation "Aggregate window functions, unlike normal aggregate functions, do not allow DISTINCT or ORDER BY to be used within the function argument list." – ishan Apr 17 '13 at 05:42
  • @ishan: That's right, no `DISTINCT` inside window functions. But you can do it *before* you apply window functions. I added a solution. – Erwin Brandstetter Apr 17 '13 at 05:47
  • There is another problem with the solution 'DISTINCT users per day'. When I get the numbers for a M, i need distinct users from the month beginning to the date. This recounts the user_id for different days, but when you take the whole month, these user_id's should not be double counted. Thanks a lot for your help @Erwin – ishan Apr 17 '13 at 05:56
  • @ishan: I see, this is a tricky problem. You want a distinct count over a dynamic period of time. Best I can think of are correlated subqueries. A rare case. – Erwin Brandstetter Apr 17 '13 at 06:33
  • I am taking a Ruby-based approach to inject the counts based on the time ranges. Thanks for your help, I learned a great deal about Postgres while trying the Postgres way – ishan Apr 17 '13 at 06:41
  • @ishan: Correlated subqueries are no beauty queens. But I would still be very surprised if this wasn't *much* faster than any script solution outside of the db. Note: I fixed an off-by-one error with my calculation of `week_beg`. – Erwin Brandstetter Apr 17 '13 at 16:25
2

Without correlated subqueries. SQL Fiddle

with u as (
    select
        "date", user_id,
        date_trunc('week', "date" + 1)::date - 1 week_beg,
        date_trunc('month', "date")::date month_beg
    from uniques
)
select
    "date", count(distinct user_id) D,
    max(week_dr) W, max(month_dr) M
from (
    select
        user_id, "date",
        dense_rank() over(partition by week_beg order by user_id) week_dr,
        dense_rank() over(partition by month_beg order by user_id) month_dr
    from u
    ) s
group by "date"
order by "date"
Clodoaldo Neto
0

Try

SELECT
  *
FROM
(
  SELECT dates, count(user_id), 'D' AS time_series FROM users_data GROUP BY dates
  UNION ALL
  SELECT max(dates), count(user_id), 'W' FROM users_data GROUP BY date_part('year', dates), date_part('week', dates)
  UNION ALL
  SELECT max(dates), count(user_id), 'M' FROM users_data GROUP BY date_part('year', dates), date_part('month', dates)
) temp ORDER BY dates, time_series

SQL Fiddle

Akash
-1

Try queries like this

SELECT count(DISTINCT user_id), to_char(date, 'YYYY-MM-DD') AS date_period
FROM   uniques
GROUP  BY date_period
georgecj11