Cumulative distinct count

Question

I am working on a query to get a cumulative distinct count of uids on a daily basis.

Example: say two uids (100, 200) appeared on 2016-11-01, and they appeared again the next day together with a new uid 300, so 2016-11-02 has (100, 200, 300). At that point I want the stored cumulative count to be 3, not 5, because uids 100 and 200 already appeared on the previous day.

Input table:

date                uid
2016-11-01          100
2016-11-01          200
2016-11-01          300
2016-11-01          400
2016-11-02          100
2016-11-02          200
2016-11-03          300
2016-11-03          400
2016-11-03          500
2016-11-03          600
2016-11-04          700

Expected query result:

date            daily_cumulative_count
2016-11-01              4   
2016-11-02              4
2016-11-03              6
2016-11-04              7

So far I am able to get a cumulative count of the distinct uids per day, but it counts a uid again on every later day on which it reappears.

SELECT 
  date, 
  SUM(count) OVER (
    ORDER BY date ASC 
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  )
FROM (
  SELECT 
    date, 
    COUNT(DISTINCT uid) AS count
  FROM sample_table
  GROUP BY 1
) AS daily_counts
ORDER BY date DESC;
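
To see why the query above overcounts, here is a minimal Python sketch (not part of the original post) that walks the sample rows day by day. It compares a running sum of per-day distinct counts, which is what the query computes, with the size of a running set of uids, which is the desired result. The variable names are illustrative only.

```python
from itertools import groupby

# Sample rows from the question: (date, uid).
rows = [
    ("2016-11-01", 100), ("2016-11-01", 200), ("2016-11-01", 300), ("2016-11-01", 400),
    ("2016-11-02", 100), ("2016-11-02", 200),
    ("2016-11-03", 300), ("2016-11-03", 400), ("2016-11-03", 500), ("2016-11-03", 600),
    ("2016-11-04", 700),
]

running_sum = 0   # what SUM(COUNT(DISTINCT uid)) OVER (...) accumulates
seen = set()      # every uid observed so far
naive, wanted = [], []

# ISO dates sort chronologically as strings, so sorted() orders the rows by day.
for day, group in groupby(sorted(rows), key=lambda r: r[0]):
    todays_uids = {uid for _, uid in group}
    running_sum += len(todays_uids)   # recounts uids already seen on earlier days
    seen |= todays_uids               # a set keeps each uid only once
    naive.append((day, running_sum))
    wanted.append((day, len(seen)))

print(naive)   # running sum of daily distinct counts
print(wanted)  # cumulative distinct count
```

The two lists diverge from 2016-11-02 onward, which is exactly the day on which previously seen uids first reappear.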

Any kind of help would be greatly appreciated.

score 14 · Answer 1 · answered May 15 '18 at 09:40

The easiest way:

SELECT *, COUNT(*) OVER (ORDER BY fst_date) AS cum_uids
FROM (
  SELECT uid, MIN(date) AS fst_date
  FROM t
  GROUP BY uid
) t

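Or something along these lines. Note that this returns one row per uid, so a date is repeated for every uid first seen on it.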

Stepan BLR

This is a very simple solution and works even on spark-sql. Thank you very much! – Henrique Florencio Feb 13 '20 at 11:45

I got a table with multiple rows for the same date, so did `SELECT DISTINCT *, ...` – Harm Oct 20 '21 at 12:46

score 12 · Answer 2 · answered Sep 28 '18 at 09:02

WITH firstseen AS (
  SELECT uid, MIN(date) date
  FROM sample_table
  GROUP BY 1
)
SELECT DISTINCT date, COUNT(uid) OVER (ORDER BY date) daily_cumulative_count 
FROM firstseen
ORDER BY 1

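SELECT DISTINCT is needed because each (date, COUNT(uid)) pair would otherwise be repeated once for every uid first seen on that date.

Explanation: for each date dt, it counts the uids whose first date falls between the earliest date and dt. Because we specify ORDER BY date without an explicit frame, the window defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so all rows that share a date are peers and get the same count.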
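
A small sketch to check this default frame, comparing the implicit frame with an explicit ROWS frame. It assumes Python's bundled sqlite3 module linked against SQLite 3.25 or newer (needed for window functions); the firstseen rows are written out by hand from the question's sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE firstseen (uid INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO firstseen VALUES (?, ?)",
    [(100, "2016-11-01"), (200, "2016-11-01"), (300, "2016-11-01"),
     (400, "2016-11-01"), (500, "2016-11-03"), (600, "2016-11-03"),
     (700, "2016-11-04")],
)

# No explicit frame: rows with the same date are peers and share one count.
range_counts = sorted(r[0] for r in conn.execute(
    "SELECT COUNT(uid) OVER (ORDER BY date) FROM firstseen"
))

# Explicit ROWS frame: every row gets its own running number instead.
rows_counts = sorted(r[0] for r in conn.execute(
    "SELECT COUNT(uid) OVER (ORDER BY date "
    "ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) FROM firstseen"
))

print(range_counts)
print(rows_counts)
```

With the ROWS frame the counts simply run from 1 to 7, so SELECT DISTINCT would no longer collapse them to one row per date.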

cakraww

But wouldn't this skip the 2016-11-02 in the output because there were no users who had their minimum date = 2016-11-02? – Vash Apr 13 '23 at 23:43
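
On the question's sample data the comment above holds: a date with no first-time uid drops out of the result. One possible way to keep every date, sketched here with Python's sqlite3 module (window functions need SQLite 3.25+; the days CTE and the first_date alias are made-up names, not from the answer), is to join the distinct dates back against the first-seen rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample_table (date TEXT, uid INTEGER)")
conn.executemany("INSERT INTO sample_table VALUES (?, ?)", [
    ("2016-11-01", 100), ("2016-11-01", 200), ("2016-11-01", 300), ("2016-11-01", 400),
    ("2016-11-02", 100), ("2016-11-02", 200),
    ("2016-11-03", 300), ("2016-11-03", 400), ("2016-11-03", 500), ("2016-11-03", 600),
    ("2016-11-04", 700),
])

# The answer's approach: only dates on which some uid was first seen survive.
first_seen_only = conn.execute("""
    WITH firstseen AS (
      SELECT uid, MIN(date) AS first_date FROM sample_table GROUP BY uid
    )
    SELECT DISTINCT first_date,
           COUNT(uid) OVER (ORDER BY first_date) AS daily_cumulative_count
    FROM firstseen
    ORDER BY 1
""").fetchall()

# Variant: start from every date in the table and count the uids
# whose first date is on or before it.
all_days = conn.execute("""
    WITH firstseen AS (
      SELECT uid, MIN(date) AS first_date FROM sample_table GROUP BY uid
    ),
    days AS (
      SELECT DISTINCT date FROM sample_table
    )
    SELECT days.date, COUNT(firstseen.uid) AS daily_cumulative_count
    FROM days
    LEFT JOIN firstseen ON firstseen.first_date <= days.date
    GROUP BY days.date
    ORDER BY days.date
""").fetchall()

print(first_seen_only)  # no row for 2016-11-02
print(all_days)         # one row per date in sample_table
```

The variant still only covers dates that occur in sample_table; days with no rows at all would need a separate calendar table.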

score 8 · Answer 3 · answered Nov 09 '16 at 21:14

You can use EXISTS to check whether a uid was present on any of the previous dates. Then take the running sum of the first appearances and find the max value for each date, which gives you the daily cumulative distinct count.

select dt, max(col) as daily_cumulative_count
from (select t1.*,
             -- count a row only if its uid has no row on an earlier date
             sum(case when not exists (select 1 from t t2
                                       where t2.dt < t1.dt and t2.uid = t1.uid)
                      then 1 else 0 end) over (order by dt) as col
      from t t1) x
group by dt
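
The same first-appearance idea can be tried out in a self-contained sketch. This is not the answer's exact query: it flags first appearances with NOT EXISTS in one CTE, sums the flags per day, and only then applies the running SUM. It assumes Python's sqlite3 module with SQLite 3.25+, and the CTE names flagged and per_day are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample_table (date TEXT, uid INTEGER)")
conn.executemany("INSERT INTO sample_table VALUES (?, ?)", [
    ("2016-11-01", 100), ("2016-11-01", 200), ("2016-11-01", 300), ("2016-11-01", 400),
    ("2016-11-02", 100), ("2016-11-02", 200),
    ("2016-11-03", 300), ("2016-11-03", 400), ("2016-11-03", 500), ("2016-11-03", 600),
    ("2016-11-04", 700),
])

running = conn.execute("""
    WITH flagged AS (
      -- DISTINCT guards against the same (date, uid) pair being stored twice
      SELECT DISTINCT t1.date, t1.uid,
             CASE WHEN NOT EXISTS (
                    SELECT 1 FROM sample_table t2
                    WHERE t2.uid = t1.uid AND t2.date < t1.date
                  ) THEN 1 ELSE 0 END AS is_new
      FROM sample_table t1
    ),
    per_day AS (
      -- number of uids making their first appearance on each date
      SELECT date, SUM(is_new) AS new_uids FROM flagged GROUP BY date
    )
    SELECT date, SUM(new_uids) OVER (ORDER BY date) AS daily_cumulative_count
    FROM per_day
    ORDER BY date
""").fetchall()

print(running)
```

Because per_day keeps a row for every date, including those with zero new uids, 2016-11-02 stays in the output.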

Vamsi Prabhala

This is the code that provides the solution for the original question. ✅ - Thank you! – andexte Nov 15 '22 at 09:02

score 2 · Answer 4 · answered Apr 28 '17 at 08:29

Please try the following...

SELECT date AS date,
       COUNT( uid ) AS daily_cumulative_count
FROM ( SELECT leftTable.date AS date,
              rightTable.uid AS uid
       FROM sample_table AS leftTable
       JOIN sample_table AS rightTable ON leftTable.date >= rightTable.date
       GROUP BY leftTable.date,
                rightTable.uid
     ) AS allUIDSForDateFinder
GROUP BY date;

This statement starts by joining one instance of sample_table to another in such a way that each record in leftTable has associated with it a copy of each record from rightTable that has an earlier or equal date value. This effectively attaches a list to each date of all uid values that have occurred up to and including that date value.

The resulting dataset is refined to unique date and uid combinations through use of GROUP BY.

The refined dataset from the subquery allUIDSForDateFinder is then grouped by date by the main body of the query, and a COUNT() of uid values associated with each group is performed.
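
To make these intermediate steps concrete, here is a small sketch using Python's sqlite3 module on the question's sample data (the aliases l and r are shorthand for leftTable and rightTable, and pairs is a made-up subquery name). It reports how many rows the join produces, how many unique date and uid pairs remain, and the final counts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample_table (date TEXT, uid INTEGER)")
conn.executemany("INSERT INTO sample_table VALUES (?, ?)", [
    ("2016-11-01", 100), ("2016-11-01", 200), ("2016-11-01", 300), ("2016-11-01", 400),
    ("2016-11-02", 100), ("2016-11-02", 200),
    ("2016-11-03", 300), ("2016-11-03", 400), ("2016-11-03", 500), ("2016-11-03", 600),
    ("2016-11-04", 700),
])

JOIN = "FROM sample_table AS l JOIN sample_table AS r ON l.date >= r.date"

# Step 1: every row paired with every row of the same or an earlier date.
joined_rows = conn.execute(f"SELECT COUNT(*) {JOIN}").fetchone()[0]

# Step 2: the join reduced to unique (date, uid) combinations.
unique_pairs = conn.execute(
    f"SELECT COUNT(*) FROM (SELECT l.date, r.uid {JOIN} GROUP BY l.date, r.uid)"
).fetchone()[0]

# Step 3: count the uids attached to each date.
result = conn.execute(f"""
    SELECT date, COUNT(uid) AS daily_cumulative_count
    FROM (SELECT l.date AS date, r.uid AS uid {JOIN} GROUP BY l.date, r.uid) AS pairs
    GROUP BY date
    ORDER BY date
""").fetchall()

print(joined_rows, unique_pairs)
print(result)
```

On these 11 input rows the join already yields 79 rows before grouping, so the approach is simple but can get expensive as the table grows.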

If you have any questions or comments, then please feel free to post a Comment accordingly.

score 1 · Answer 5 · answered Feb 08 '22 at 12:43

A version similar to @stepan-blr's answer, but with the final result in the shape you are looking for.

Version using WITH:

WITH t as (
    SELECT uid
           , min(dt) fst_date 
    FROM input_table 
    GROUP BY uid
)
SELECT DISTINCT fst_date
                , count(uid) over (order by fst_date ) daily_cumulative_count
FROM t

Version using a nested SELECT:

SELECT DISTINCT fst_date
                , count(uid) over (order by fst_date ) daily_cumulative_count
FROM (
    SELECT uid
           , min(dt) fst_date 
    FROM input_table 
    GROUP BY uid
     ) t

score 0 · Answer 6 · answered Apr 30 '23 at 08:55

Here is an easy solution that uses only a self-join and COUNT(DISTINCT), so it should work on most SQL dialects (the self-join can get slow on large tables):

-- fecha (date) is stored as DATE in ISO format so that >= compares chronologically, not as text
CREATE TABLE MyTable
(
    fecha   DATE,
    uid     INT
);

INSERT INTO MyTable (fecha, uid) VALUES ('2016-11-01', 100);
INSERT INTO MyTable (fecha, uid) VALUES ('2016-11-01', 200);
INSERT INTO MyTable (fecha, uid) VALUES ('2016-11-01', 300);
INSERT INTO MyTable (fecha, uid) VALUES ('2016-11-01', 400);
INSERT INTO MyTable (fecha, uid) VALUES ('2016-11-02', 100);
INSERT INTO MyTable (fecha, uid) VALUES ('2016-11-02', 200);
INSERT INTO MyTable (fecha, uid) VALUES ('2016-11-03', 300);
INSERT INTO MyTable (fecha, uid) VALUES ('2016-11-03', 400);
INSERT INTO MyTable (fecha, uid) VALUES ('2016-11-03', 500);
INSERT INTO MyTable (fecha, uid) VALUES ('2016-11-03', 600);
INSERT INTO MyTable (fecha, uid) VALUES ('2016-11-04', 700);
INSERT INTO MyTable (fecha, uid) VALUES ('2016-11-05', 700);
INSERT INTO MyTable (fecha, uid) VALUES ('2016-11-06', 700);
INSERT INTO MyTable (fecha, uid) VALUES ('2016-11-07', 700);
INSERT INTO MyTable (fecha, uid) VALUES ('2016-11-08', 700);
INSERT INTO MyTable (fecha, uid) VALUES ('2016-11-08', 900);

And the solution:

SELECT 
    t1.fecha, 
    COUNT(DISTINCT t2.uid) as daily_cumulative_count
FROM 
    MyTable t1
INNER JOIN 
    MyTable t2
ON 
    t1.fecha >= t2.fecha
GROUP BY 
    t1.fecha
ORDER BY 
    t1.fecha
