1

I have a database like this:

ID Day Value
1 2021-09-01 a
2 2021-09-01 b
3 2021-09-01 c
4 2021-09-02 d
5 2021-09-02 a
6 2021-09-02 a
7 2021-09-02 e
8 2021-09-03 c
9 2021-09-03 f
10 2021-09-03 a

I'd like to count how many different rows I have daily and all time, but the all time uniqueness should count only with the date before (the business logic behind that I'd like to count if the user is new). so I'd like to see this output:

Day Daily Unique Counts All Time Unique Counts
2021-09-01 3 3
2021-09-02 3 2
2021-09-03 3 1

notes: 2021-09-02 the Daily Unique Counts counting 'd', 'a' and 'e' but the All time Unique Counts won't count the 'a' at all because it was counted the day before.

Right now I can get the Daily Unique Counts right but I can't figure it out how can I count the All time Unique Counts column.

SELECT Date, COUNT (DISTINCT id) AS Daily Unique Counts,

FROM table GROUP BY 1 ORDER BY 1 DESC

I hope it's clear what I'd like to see, pls help with this because it's driving me crazy :)

athew
  • 45
  • 6

3 Answers3

1

Consider below approach using HyperLogLog++ functions which estimates daily cardinality from sketches and then used in final math

select day, Daily_Unique_Count, 
  ( select hll_count.merge(sketch) - hll_count.merge(if(offset = 0, null, sketch))  
    from unnest(array_reverse(prev_Sketches)) sketch with offset
  ) All_Time_Unique_Counts
from (
  select day, Daily_Unique_Count, 
    array_agg(Daily_Sketch) over(order by day) prev_Sketches
  from (
    select day, count(distinct value) as Daily_Unique_Count,
      hll_count.init(value) as Daily_Sketch
    from table
    group by day
  )
)     

if applied to sample data in your question - output is

enter image description here

The way it works (from inside out):

  1. First daily distinct counts and daily sketches are calculated
  2. Then for each day all sketches aggregated into array for that day and all previous days
  3. Finally (in most outer select) for each day - you calculate cardinality for all days previous and current day and substract with cardinality for all just previous days - kaboom! :o)
Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
0

It would be better if you use window analytic functions to build this, because it will allow you to set the window of dates for this peculiar case. Docs are here and here a great article explaining how it works.

Your query would be something like:

SELECT day, count(distinct id) OVER part AS daily_unique_counts
FROM table
WINDOW part AS (PARTITION BY day ORDER BY day DESC
ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)

UPDATE: Just saw that it won't run using the distinct specification. But it can be solved as seen in this other post.

Fred
  • 56
  • 4
0

I think this is what you need

WITH SAMPLE AS
(
SELECT 1 AS ID, '2021-09-01' AS DAY,'A' AS VALUE UNION ALL
SELECT 2 AS ID, '2021-09-01' AS DAY,'B' AS VALUE UNION ALL
SELECT 3 AS ID, '2021-09-01' AS DAY,'C' AS VALUE UNION ALL
SELECT 4 AS ID, '2021-09-02' AS DAY,'D' AS VALUE UNION ALL
SELECT 5 AS ID, '2021-09-02' AS DAY,'A' AS VALUE UNION ALL
SELECT 6 AS ID, '2021-09-02' AS DAY,'A' AS VALUE UNION ALL
SELECT 7 AS ID, '2021-09-02' AS DAY,'E' AS VALUE UNION ALL
SELECT 8 AS ID, '2021-09-03' AS DAY,'C' AS VALUE UNION ALL
SELECT 9 AS ID, '2021-09-03' AS DAY,'F' AS VALUE UNION ALL
SELECT 10 AS ID,    '2021-09-03' AS DAY,'A' AS VALUE UNION ALL

SELECT 11 AS ID,    '2021-09-04' AS DAY,'A' AS VALUE UNION ALL
SELECT 12 AS ID,    '2021-09-04' AS DAY,'K' AS VALUE UNION ALL
SELECT 13 AS ID,    '2021-09-04' AS DAY,'D' AS VALUE UNION ALL
SELECT 14 AS ID,    '2021-09-05' AS DAY,'A' AS VALUE 
),
DISTINCT_COUNT AS
(
SELECT  DAY, 
        COUNT (DISTINCT VALUE) AS DAILY_UNIQUE_COUNTS
FROM SAMPLE
GROUP BY DAY
)
,CTE3_SUBTRACT_COUNT
AS
(
SELECT  A.DAY AS A_DAY,
        COUNT(DISTINCT A.VALUE) AS SUBTRACT_ME
FROM SAMPLE A
JOIN SAMPLE B 
ON A.DAY > B.DAY AND A.VALUE = B.VALUE
GROUP BY A.DAY
)
SELECT  A.DAY,
        MAX(DAILY_UNIQUE_COUNTS) AS DAILY_UNIQUE_COUNTS,
        MIN(DAILY_UNIQUE_COUNTS - IFNULL(SUBTRACT_ME,0)) AS ALL_TIME_COUNT
FROM SAMPLE A
LEFT JOIN CTE3_SUBTRACT_COUNT B
ON A.DAY = B.A_DAY
LEFT JOIN DISTINCT_COUNT C ON
C.DAY = A.DAY
GROUP BY A.DAY

Mr.Batra
  • 787
  • 1
  • 5
  • 11