30 day rolling count of distinct IDs

Question

So after looking at what seems to be a common question being asked and not being able to get any solution to work for me, I decided I should ask for myself.

I have a data set with two columns: session_start_time, uid

I am trying to generate a rolling 30 day tally of unique sessions

It is simple enough to query for the number of unique uids per day:

SELECT 
COUNT(DISTINCT(uid)) 
FROM segment_clean.users_sessions
WHERE session_start_time >= CURRENT_DATE - interval '30 days'

it is also relatively simple to calculate the daily unique uids over a date range.

SELECT
DATE_TRUNC('day',session_start_time) AS "date"
,COUNT(DISTINCT uid) AS "count"
FROM segment_clean.users_sessions
WHERE session_start_time >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY date(session_start_time)

I then I tried several ways to do a rolling 30 day unique count over a time interval

SELECT 
DATE(session_start_time) AS "running30day"
,COUNT(distinct(
    case when date(session_start_time) >= running30day - interval '30 days'
    AND date(session_start_time) <= running30day
    then uid
    end)
) AS "unique_30day"
FROM segment_clean.users_sessions
WHERE session_start_time >= CURRENT_DATE - interval '3 months'
GROUP BY date(session_start_time) 
Order BY running30day desc

I really thought this would work but when looking into the results, it appears I'm getting the same results as I was when doing the daily unique rather than the unique over 30days.

I am writing this query from Metabase using the SQL query editor. the underlying tables are in redshift.

If you read this far, thank you, your time has value and I appreciate the fact that you have spent some of it to read my question.

EDIT: As rightfully requested, I added an example of the data set I'm working with and the desired outcome.

+-----+-------------------------------+
| UID |      SESSION_START_TIME       |
+-----+-------------------------------+
|     |                               |
| 10  | 2020-01-13T01:46:07.000-05:00 |
|     |                               |
| 5   | 2020-01-13T01:46:07.000-05:00 |
|     |                               |
| 3   | 2020-01-18T02:49:23.000-05:00 |
|     |                               |
| 9   | 2020-03-06T18:18:28.000-05:00 |
|     |                               |
| 2   | 2020-03-06T18:18:28.000-05:00 |
|     |                               |
| 8   | 2020-03-31T23:13:33.000-04:00 |
|     |                               |
| 3   | 2020-08-28T18:23:15.000-04:00 |
|     |                               |
| 2   | 2020-08-28T18:23:15.000-04:00 |
|     |                               |
| 9   | 2020-08-28T18:23:15.000-04:00 |
|     |                               |
| 3   | 2020-08-28T18:23:15.000-04:00 |
|     |                               |
| 8   | 2020-09-15T16:40:29.000-04:00 |
|     |                               |
| 3   | 2020-09-21T20:49:09.000-04:00 |
|     |                               |
| 1   | 2020-11-05T21:31:48.000-05:00 |
|     |                               |
| 6   | 2020-11-05T21:31:48.000-05:00 |
|     |                               |
| 8   | 2020-12-12T04:42:00.000-05:00 |
|     |                               |
| 8   | 2020-12-12T04:42:00.000-05:00 |
|     |                               |
| 5   | 2020-12-12T04:42:00.000-05:00 |
+-----+-------------------------------+

bellow is what the result I would like looks like:

+------------+---------------------+
|    DATE    | UNIQUE 30 DAY COUNT |
+------------+---------------------+
|            |                     |
| 2020-01-13 | 3                   |
|            |                     |
| 2020-01-18 | 1                   |
|            |                     |
| 2020-03-06 | 3                   |
|            |                     |
| 2020-03-31 | 1                   |
|            |                     |
| 2020-08-28 | 4                   |
|            |                     |
| 2020-09-15 | 2                   |
|            |                     |
| 2020-09-21 | 1                   |
|            |                     |
| 2020-11-05 | 2                   |
|            |                     |
| 2020-12-12 | 2                   |
+------------+---------------------+

Thank you

could you also add what the data/table looks like to help in reproducing the issue? — Smurphy0000, Aug 11 '20 at 03:29
you are still using CURRENT_DATE in your where : WHERE session_start_time >= CURRENT_DATE - interval '3 months' Could we the value that you are passing to running30day and interval — GregM, Aug 11 '20 at 03:48
Thank you, I went back and added in an example of the data set I'm working with and an example of the outcome I am looking for, — Tobiah Adam, Aug 11 '20 at 13:44

Gordon Linoff · Answer 1 · 2020-08-11T22:30:49.270

You can approach this by keeping a counter of when users are counted and then uncounted -- 30 (or perhaps 31) days later. Then, determine the "islands" of being counted, and aggregate. This involves:

Unpivoting the data to have an "enters count" and "leaves" count for each session.
Accumulate the count so on each day for each user you know whether they are counted or not.
This defines "islands" of counting. Determine where the islands start and stop -- getting rid of all the detritus in-between.
Now you can simply do a cumulative sum on each date to determine the 30 day session.

In SQL, this looks like:

with t as (
      select uid, date_trunc('day', session_start_time) as s_day, 1 as inc
      from users_sessions
      union all
      select uid, date_trunc('day', session_start_time) + interval '31 day' as s_day, -1
      from users_sessions
     ),
     tt as (  -- increment the ins and outs to determine whether a uid is in or out on a given day
      select uid, s_day, sum(inc) as day_inc,
             sum(sum(inc)) over (partition by uid order by s_day rows between unbounded preceding and current row) as running_inc
      from t
      group by uid, s_day
     ),
     ttt as (  -- find the beginning and end of the islands
      select tt.uid, tt.s_day,
             (case when running_inc > 0 then 1 else -1 end) as in_island
      from (select tt.*,
                   lag(running_inc) over (partition by uid order by s_day) as prev_running_inc,
                   lead(running_inc) over (partition by uid order by s_day) as next_running_inc
            from tt
           ) tt
      where running_inc > 0 and (prev_running_inc = 0 or prev_running_inc is null) or
            running_inc = 0 and (next_running_inc > 0 or next_running_inc is null)
     )
select s_day,
       sum(sum(in_island)) over (order by s_day rows between unbounded preceding and current row) as active_30
from ttt
group by s_day;

Here is a db<>fiddle.

Hi Gordon, thank you for the suggested answer! unfortunately, I'm getting the following error: "[Amazon](500310) Invalid operation: Aggregate window functions with an ORDER BY clause require a frame clause" — Tobiah Adam, Aug 11 '20 at 13:48
@TobiahAdam . . . That is easily fixed. If you add a db<>fiddle using Postgres, it will be easier to see this work. — Gordon Linoff, Aug 11 '20 at 13:51
my apologies as I am really fairly new to SQL and 100% self-taught through the aid of forums like this. I am unfamiliar with what a db<>fiddle is or how I add one. — Tobiah Adam, Aug 11 '20 at 14:22
I noticed you updated your answer thank you! unfortunately, I now get the following error: '[Amazon](500310) Invalid operation: column "t.s_day" must appear in the GROUP BY clause or be used in an aggregate function' — Tobiah Adam, Aug 11 '20 at 18:09
@TobiahAdam . . . I fixed the typos in the code and added a db<>fiddle. To be honest, you only had two columns in the table, so I should have created one myself originally to avoid the typos. — Gordon Linoff, Aug 11 '20 at 22:31

score 0 · Answer 2 · answered Jul 13 '21 at 14:02

I'm pretty sure the easier way to do this is to use a join. This creates a list of all the distinct users who had a session on each day and a list of all distinct dates in the data. Then it one-to-many joins the user list to the date list and counts the distinct users, the key here is the expanded join criteria that matches a range of dates to a single date via a system of inequalities.

with users as 
    (select
    distinct uid,
    date_trunc('day',session_start_time) AS dt
    from <table>
    where session_start_time >= '2021-05-01'),
dates as 
    (select
    distinct date_trunc('day',session_start_time) AS dt
    from <table>
    where session_start_time >= '2021-05-01')
select 
    count(distinct uid), 
    dates.dt 
    from users
join
    dates
    on users.dt >= dates.dt - 29
    and users.dt <= dates.dt
    group by dates.dt
    order by dt desc
;

30 day rolling count of distinct IDs

2 Answers2