5

So after looking at what seems to be a common question being asked and not being able to get any solution to work for me, I decided I should ask for myself.

I have a data set with two columns: session_start_time, uid

I am trying to generate a rolling 30 day tally of unique sessions

It is simple enough to query for the number of unique uids per day:

SELECT 
COUNT(DISTINCT(uid)) 
FROM segment_clean.users_sessions
WHERE session_start_time >= CURRENT_DATE - interval '30 days'

it is also relatively simple to calculate the daily unique uids over a date range.

SELECT
DATE_TRUNC('day',session_start_time) AS "date"
,COUNT(DISTINCT uid) AS "count"
FROM segment_clean.users_sessions
WHERE session_start_time >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY date(session_start_time)

I then I tried several ways to do a rolling 30 day unique count over a time interval

SELECT 
DATE(session_start_time) AS "running30day"
,COUNT(distinct(
    case when date(session_start_time) >= running30day - interval '30 days'
    AND date(session_start_time) <= running30day
    then uid
    end)
) AS "unique_30day"
FROM segment_clean.users_sessions
WHERE session_start_time >= CURRENT_DATE - interval '3 months'
GROUP BY date(session_start_time) 
Order BY running30day desc

I really thought this would work but when looking into the results, it appears I'm getting the same results as I was when doing the daily unique rather than the unique over 30days.

I am writing this query from Metabase using the SQL query editor. the underlying tables are in redshift.

If you read this far, thank you, your time has value and I appreciate the fact that you have spent some of it to read my question.

EDIT: As rightfully requested, I added an example of the data set I'm working with and the desired outcome.

+-----+-------------------------------+
| UID |      SESSION_START_TIME       |
+-----+-------------------------------+
|     |                               |
| 10  | 2020-01-13T01:46:07.000-05:00 |
|     |                               |
| 5   | 2020-01-13T01:46:07.000-05:00 |
|     |                               |
| 3   | 2020-01-18T02:49:23.000-05:00 |
|     |                               |
| 9   | 2020-03-06T18:18:28.000-05:00 |
|     |                               |
| 2   | 2020-03-06T18:18:28.000-05:00 |
|     |                               |
| 8   | 2020-03-31T23:13:33.000-04:00 |
|     |                               |
| 3   | 2020-08-28T18:23:15.000-04:00 |
|     |                               |
| 2   | 2020-08-28T18:23:15.000-04:00 |
|     |                               |
| 9   | 2020-08-28T18:23:15.000-04:00 |
|     |                               |
| 3   | 2020-08-28T18:23:15.000-04:00 |
|     |                               |
| 8   | 2020-09-15T16:40:29.000-04:00 |
|     |                               |
| 3   | 2020-09-21T20:49:09.000-04:00 |
|     |                               |
| 1   | 2020-11-05T21:31:48.000-05:00 |
|     |                               |
| 6   | 2020-11-05T21:31:48.000-05:00 |
|     |                               |
| 8   | 2020-12-12T04:42:00.000-05:00 |
|     |                               |
| 8   | 2020-12-12T04:42:00.000-05:00 |
|     |                               |
| 5   | 2020-12-12T04:42:00.000-05:00 |
+-----+-------------------------------+

bellow is what the result I would like looks like:

+------------+---------------------+
|    DATE    | UNIQUE 30 DAY COUNT |
+------------+---------------------+
|            |                     |
| 2020-01-13 | 3                   |
|            |                     |
| 2020-01-18 | 1                   |
|            |                     |
| 2020-03-06 | 3                   |
|            |                     |
| 2020-03-31 | 1                   |
|            |                     |
| 2020-08-28 | 4                   |
|            |                     |
| 2020-09-15 | 2                   |
|            |                     |
| 2020-09-21 | 1                   |
|            |                     |
| 2020-11-05 | 2                   |
|            |                     |
| 2020-12-12 | 2                   |
+------------+---------------------+

Thank you

  • could you also add what the data/table looks like to help in reproducing the issue? – Smurphy0000 Aug 11 '20 at 03:29
  • you are still using CURRENT_DATE in your where : WHERE session_start_time >= CURRENT_DATE - interval '3 months' Could we the value that you are passing to running30day and interval – GregM Aug 11 '20 at 03:48
  • 1
    Please, give an example of data: input and expected result. – diziaq Aug 11 '20 at 05:24
  • Thank you, I went back and added in an example of the data set I'm working with and an example of the outcome I am looking for, – Tobiah Adam Aug 11 '20 at 13:44

2 Answers2

0

You can approach this by keeping a counter of when users are counted and then uncounted -- 30 (or perhaps 31) days later. Then, determine the "islands" of being counted, and aggregate. This involves:

  • Unpivoting the data to have an "enters count" and "leaves" count for each session.
  • Accumulate the count so on each day for each user you know whether they are counted or not.
  • This defines "islands" of counting. Determine where the islands start and stop -- getting rid of all the detritus in-between.
  • Now you can simply do a cumulative sum on each date to determine the 30 day session.

In SQL, this looks like:

with t as (
      select uid, date_trunc('day', session_start_time) as s_day, 1 as inc
      from users_sessions
      union all
      select uid, date_trunc('day', session_start_time) + interval '31 day' as s_day, -1
      from users_sessions
     ),
     tt as (  -- increment the ins and outs to determine whether a uid is in or out on a given day
      select uid, s_day, sum(inc) as day_inc,
             sum(sum(inc)) over (partition by uid order by s_day rows between unbounded preceding and current row) as running_inc
      from t
      group by uid, s_day
     ),
     ttt as (  -- find the beginning and end of the islands
      select tt.uid, tt.s_day,
             (case when running_inc > 0 then 1 else -1 end) as in_island
      from (select tt.*,
                   lag(running_inc) over (partition by uid order by s_day) as prev_running_inc,
                   lead(running_inc) over (partition by uid order by s_day) as next_running_inc
            from tt
           ) tt
      where running_inc > 0 and (prev_running_inc = 0 or prev_running_inc is null) or
            running_inc = 0 and (next_running_inc > 0 or next_running_inc is null)
     )
select s_day,
       sum(sum(in_island)) over (order by s_day rows between unbounded preceding and current row) as active_30
from ttt
group by s_day;

Here is a db<>fiddle.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Hi Gordon, thank you for the suggested answer! unfortunately, I'm getting the following error: "[Amazon](500310) Invalid operation: Aggregate window functions with an ORDER BY clause require a frame clause" – Tobiah Adam Aug 11 '20 at 13:48
  • @TobiahAdam . . . That is easily fixed. If you add a db<>fiddle using Postgres, it will be easier to see this work. – Gordon Linoff Aug 11 '20 at 13:51
  • my apologies as I am really fairly new to SQL and 100% self-taught through the aid of forums like this. I am unfamiliar with what a db<>fiddle is or how I add one. – Tobiah Adam Aug 11 '20 at 14:22
  • I noticed you updated your answer thank you! unfortunately, I now get the following error: '[Amazon](500310) Invalid operation: column "t.s_day" must appear in the GROUP BY clause or be used in an aggregate function' – Tobiah Adam Aug 11 '20 at 18:09
  • @TobiahAdam . . . I fixed the typos in the code and added a db<>fiddle. To be honest, you only had two columns in the table, so I should have created one myself originally to avoid the typos. – Gordon Linoff Aug 11 '20 at 22:31
0

I'm pretty sure the easier way to do this is to use a join. This creates a list of all the distinct users who had a session on each day and a list of all distinct dates in the data. Then it one-to-many joins the user list to the date list and counts the distinct users, the key here is the expanded join criteria that matches a range of dates to a single date via a system of inequalities.

with users as 
    (select
    distinct uid,
    date_trunc('day',session_start_time) AS dt
    from <table>
    where session_start_time >= '2021-05-01'),
dates as 
    (select
    distinct date_trunc('day',session_start_time) AS dt
    from <table>
    where session_start_time >= '2021-05-01')
select 
    count(distinct uid), 
    dates.dt 
    from users
join
    dates
    on users.dt >= dates.dt - 29
    and users.dt <= dates.dt
    group by dates.dt
    order by dt desc
;
user3130288
  • 53
  • 1
  • 3