2

I have a dataset that has a list of users that are connected to the server at every 15 minutes, e.g.

May 7, 2020, 8:09 AM   user1
May 7, 2020, 8:09 AM   user2
...
May 7, 2020, 8:24 AM   user1
May 7, 2020, 8:24 AM   user3
... 

And I'd like to get a number of active users for every day, e.g.

May 7, 2020   71
May 8, 2020   83

Now, the tricky part. An active user is defined if he/she has been connected 80% of the time or more across the last 7 days. This means that, if there are 672 15-minute intervals in a week (1440 / 15 x 7), then a user has to be displayed 538 (672 x 0.8) times.

My code so far is:

SELECT
    DATE_TRUNC('week', ts) AS ts_week
    ,COUNT(DISTINCT user)
FROM activeusers
GROUP BY 1

Which only gives a list of unique users connected at every week.

July 13, 2020, 12:00 AM   435
July 20, 2020, 12:00 AM   267

But I'd like to implement the active user definition, as well as get the result for every day, not just Mondays.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228

3 Answers3

2

The resulting special difficulty here is that users might qualify for days where they have no connections at all, if they were connected sufficiently during the previous 6 days.

That makes it harder to use a window function. Aggregating in a LATERAL subquery is the obvious alternative:

WITH daily AS (  -- ① granulate daily
   SELECT ts::date AS the_day
        , "user"
        , count(*)::int AS daily_cons
   FROM   activeusers
   GROUP  BY 1, 2
  )
SELECT d.the_day, count("user") AS active_users
FROM  ( --  ② time frame
   SELECT generate_series (timestamp '2020-07-01'
                         , LOCALTIMESTAMP
                         , interval '1 day')::date
   ) d(the_day)
LEFT   JOIN LATERAL (
   SELECT "user"
   FROM   daily d
   WHERE  d.the_day >= d.the_day - 6
   AND    d.the_day <= d.the_day
   GROUP  BY "user"
   HAVING sum(daily_cons) >= 538  -- ③
   ) sum7 ON true
ORDER  BY d.the_day;

① The CTE daily is optional, but starting with daily aggregates should help performance a lot.

② You'll have to define the time frame somehow. I chose the current year. Replace with your choice. To work with the total range present in your table, use instead:

SELECT generate_series (min(the_day)::timestamp
                      , max(the_day)::timestamp
                      , interval '1 day')::date AS the_day
FROM   daily

Consider basics here:

This also overcomes the "special difficulty" mentioned above.

③ The condition in the HAVING clause eliminates all rows with insufficient connections over the last 7 days (including "today").

Related:

Aside:
You wouldn't really use the reserved word "user" as identifier.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
1

Because you want the active user for every day but are determining by week, I think you might use a CROSS APPLY to duplicate the count for every day. The FROM part of the query will give you the days and the users, the CROSS APPLY will limit to active users. You can specify in the final WHERE what users or dates you want.

SELECT users.UserName, users.LogDate
FROM (
    SELECT UserName, CAST(ts AS DATE) AS LogDate
    FROM activeusers
    GROUP BY CAST(ts AS DATE)
    ) AS users
CROSS APPLY (
    SELECT UserName, COUNT(1)
    FROM activeusers AS a
    WHERE a.UserName = users.UserName AND CAST(ts AS DATE) BETWEEN DATEADD(WEEK, -1, LogDate) AND LogDate
    GROUP BY UserName
    HAVING COUNT(1) >= 538
    ) AS activeUsers
WHERE users.LogDate > '2020-01-01' AND users.UserName = 'user1'

This is SQL Server, you may need to make revisions for PostgreSQL. CROSS APPLY may translate to LEFT JOIN LATERAL (...) ON true.

llessurt
  • 555
  • 3
  • 14
  • @Erwin is correct that this does not include users that have no connections for the selected day. This could be remedied by starting with all users. – llessurt Jul 21 '20 at 13:59
1

I have done something similar to this for device monitoring reports. I was never able to come up with a solution that does not involve building a calendar and cross joining it to a distinct list of devices (user values in your case).

This deliberately verbose query builds the cross join, gets active counts per user and ddate, performs the running sum() over seven days, and then counts the number of users on a given ddate that had 538 or more actives in the seven days ending on that ddate.

with drange as (
  select min(ts) as start_ts, max(ts) as end_ts
    from activeusers
), alldates as (
  select (start_ts + make_interval(days := x))::date as ddate
    from drange
   cross join generate_series(0, date_part('day', end_ts - start_ts)::int) as gs(x)
), user_dates as (
  select ddate, "user"
    from alldates
   cross join (select distinct "user" from activeusers) u
), user_date_counts as (
  select u.ddate, u."user",
         sum(case when a.user is null then 0 else 1 end) as actives
    from user_dates u
    left join activeusers a
           on a."user" = u."user"
          and a.ts::date = u.ddate
   group by u.ddate, u."user"
), running_window as (
  select ddate, "user",
         sum(actives) over (partition by user
                                order by ddate
                         rows between 6 preceding
                                  and current row) seven_days
    from user_date_counts
), flag_active as (
  select ddate, "user",
         seven_days >= 538 as is_active
    from running_window
)
select ddate, count(*) as active_users
  from flag_active
 where is_active
 group by ddate
;

Mike Organek
  • 11,647
  • 3
  • 11
  • 26