rolling count of distinct users

Question

I want to calculate the rolling count of unique users with variable time windows. Here's an example of what I have and the outcome I want.

have <- data.frame(user = c(1, 2, 
                            2, 3, 
                            1, 2, 3, 
                            4, 
                            3, 4,
                            4),
                   when = lubridate::ymd("2020-01-01",
                                         "2020-01-01",
                                         "2020-01-02",
                                         "2020-01-02",
                                         "2020-01-03",
                                         "2020-01-03",
                                         "2020-01-03",
                                         "2020-01-05",
                                         "2020-01-06",
                                         "2020-01-06",
                                         "2020-01-07"))
have 
#   user       when
#1     1 2020-01-01
#2     2 2020-01-01
#3     2 2020-01-02
#4     3 2020-01-02
#5     1 2020-01-03
#6     2 2020-01-03
#7     3 2020-01-03 # note that Jan 4 is missing
#8     4 2020-01-05
#9     3 2020-01-06
#10    4 2020-01-06
#11    4 2020-01-07

want <- data.frame(when=c("2020-01-01",
                          "2020-01-02",
                          "2020-01-03",
                          "2020-01-04",
                          "2020-01-05",
                          "2020-01-06",
                          "2020-01-07"),
                   twoDayCount=c(2, # Jan 1: 1, 2
                                 3, # Jan 1-2: 1, 2, 3
                                 3, # Jan 2-3: 1, 2, 3
                                 3, # Jan 3-4: 1, 2, 3
                                 1, # Jan 4-5: 4
                                 2, # Jan 5-6: 3, 4
                                 2  # Jan 6-7: 3, 4
                                 )
                   )
want
#        when twoDayCount
#1 2020-01-01           2 # users: 1, 2
#2 2020-01-02           3 # users: 1, 2, 3
#3 2020-01-03           3 # users: 1, 2, 3
#4 2020-01-04           3 # users: 1, 2, 3
#5 2020-01-05           1 # users: 4
#6 2020-01-06           2 # users: 3, 4
#7 2020-01-07           2 # users: 3, 4

I've tried a few approaches but they have me counting all rows per window, not distinct users per window. For instance, the desired 2-day unique user count on Jan 3 is 3 (users 1, 2, 3), not 5 rows (with users 2 and 3 appearing twice each).

My actual use case needs the rolling window period (2 days in this example) to be an input.

Ideally the solution works with functions that {dbplyr} can translate to sql or via native sql that can be run with {dbplyr}.

This answer gives an idea for how to solve with sql:

SELECT when, count(DISTINCT user) AS dist_users 
FROM  (SELECT generate_series('2020-01-01'::date, '2020-01-07'::date, '1d')::date) AS g(when) 
LEFT   JOIN tbl t ON t.when BETWEEN g.when - 2 AND g.when 
GROUP  BY 1 
ORDER  BY 1;

wurli · Accepted Answer · 2020-07-22T20:10:05.307

4

Using functions from dplyr and tidyr, for the 1-day window case:

have %>% 
  group_by(when) %>% 
  summarise(twoDayCount = n_distinct(user))

For larger windows:

window <- 2
have %>% 
  rowwise() %>% 
  mutate(when = list(when + lubridate::days(0:(window - 1)))) %>% 
  unnest(cols = when) %>%
  group_by(when) %>% 
  summarise(twoDayCount = n_distinct(user))

Note that this method will give you rows for a few later dates (in this case Jan 08), which you might want to remove.

If performance is an issue for larger datasets, here is a much faster (but slightly less elegant) solution:

window <- 2
seq.Date(min(have$when), max(have$when), by = "day") %>% 
  purrr::map(function(date) {
    have %>% 
        filter(when <= date, when >= date - days(window - 1))  %>%
        summarise(userCount = n_distinct(user)) %>%
        mutate(when = date)
    }) %>% 
  bind_rows()

edited Jul 22 '20 at 20:10

answered Jul 22 '20 at 14:47

wurli

2,314
10
17

Thanks, @BluVoxe. Very creative answer. I like it. Trouble is that I can't get it to finish running on my actual use case with >1M rows and windows of up to 30 days :( – Eric Green Jul 22 '20 at 18:36
1

I can see how that would start to hit performance! I like the problem so I'll have a crack at a higher performance solution :) – wurli Jul 22 '20 at 18:40
1

@EricGreen. If you have a size and speed problem, you likely want to switch to data.table. Change date to day since a particular date. Then set date and user as keys. You could then loop through each day, subset the previous 30 days, work out the unique users, and thanks to the binary search ability of data.table (which I think works both in subsetting and getting unique values), this might give you a huge speed up. If I wasn't busy I would work up a solution – Robert Wilson Jul 22 '20 at 19:54
1

@EricGreen I've now edited the answer to give a solution that performs much faster on larger datasets. If it solves your issue do you mind accepting the solution? – wurli Jul 22 '20 at 20:08

score 1 · Answer 2 · answered Jul 22 '20 at 15:25

It is probably a bit clumsy with the loop. But seems to work...

want <- data.frame(when = seq.Date(min(have$when), max(have$when), by = 1), 
                   twoDayCount = NA)

for (iDate in min(want$when):(max(want$when))) {
  dateWindow = c(iDate, iDate - 1)
  uniqueUsers = unique(have$user[have$when %in% dateWindow])
  want$twoDayCount[want$when == iDate] = length(uniqueUsers)
}

        when twoDayCount
1 2020-01-01           2
2 2020-01-02           3
3 2020-01-03           3
4 2020-01-04           3
5 2020-01-05           1
6 2020-01-06           2
7 2020-01-07           2

score 1 · Answer 3 · answered Jul 22 '20 at 15:39

This probably will not port to dbplyr. But you can approach this using a tidyverse approach.

You first want to create a nested dataframe. 3 columns. First is date. Second are the users for that date, the second are users from the previous day (if available). You can then use purrr::map2 to apply a function to those data sets to find out how many unique users you have.

library(dplyr)
library(lubridate)
library(tidyr)
library(purrr)

# A function to get the number of distinct elements in a couple of dfs
num_distinct <- function(x,y){
  length(unique(c(x$user,y$user)))
}


df <- have %>% 
  distinct() %>% 
  group_by(when) %>% 
  nest() %>% 
  ungroup() %>% 
  inner_join(
    have %>% 
      distinct() %>% 
      group_by(when) %>% 
      nest()  %>% 
      ungroup() %>% 
      mutate(when = when + days(1)) %>% 
      rename(lag = data)
  ) 
  # calculate the rolling number of uniques
  df %>% 
  mutate(rolling = map2(data, lag, num_distinct)) %>% 
  select(-data, -lag) %>% 
  unnest(rolling)

This only shows results for dates with actual 2 day periods available, so may need to be modified depending on whether you what you want included.

score 1 · Answer 4 · answered Jul 22 '20 at 20:41

A scalable solution for very large datasets would be to use data.table. In the example below I show how this would work if day was number of days since the start date.

library(tidyverse)
library(data.table)

window <- 30
dt <- tibble(day = seq(1:10000)) %>% 
  mutate(user = purrr::map(day, function(.) sample(1:10000, 10000, replace = TRUE))) %>% 
  unnest(user) %>% 
  as.data.table()

all_res <- list()

setkey(dt, day)

tracker <- 1
for(dd in unique(dt$day)){

  sub_dd <- dt[.(max(1,(dd-window)):dd)]

  all_res[[tracker]] <- tibble(day = dd, users = 
     length(unique(sub_dd[,user])))

  tracker <- tracker + 1

 }

all_res <- all_res %>% 
  bind_rows()

The key here is setting the key, which enables data.table to use binary search to speed up the filtering https://cran.r-project.org/web/packages/data.table/vignettes/datatable-keys-fast-subset.html.

rolling count of distinct users

4 Answers4