Count of unique values in a rolling date range for R

Question

This question already has an answer for SQL, and I was able to implement that solution in R using sqldf. However, I've been unable to find a way to implement it using data.table.

The problem is to count the distinct values of one column within a rolling date range, e.g. (and quoting directly from the linked question) if the data looked like this:

Date   | email 
-------+----------------
1/1/12 | test@test.com
1/1/12 | test1@test.com
1/1/12 | test2@test.com
1/2/12 | test1@test.com
1/2/12 | test2@test.com
1/3/12 | test@test.com
1/4/12 | test@test.com
1/5/12 | test@test.com
1/5/12 | test@test.com
1/6/12 | test@test.com
1/6/12 | test@test.com
1/6/12 | test1@test.com

Then the result set would look something like this if we used a date period of 3 days

date   | count(distinct email)
-------+------
1/1/12 | 3
1/2/12 | 3
1/3/12 | 3
1/4/12 | 3
1/5/12 | 2
1/6/12 | 2

Here's the code to create the same data in R using data.table:

date <- as.Date(c('2012-01-01','2012-01-01','2012-01-01',
                  '2012-01-02','2012-01-02','2012-01-03',
                  '2012-01-04','2012-01-05','2012-01-05',
                  '2012-01-06','2012-01-06','2012-01-06'))
email <- c('test@test.com', 'test1@test.com','test2@test.com',
           'test1@test.com', 'test2@test.com','test@test.com',
           'test@test.com','test@test.com','test@test.com',
           'test@test.com','test@test.com','test1@test.com')
dt <- data.table(date, email)

Any help on this would be much appreciated. Thanks!

Edit 1:

This is a toy problem that I want to apply to a much larger data set, so use of Cartesian products is problematic. Instead, I'd like something equivalent to a correlated subquery in SQL, e.g. the solution from the question that I originally linked was:

SELECT day
     ,(SELECT count(DISTINCT email)
       FROM   tbl
       WHERE  day BETWEEN t.day - 2 AND t.day -- period of 3 days
      ) AS dist_emails
FROM   tbl t
WHERE  day BETWEEN '2012-01-01' AND '2012-01-06'  
GROUP  BY 1
ORDER  BY 1;

Edit 2: Here is some timing based on @MichaelChirico's solution, as requested by @jangorecki:

# The data
> dim(temp)
[1] 2627785       4
> head(temp)
         date category1 category2 itemId
1: 2013-11-08         0         2   1713
2: 2013-11-08         0         2  90485
3: 2013-11-08         0         2  74249
4: 2013-11-08         0         2   2592
5: 2013-11-08         0         2   2592
6: 2013-11-08         0         2    765
> uniqueN(temp$itemId)
[1] 13510
> uniqueN(temp$date)
[1] 127

# Timing for data.table
> system.time(dtTime <- temp[,
+   .(count = temp[.(seq.Date(.BY$date - 6L, .BY$date, "day"), 
+   .BY$category1, .BY$category2 ), uniqueN(itemId), nomatch = 0L]), 
+  by = c("date","category1","category2")])
   user  system elapsed 
  6.913   0.130   6.940 
> 
# Time for sqldf
> system.time(sqlDfTime <- 
+ sqldf(c("create index ldx on temp(date, category1, category2)",
+ "SELECT date, category1, category2,
+ (SELECT count(DISTINCT itemId)
+   FROM   temp
+   WHERE category1 = t.category1 AND category2 = t.category2 AND
+   date BETWEEN t.date - 6 AND t.date 
+   ) AS numItems
+ FROM temp t
+ GROUP BY date, category1, category2
+ ORDER BY 1;"))
   user  system elapsed 
 87.225   0.098  87.295

The outputs are eqivalent, but using data.table rather than sqldf resulted in a 12.5x speedup. Pretty substantial!

Hi, could you please explain the distinct email count, that way would be able to help. Thank you. — Sowmya S. Manian, Apr 10 '16 at 19:24
If you only want daily count of unique emails, how about `dt[,length(unique(email)),by=date]` ? — R.S., Apr 10 '16 at 20:06
@R.S. that doesn't appear to be desired output. It seems OP wants `#distinct emails among today, yesterday, and the day before`. Also `uniqueN` is superior to `length(unique())`. — MichaelChirico, Apr 10 '16 at 20:08
yeah, I think so too. That's why I did not add it as an answer. — R.S., Apr 10 '16 at 20:10
I haven't looked carefully at your question, but this seems similar / maybe helpful: http://stackoverflow.com/q/35993231 — Frank, Apr 10 '16 at 20:24
non-equi joins are currently being implemented in data.table. Once it's fully done, one should be able to do: `dt[.(Date=unique(Date)), uniqueN(email), by=.EACHI, on = .(Date-2L <= Date)]`. Will answer once done. — Arun, Apr 10 '16 at 21:01

MichaelChirico · Accepted Answer · 2019-04-19T04:02:08.827

Here's something that works, taking advantage of the new non-equijoins feature of data.table.

dt[dt[ , .(date3=date, date2 = date - 2, email)], 
   on = .(date >= date2, date<=date3), 
   allow.cartesian = TRUE
   ][ , .(count = uniqueN(email)), 
      by = .(date = date + 2)]
#          date V1
# 1: 2011-12-30  3
# 2: 2011-12-31  3
# 3: 2012-01-01  3
# 4: 2012-01-02  3
# 5: 2012-01-03  1
# 6: 2012-01-04  2

To be honest I'm a bit miffed on how this is working exactly, but the idea is to join dt to itself on date, matching any date that is between 2 days ago and today. I'm not sure why we have to clean up by setting date = date + 2 afterwards.

Here's an approach using keys:

setkey(dt, date)

dt[ , .(count = dt[.(seq.Date(.BY$date - 2L, .BY$date, "day")),
                   uniqueN(email), nomatch = 0L]), by = date]

re cleanup, `date` field is from `i` arg, this is base R consistency unexpected behavior, you can use `x.date` to take `date` from `x`. Described in [data.table#1615](https://github.com/Rdatatable/data.table/issues/1615). — jangorecki, Apr 10 '16 at 21:53

score 3 · Answer 2 · answered Jun 25 '16 at 01:04

With the recently implemented non-equi joins feature in the current development version of data.table, v1.9.7, this can be done as follows:

dt[.(date3=unique(dt$date2)), .(count=uniqueN(email)), on=.(date>=date3, date2<=date3), by=.EACHI]
#          date      date2 count
# 1: 2011-12-30 2011-12-30     3
# 2: 2011-12-31 2011-12-31     3
# 3: 2012-01-01 2012-01-01     3
# 4: 2012-01-02 2012-01-02     3
# 5: 2012-01-03 2012-01-03     1
# 6: 2012-01-04 2012-01-04     2

Count of unique values in a rolling date range for R

2 Answers2

Linked