As indicated in the answer to the related question Data.table: how to get the blazingly fast subsets it promises and apply to a second data.table, this can be solved by updating in a non-equi join, which is possible with data.table. The difference to the linked question is that here we need to create the cross join CJ() of all unique IDs with the vector of dates ourselves before joining with lsr.
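For illustration, CJ() returns a data.table with all combinations of its (sorted) arguments; a toy example with two IDs and two dates:
library(data.table)
CJ(ID = c("1", "2"), date = as.IDate(c("2013-02-01", "2013-03-01")))
   ID       date
1:  1 2013-02-01
2:  1 2013-03-01
3:  2 2013-02-01
4:  2 2013-03-01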
The OP has provided a series of dates time.months whose definition
time.months <- as.Date("2013-02-01")+(365.25/12)*(0:192) #dates to evaluate at
leads to "crooked" dates, which only becomes visible when they are coerced to numeric or POSIXct:
head(lubridate::as_datetime(time.months))
[1] "2013-02-01 00:00:00 UTC" "2013-03-03 10:30:00 UTC" "2013-04-02 21:00:00 UTC"
[4] "2013-05-03 07:30:00 UTC" "2013-06-02 18:00:00 UTC" "2013-07-03 04:30:00 UTC"
The issue is that these "dates" are not aligned with midnight but start somewhere during the day. To avoid these ambiguities, the seq() function can be used:
dates <- seq(as.Date("2013-02-01"), length.out = 193, by = "month")
which creates a series of dates starting on the first day of each month.
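Repeating the coercion check from above confirms that these dates are aligned with midnight (so the time component is dropped when printing):
head(lubridate::as_datetime(dates))
[1] "2013-02-01 UTC" "2013-03-01 UTC" "2013-04-01 UTC" "2013-05-01 UTC"
[5] "2013-06-01 UTC" "2013-07-01 UTC"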
In addition, data.table's IDate class is used, which stores dates as integers (4 bytes) instead of double (8 bytes). This saves memory as well as processing time because the usually faster integer arithmetic can be used.
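The smaller footprint can be checked directly with base R's object.size(); the exact byte counts vary slightly with attribute overhead:
object.size(dates)            # Date: stored as double, 8 bytes per element
object.size(as.IDate(dates))  # IDate: stored as integer, 4 bytes per element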
# coerce Date to IDate
idates <- as.IDate(dates)
setDT(lsr)[, eksd := as.IDate(eksd)][, ENDDATE := as.IDate(ENDDATE)]
# cross join unique IDs with dates
CJ(ID = lsr$ID, date = idates, unique = TRUE)[
# initialize result column
, AH := 0L][
# non-equi join and ...
lsr, on = .(ID, date >= eksd, date < ENDDATE),
# ... update only matching rows
AH := as.integer(ENDDATE - x.date)][
# reshape from long to wide format
, dcast(.SD, ID ~ date)]
ID 2013-02-01 2013-03-01 2013-04-01 2013-05-01 2013-06-01 2013-07-01 2013-08-01 [...]
1: 1 64 36 5 0 0 0 0
2: 2 0 0 110 80 49 19 0
3: 3 63 35 4 0 0 0 0
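As a cross-check against the sample data shown in the benchmark section below: ID 1's prescription starting 2013-01-06 with DDD = 90 runs until ENDDATE = 2013-04-06, so 2013-04-06 - 2013-02-01 = 64 days remain at the first evaluation date, matching the first value in the table.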
Caveat
Note that the above code assumes that the intervals [eksd, ENDDATE) for each ID do not overlap. This can be verified by
lsr[order(eksd), all(eksd - shift(ENDDATE, fill = 0) > 0), keyby = ID]
ID V1
1: 1 TRUE
2: 2 TRUE
3: 3 TRUE
In case there are overlaps, the above code can be modified to aggregate within the non-equi join using by = .EACHI, as sketched below.
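A minimal sketch of that variant, assuming the remaining days of overlapping intervals should be summed per ID and date (the names cj and res are mine, not part of the original code):
# aggregate per row of cj: sum remaining days over all intervals covering each date
cj <- CJ(ID = lsr$ID, date = idates, unique = TRUE)
res <- lsr[cj, on = .(ID, eksd <= date, ENDDATE > date),
           .(AH = sum(as.integer(x.ENDDATE - i.date), na.rm = TRUE)),
           by = .EACHI]
# the join columns eksd and ENDDATE now carry the date values from cj
dcast(res, ID ~ eksd, value.var = "AH")
Non-matching ID/date combinations yield a sum over NA which na.rm = TRUE turns into 0, reproducing the zero-initialization of the original approach.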
Benchmark
In another related question, data.table by = xx How do i keep the groups of length 0 when i returns no match, the OP has pointed out that performance is crucial due to the size of his production data. According to the OP's comments, lsr has 20 million rows and 12 columns, while the adherence dataset he is trying to avoid creating has 1.5 million rows of 2 columns. In another question, the OP mentions that lsr may be a few hundred million rows.
@minem has responded to this by providing a benchmark in his answer. We can use this benchmark data to compare the different answers.
# create benchmark data
lsr <- data.frame(
ID = c("1", "1", "1", "2", "2", "2", "3", "3"),
eksd = as.Date(c("2012-03-01", "2012-08-02", "2013-01-06","2012-08-25", "2013-03-22", "2013-09-15", "2011-01-01", "2013-01-05")),
DDD = as.integer(c("60", "90", "90", "60", "120", "60", "30", "90")),
stringsAsFactors = FALSE)
lsr$ENDDATE <- lsr$eksd + lsr$DDD
n <- 5e4
lsr2 <- lapply(1:n, function(x) lsr)
lsr2 <- rbindlist(lsr2, use.names = TRUE, fill = TRUE, idcol = TRUE)
lsr2[, ID := as.integer(paste0(.id, ID))]
Thus, the benchmark dataset consists of 400 k rows and 150 k unique IDs:
lsr2[, .(.N, uniqueN(ID))]
N V2
1: 400000 150000
# pull data preparation out of the benchmark
lsr2i <- copy(lsr2)[, eksd := as.IDate(eksd)][, ENDDATE := as.IDate(ENDDATE)]
lsr2[, ENDDATE2 := as.numeric(ENDDATE)]
# define date series
dates <- seq(as.Date("2013-02-01"), length.out = 193, by = "month")
idates <- seq(as.IDate("2013-02-01"), length.out = 193, by = "month")
# run benchmark
library(microbenchmark)
bm <- microbenchmark(
  minem = {
    dt <- copy(lsr2)
    xtot <- lapply(dates, function(d) {
      d <- as.numeric(d)
      x <- dt[eksd <= d & ENDDATE > d, sum(ENDDATE2 - d), keyby = ID]
      uid <- unique(dt$ID)
      id2 <- setdiff(uid, x$ID)
      if (length(id2) > 0) {
        x2 <- data.table(ID = id2, V1 = 0)
        x <- rbind(x, x2)
      }
      setkey(x, ID)
      x
    })
    for (x in seq_along(xtot)) {
      setnames(xtot[[x]], c("ID", paste0("V", x)))
    }
    xtot <- Reduce(function(...) merge(..., all = TRUE, by = "ID"), xtot)
    xtot
  },
  uwe = {
    dt <- copy(lsr2i)
    CJ(ID = dt$ID, date = idates, unique = TRUE)[, AH := 0L][
      dt, on = .(ID, date >= eksd, date < ENDDATE),
      AH := as.integer(ENDDATE - x.date)][, dcast(.SD, ID ~ date)]
  },
  times = 1L
)
print(bm)
The result for one run shows that the non-equi join is more than 4 times faster than the lapply() approach, which has to subset the full table once for each of the 193 dates.
Unit: seconds
expr min lq mean median uq max neval
minem 27.654703 27.654703 27.654703 27.654703 27.654703 27.654703 1
uwe 5.958907 5.958907 5.958907 5.958907 5.958907 5.958907 1