I'm trying to work around a problem that has arisen due to the size of my data, and that I haven't been able to find an answer to (see also: Data.table: how to get the blazingly fast subsets it promises and apply to a second data.table).
This is the dummy data:
library(dplyr)
library(tidyr)
library(lubridate)
library(data.table)

# one row per (ID, timepoint) combination
adherence <- data.table(
  ID   = c("1", "2", "3", "1", "2", "3"),
  year = ymd(c("2013-01-01", "2013-01-01", "2013-01-01",
               "2013-02-01", "2013-02-01", "2013-02-01"))
)

# one row per prescription: start date (eksd) and days supplied (DDD)
lsr <- data.table(
  ID   = c("1", "1", "1", "2", "2", "2", "3", "3"),
  eksd = as.Date(c("2012-03-01", "2012-08-02", "2013-01-06", "2012-08-25",
                   "2013-03-22", "2013-09-15", "2011-01-01", "2013-01-05")),
  DDD  = c(60, 90, 90, 60, 120, 60, 30, 90)
)
lsr[, ENDDATE := eksd + DDD]  # date the supply runs out
I have tried different approaches to get this result. A cartesian join gives me more than 2^31 rows and won't work, and rewriting everything in data.table grouped operations actually increased the run time by days. I've found that if I can get the line below to return the desired result, I can wrap it in a for loop over "2013-02-01" and the 500 other timepoints and achieve my dream (of moving on to the next issue). A single subset like the one below takes only 15 s on my data, so I could run all the timepoints in a few hours. My problem is that it returns only the groups with a non-empty subset: ID 2 is dropped, I think because that group has no matching rows in i (which is presumably what makes the operation fast).
lsr[eksd <= as.Date("2013-02-01") & ENDDATE > as.Date("2013-02-01"),
    sum(as.numeric(ENDDATE - as.Date("2013-02-01"))), keyby = ID]
ID V1
1: 1 64
2: 3 63
Under most circumstances that is clever, but I need the information about the groups of length 0 as well (with 0 or whatever value; I just need not to drop the ID information). Something like this:
ID V1
1: 1 64
2: 2 0
3: 3 63
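The only pure data.table workaround I've come up with so far is to right-join the per-date sums back onto the full ID list and fill the gaps with 0. It produces the output above on the dummy data, but I don't know whether an extra join per timepoint is the right way to go:

res <- lsr[eksd <= as.Date("2013-02-01") & ENDDATE > as.Date("2013-02-01"),
           .(V1 = sum(as.numeric(ENDDATE - as.Date("2013-02-01")))), keyby = ID]
res <- res[unique(lsr[, .(ID)]), on = "ID"]  # right join: one row per ID, V1 is NA for empty groups
res[is.na(V1), V1 := 0]                      # make the dropped groups explicit zeros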
I also tried the tidyr::complete function (as explained here: dplyr summarise: Equivalent of ".drop=FALSE" to keep groups with zero length in output), but dplyr is way too slow: it takes 7 hours on 0.2% of my data. I'm sure this can be achieved somehow. Any suggestions are welcome and appreciated.
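For what it's worth, I've also wondered whether a non-equi join against the adherence dates could replace the loop entirely. Here is a minimal sketch that seems to work on the dummy data, though I haven't benchmarked it and I'm unsure about the details (by = .EACHI keeps a row for every (ID, year) pair, na.rm = TRUE turns the NA of empty groups into 0, and the x./i. prefixes are needed because the join columns are overwritten in the result):

res <- lsr[adherence,
           on = .(ID, eksd <= year, ENDDATE > year),
           .(V1 = sum(as.numeric(x.ENDDATE - i.year), na.rm = TRUE)),
           by = .EACHI]

If that is a sensible direction, pointers on whether it can match the 15 s subsets would be great too.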