I'm trying to enrich one dataset (adherence) using subsets of another (lsr). For each row in adherence, I want to calculate (as a third column) the medication available for implementing the prescribed regimen, i.e. how many days of medication the person still has on hand on that date. I have a function that returns the relevant result, but it runs for days on just a subset of the total data I have to run it on.
The datasets are:
library(dplyr)
library(tidyr)
library(lubridate)
library(data.table)
adherence <- data.frame(
  ID   = c("1", "2", "3", "1", "2", "3"),
  year = c("2013-01-01", "2013-01-01", "2013-01-01",
           "2013-02-01", "2013-02-01", "2013-02-01"),
  stringsAsFactors = FALSE
)
adherence$year <- ymd(adherence$year)

lsr <- data.frame(
  ID   = c("1", "1", "1", "2", "2", "2", "3", "3"),
  eksd = c("2012-03-01", "2012-08-02", "2013-01-06", "2012-08-25",
           "2013-03-22", "2013-09-15", "2011-01-01", "2013-01-05"),
  DDD  = c(60, 90, 90, 60, 120, 60, 30, 90),
  stringsAsFactors = FALSE
)
lsr$eksd <- as.Date(lsr$eksd)
lsr$ENDDATE <- lsr$eksd + lsr$DDD
lsr <- as.data.table(lsr)
adherence <- as.data.table(adherence)
I'm used to working with dplyr, but it was much slower, so I rewrote things with data.table to try it out. It is driving me crazy that my colleagues working with SAS claim this wouldn't take them long, when it takes me hours just to load the data into RAM (fread crashes R on several of my datasets). adherence is 1.5 million rows, and lsr is a few hundred million rows.
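In case it's relevant, the next thing I'd try for the loading problem is limiting fread to the columns I actually need and fixing the types up front, along these lines (the file name lsr.csv is made up here):

lsr <- fread("lsr.csv",
             select     = c("ID", "eksd", "DDD"),
             colClasses = list(character = c("ID", "eksd"), numeric = "DDD"))
lsr[, eksd := as.Date(eksd)]   # parse dates after the read to keep the read itself cheap
lsr[, ENDDATE := eksd + DDD]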
My working function is
# For one adherence row x (coerced to a character vector by apply()), sum the days
# of medication still available on date x[2] across ID x[1]'s covering prescriptions.
function.AH <- function(x) {
  lsr[ID == x[1] & eksd <= x[2] & ENDDATE > x[2],
      ifelse(.N == 0, 0, sum(as.numeric(ENDDATE - as.Date(x[2]))))]
}
setkey(lsr, ID, eksd, ENDDATE)
adherence$AH <- apply(adherence, 1, FUN = function.AH)   # DESIRED OUTPUT
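For the small example above, AH should come out as 0, 0, 0, 64, 0, 63 for the six rows (computed by hand, so worth double-checking): nobody is covered on 2013-01-01, and only ID 1 and ID 3 have a prescription covering 2013-02-01, with 64 and 63 days of medication left respectively.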
I don't know the best approach: I've looked into using an SQL database, but as I understand it that shouldn't be faster when my data fits in RAM (I have 256 GB). Since the adherence data.table is actually each individual ID repeated for 500 time periods (i.e. ID 1 at time 1, time 2, time 3, ..., time 500; ID 2 at time 1, time 2, ...; etc.), I've also considered grouping lsr by ID and somehow embedding this time interval (1:500) in the expression in j; a rough sketch of doing the calculation inside data.table follows below.
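For illustration, here is an untested sketch of the kind of non-equi join I imagine could do the same calculation inside data.table instead of via apply() (the temporary name covered is made up, and I'm not certain about the x./i. prefixes or that by = .EACHI behaves as I expect):

covered <- lsr[adherence,
               on = .(ID, eksd <= year, ENDDATE > year),
               # x.ENDDATE / i.year are lsr's ENDDATE and adherence's year;
               # na.rm = TRUE should turn the no-match case (all NA) into 0
               .(AH = sum(as.numeric(x.ENDDATE - i.year), na.rm = TRUE)),
               by = .EACHI]   # one result row per adherence row, in the same order
adherence[, AH := covered$AH]

Even if that is roughly the right direction, I'd still like to know whether it will cope with a few hundred million rows in lsr.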
I hope that someone can point out how I'm using the apply function inefficiently by not somehow applying it inside the data.table framework, and thus losing the built-in efficiency. But as I'm going to be working with this data and similar sizes of data, I'd appreciate any specific suggestions for solving this faster, or general suggestions for getting faster running times using other methods.