
I'm trying to enrich one dataset (adherence) based on subsets from another (lsr). For each individual row in adherence, I want to calculate (as a third column) the medication available for implementing the prescribed regimen. I have a function that returns the relevant result, but it runs for days on just a subset of the total data I have to run it on.

The datasets are:

library(dplyr)
library(tidyr)
library(lubridate)
library(data.table)

adherence <- cbind.data.frame(c("1", "2", "3", "1", "2", "3"), c("2013-01-01", "2013-01-01", "2013-01-01", "2013-02-01", "2013-02-01", "2013-02-01"))
names(adherence)[1] <- "ID" 
names(adherence)[2] <- "year"
adherence$year <- ymd(adherence$year)

lsr <- cbind.data.frame(
  c("1", "1", "1", "2", "2", "2", "3", "3"), #ID
  c("2012-03-01", "2012-08-02", "2013-01-06","2012-08-25", "2013-03-22", "2013-09-15", "2011-01-01", "2013-01-05"), #eksd
  c("60", "90", "90", "60", "120", "60", "30", "90") # DDD
)
names(lsr)[1] <- "ID"
names(lsr)[2] <- "eksd"
names(lsr)[3] <- "DDD"

lsr$eksd <- as.Date((lsr$eksd))
lsr$DDD <- as.numeric(as.character(lsr$DDD))
lsr$ENDDATE <- lsr$eksd + lsr$DDD
lsr <- as.data.table(lsr)

adherence <- as.data.table(adherence)

I'm used to working with dplyr, but it was much slower, so I rewrote things in data.table to try it out. It is driving me crazy that my colleagues working with SAS claim that this wouldn't take long for them, when it takes me hours just to load the data into RAM (fread crashes R on several of my datasets). adherence is 1.5 million rows, and lsr is a few hundred million rows.

My working function is

# For one row x of adherence (x[1] = ID, x[2] = year), sum the days of
# medication still available from prescriptions that cover that date
function.AH <- function(x) {
  lsr[ID == x[1] & eksd <= x[2] & ENDDATE > x[2],
      ifelse(.N == 0, 0, sum(as.numeric(ENDDATE - as.Date(x[2]))))]
}
setkey(lsr, ID, eksd, ENDDATE)
adherence$AH <- apply(adherence, 1, FUN = function.AH) # DESIRED OUTPUT

I don't know the best approach: I've looked into using a SQL database, but as I understand it this shouldn't be faster when my data fits into RAM (I have 256 GB). Since the adherence data.table is actually each individual ID repeated for 500 time periods (i.e. ID 1 at time 1, time 2, time 3, ..., time 500; ID 2 at time 1, time 2, etc.), I also considered using by on ID on lsr and somehow embedding this time interval (1:500) in the function in j.

I hope that someone can point out how I'm using the apply function inefficiently by not applying it inside the data.table framework, thereby losing the built-in efficiency. But as I'm going to be working with this data and data of similar size, I'd appreciate any specific suggestions for solving this faster, or general suggestions for getting faster running times with other methods.

Jakn09ab
  • Does your production data contain any overlapping periods in `lsr` for one particular `ID`? If not, then the call to `sum()` inside of `function.AH()` is not required because it is called once for each row of `adherence`. – Uwe Jan 23 '18 at 09:19

2 Answers


This can be solved by updating in a non-equi join.

This avoids the memory issues caused by a cartesian join or by calling apply(), which coerces the data.frame or data.table to a matrix and thereby copies the data.
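A quick illustration of that coercion, using the sample adherence table from the question (this snippet only shows the effect and is not part of the solution):

# apply() works row-wise on as.matrix(adherence), so the mixed-type table
# is first turned into a character matrix before function.AH ever runs
m <- as.matrix(adherence)
typeof(m)   # "character" - both ID and year arrive in function.AH as strings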

In addition, the OP has mentioned that lsr has a few hundred million rows and adherence has 1.5 million rows (500 time periods times 3000 IDs). Therefore, efficient storage of data items will not only reduce the memory footprint but may also reduce the share of processing time required for loading the data.

library(data.table)
# coerce to data.table by reference, i.e., without copying
setDT(adherence)
setDT(lsr)
# coerce to IDate to save memory
adherence[, year := as.IDate(year)]
cols <- c("eksd", "ENDDATE")
lsr[, (cols) := lapply(.SD, as.IDate), .SDcols = cols]
# update in a non-equi join
adherence[lsr, on = .(ID, year >= eksd, year < ENDDATE),
          AH := as.integer(ENDDATE - x.year)][]
   ID       year AH
1:  1 2013-01-01 NA
2:  2 2013-01-01 NA
3:  3 2013-01-01 NA
4:  1 2013-02-01 64
5:  2 2013-02-01 NA
6:  3 2013-02-01 63

Note that NA indicates that no match was found. If required, the AH column can be initialised before the non-equi join by adherence[, AH := 0L].
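A minimal sketch of that variant, repeating the join from above:

# initialise AH so rows without a matching prescription end up as 0 instead of NA
adherence[, AH := 0L]
adherence[lsr, on = .(ID, year >= eksd, year < ENDDATE),
          AH := as.integer(ENDDATE - x.year)][]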

Data

The code to create the sample datasets can be streamlined:

adherence <- data.frame(
  ID = c("1", "2", "3", "1", "2", "3"), 
  year = as.Date(c("2013-01-01", "2013-01-01", "2013-01-01", "2013-02-01", "2013-02-01", "2013-02-01")),
  stringsAsFactors = FALSE)

lsr <- data.frame(
  ID = c("1", "1", "1", "2", "2", "2", "3", "3"),
  eksd = as.Date(c("2012-03-01", "2012-08-02", "2013-01-06","2012-08-25", "2013-03-22", "2013-09-15", "2011-01-01", "2013-01-05")),
  DDD = as.integer(c("60", "90", "90", "60", "120", "60", "30", "90")),
  stringsAsFactors = FALSE)
lsr$ENDDATE <- lsr$eksd + lsr$DDD

Note that DDD is of type integer which usually requires 4 bytes instead of 8 bytes for type numeric/double.
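A rough way to see that difference (illustration only, with one million dummy values):

x_int <- rep(1L, 1e6)   # integer storage
x_dbl <- rep(1,  1e6)   # double storage
object.size(x_int)      # about 4 MB
object.size(x_dbl)      # about 8 MB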

Also note that the last statement may cause the whole data object lsr to be copied. This can be avoided by using data.table syntax which updates by reference.

library(data.table)
setDT(lsr)[, ENDDATE := eksd + DDD][]
Uwe

I am not sure why your function is slow (I think you could remove the ifelse call), but I would propose using a merge to be faster and to operate on one table only:

bob <- rbindlist(lapply(unique(adherence$year), function(x){
  # join only the adherence rows of one time period to keep the merge small
  plouf <- lsr[adherence[year == x], on = "ID"]
  plouf[, year := as.Date(year)]
  plouf[year >= eksd & year < ENDDATE,
        list(sum = sum(as.numeric(ENDDATE - as.Date(year))), year = year), by = ID]
}))
bob

   ID sum       year
1:  1  64 2013-02-01
2:  3  63 2013-02-01

You can then merge this back onto adherence:

adherence <- setDT(adherence)
adherence[,year := as.Date(year)]
bob[adherence, on = .(ID,year)]
   ID sum       year
1:  1  NA 2013-01-01
2:  2  NA 2013-01-01
3:  3  NA 2013-01-01
4:  1  64 2013-02-01
5:  2  NA 2013-02-01
6:  3  63 2013-02-01

For reading your data, use the fread() function, which is fast for big data.
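A hedged sketch of how fread() can be constrained to reduce memory pressure; the file name is a placeholder and the column list assumes only ID, eksd and DDD are needed:

library(data.table)
# "lsr.csv" is a placeholder for the real file
lsr <- fread("lsr.csv",
             select     = c("ID", "eksd", "DDD"),   # read only the needed columns
             colClasses = c(ID = "character", eksd = "character", DDD = "integer"))
lsr[, eksd := as.IDate(eksd)]   # convert after reading; IDate is stored as integer
# if the full file crashes the session, test the import on a small sample first
test <- fread("lsr.csv", nrows = 1000)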

denis
  • Thanks for the suggestion. I'll try to reload my data, and understand your suggestion. Regarding fread(): for some reason it crashes Rstudio on most of my datasets. I'll get back to you when I have some experience with it. Thanks again. – Jakn09ab Jan 15 '18 at 20:25
  • I get this error on the first line (plouf <- ) Error in vecseq(f_, len_, if (allow.cartesian) NULL else as.integer(max(nrow(x), : Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including j and dropping by (by-without-by) so that j runs for each group to avoid the large allocation... The rest seems to work. I think you hope to hold every combination in the RAM? which is what my first thought was, but it wasn't possi – Jakn09ab Jan 16 '18 at 08:33
  • OK, you have really large data. I don't have the time right now; I will try another solution later. You will need to do several joins, as specified in the alert message. – denis Jan 16 '18 at 08:40
  • I really appreciate that you'll give it a shot. I have been trying to solve this for months. – Jakn09ab Jan 16 '18 at 09:01
  • I changed the approach. It doesn't use allow.cartesian = TRUE, to avoid an enormous merge, and uses lapply to loop over the different dates per ID in adherence. Give it a try. – denis Jan 16 '18 at 14:08
  • Just to be sure: I use the code as it is now, but change allow.cartesian=TRUE to = FALSE? I tried it either way, but both return the same error. – Jakn09ab Jan 16 '18 at 18:17