
I'm trying to work around a problem that has arisen due to the size of my data and that I haven't been able to find an answer to (see: Data.table: how to get the blazingly fast subsets it promises and apply to a second data.table).

This is the dummy data.

library(dplyr)
library(tidyr)
library(lubridate)
library(data.table)

adherence <- cbind.data.frame(
  c("1", "2", "3", "1", "2", "3"), # ID
  c("2013-01-01", "2013-01-01", "2013-01-01", "2013-02-01", "2013-02-01", "2013-02-01") # year
)
names(adherence) <- c("ID", "year")
adherence$year <- ymd(adherence$year)

lsr <- cbind.data.frame(
  c("1", "1", "1", "2", "2", "2", "3", "3"), #ID
  c("2012-03-01", "2012-08-02", "2013-01-06","2012-08-25", "2013-03-22", "2013-09-15", "2011-01-01", "2013-01-05"), #eksd
  c("60", "90", "90", "60", "120", "60", "30", "90") # DDD
)
names(lsr) <- c("ID", "eksd", "DDD")

lsr$eksd <- as.Date(lsr$eksd)
lsr$DDD <- as.numeric(as.character(lsr$DDD))
lsr$ENDDATE <- lsr$eksd + lsr$DDD
lsr <- as.data.table(lsr)

adherence <- as.data.table(adherence)

I have tried different methods of achieving the result: a cartesian join gives me more than 2^31 rows and won't work, and rewriting everything in data.table reduced the run time by days. I've found that if I can get the line below to return the desired result, I can wrap it in a for loop over "2013-02-01" and 500 other timepoints and achieve my dream (of moving on to the next issue). A single subset takes only 15 s on my data (so I could run all timepoints in a few hours), but my problem is that it returns only the groups with a non-empty subset. ID 2 is not returned, I think, because the group has no match in i, which reduces the time spent on the operation.

lsr[eksd <= as.Date("2013-02-01") & ENDDATE > as.Date("2013-02-01"), sum(as.numeric(ENDDATE - as.Date("2013-02-01"))), keyby = ID]


   ID V1
1:  1 64
2:  3 63

Under most circumstances that is clever, but I need the information about the groups of length 0 (or whatever value; I just need not to drop the ID information), somehow like this:

   ID V1
1:  1 64
2:  2  0
3:  3 63

I tried using the tidyr::complete function (as explained here: dplyr summarise: Equivalent of ".drop=FALSE" to keep groups with zero length in output), but dplyr is way too slow: it takes 7 hours on 0.2% of my data. I'm sure this can be achieved somehow. Any suggestions are welcome and appreciated.
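
A simplified sketch of what that attempt looked like (illustrative only; my real pipeline differs, but the filter condition matches the data.table line above):

lsr %>%
  filter(eksd <= as.Date("2013-02-01"), ENDDATE > as.Date("2013-02-01")) %>%
  group_by(ID) %>%
  summarise(V1 = sum(as.numeric(ENDDATE - as.Date("2013-02-01")))) %>%
  complete(ID = unique(lsr$ID), fill = list(V1 = 0)) # restore dropped IDs with 0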


3 Answers


For speed reasons I would suggest sticking with your first approach and simply adding the necessary zeros:

by_minem <- function(dt = lsr2) {
  # aggregate as before: only IDs with at least one matching row are returned
  x <- dt[eksd <= as.Date("2013-02-01") & ENDDATE > as.Date("2013-02-01"),
          sum(as.numeric(ENDDATE - as.Date("2013-02-01"))), keyby = ID]
  # find the dropped IDs and bind them back with a zero
  uid <- unique(dt$ID)
  id2 <- uid[!(uid %in% x$ID)]
  x2 <- data.table(ID = id2, V1 = 0)
  x <- rbind(x, x2)
  setkey(x, ID)
  x
}
by_minem(lsr)
#    ID V1
# 1:  1 64
# 2:  2  0
# 3:  3 63

Test on larger data:

# Create larger data:
n <- 5e4
lsr2 <- lapply(1:n, function(x) lsr)
lsr2 <- rbindlist(lsr2, use.names = TRUE, fill = TRUE, idcol = TRUE)
lsr2[, ID := as.integer(paste0(.id, ID))]
lsr2[, .(.N, uniqueN(ID))]
#         N     V2
# 1: 400000 150000

by_henry <- function(dt = lsr2) {
  dt[, sum((eksd <= as.Date("2013-02-01") & ENDDATE > as.Date("2013-02-01")) *
            as.numeric(ENDDATE - as.Date("2013-02-01"))), keyby = ID]
}

system.time(r1 <- by_henry()) # 92.53
system.time(r2 <- by_minem()) # 21.73
92.53 / 21.73 # ~4 times faster
all.equal(r1, r2)
# [1] TRUE

Update

And this would be even faster:

by_minem2 <- function(dt = lsr2) {
  d <- as.numeric(as.Date("2013-02-01"))
  dt[, ENDDATE2 := as.numeric(ENDDATE)] # work on numerics to avoid repeated Date arithmetic
  x <- dt[eksd <= d & ENDDATE > d, sum(ENDDATE2 - d), keyby = ID]
  uid <- unique(dt$ID)
  id2 <- setdiff(uid, x$ID) # IDs dropped by the subset
  x2 <- data.table(ID = id2, V1 = 0)
  x <- rbind(x, x2)
  setkey(x, ID)
  x
}

system.time(r2 <- by_minem2()) #0.13
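
If the same computation is needed for many dates (see the comments below), an untested sketch of the same idea would be to pass the date as a parameter and stack the per-date results with rbindlist (by_date is a hypothetical helper, not part of the timings above):

lsr2[, ENDDATE2 := as.numeric(ENDDATE)] # create the numeric helper column once

by_date <- function(dt, day) {
  d <- as.numeric(day)
  x <- dt[eksd <= d & ENDDATE > d, sum(ENDDATE2 - d), keyby = ID]
  id2 <- setdiff(unique(dt$ID), x$ID) # IDs dropped by the subset
  x <- rbind(x, data.table(ID = id2, V1 = 0))
  x[, date := day] # record which timepoint this result belongs to
  setkey(x, ID)
  x[]
}

dates <- as.Date("2013-02-01") + round((365.25 / 12) * (0:10))
res <- rbindlist(lapply(seq_along(dates), function(i) by_date(lsr2, dates[i])))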
minem
  • Very interesting. Perhaps you would consider that I need to repeat this for 200 different dates. Would you then write a for loop? I worked on this for Henry's answer: `time.months <- as.Date("2013-02-01") + (365.25/12)*(0:10); adhlist <- data.table(); d <- as.Date("2013-02-01"); for (d in min(time.months):max(time.months)) { A <- cbind(d, lsr[, sum((eksd <= d & ENDDATE > d) * as.numeric(ENDDATE - d)), keyby = ID]); adhlist <- rbind(adhlist, A) }` – Jakn09ab Jan 19 '18 at 10:38
  • 1
    @Jakn09ab maybe you could create a new question about speeding up that loop... and copy the link here, in comment. – minem Jan 19 '18 at 10:47
  • I just did: https://stackoverflow.com/questions/48339884/building-efficient-for-loop-for-function I'll try to make the function work, but I can't figure it out. – Jakn09ab Jan 19 '18 at 11:16

The problem is that you are removing all rows with ID 2 in the selection process, so there is no group left to aggregate.

As an alternative, you can put the selection inside the sum, for example:

lsr[, sum((eksd <= as.Date("2013-02-01") & ENDDATE > as.Date("2013-02-01")) *
           as.numeric(ENDDATE - as.Date("2013-02-01"))), keyby = ID]

to give

   ID V1
1:  1 64
2:  2  0
3:  3 63
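
Note that this computes as.numeric(ENDDATE - as.Date("2013-02-01")) for every row, not just the selected ones. An untested variant that restricts the date arithmetic to the matching rows in each group, while still returning 0 for groups with no match (the sum over zero elements is 0), would be:

lsr[, {
  keep <- eksd <= as.Date("2013-02-01") & ENDDATE > as.Date("2013-02-01")
  sum(as.numeric(ENDDATE[keep] - as.Date("2013-02-01"))) # sum over matching rows only
}, keyby = ID]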
Henry
  • This is great. It works (though at 70 seconds). Would you mind explaining what the star (*) does in the code? – Jakn09ab Jan 19 '18 at 08:51
  • 1
    @Jakn09ab It is simple multiplication. The expression before the * is TRUE or FALSE, and gets coerced into 1 or 0 in the calculation. The inefficiency here is that `as.numeric(ENDDATE - as.Date("2013-02-01"))` is calculated for the whole data table, not just the selected observations – Henry Jan 19 '18 at 08:56
  • haha. That's so obvious and wonderful. :-) Now I just need to figure out the for loop.. – Jakn09ab Jan 19 '18 at 09:06

The OP has asked how to fill in the missing IDs which were dropped during the previous aggregation.

Without considering the performance issues associated with the OP's aggregation code, one method to complete the IDs is to join with the unique IDs, directly chained with the previous operation:

uid <- sort(unique(lsr$ID))
# OP's code
lsr[eksd <= as.Date("2013-02-01") & ENDDATE > as.Date("2013-02-01"), 
    sum(as.numeric(ENDDATE - as.Date("2013-02-01"))), keyby = ID][
      # chained with join to complete IDs
      .(ID = uid), on = "ID"][is.na(V1), V1 := 0][]
   ID V1
1:  1 64
2:  2  0
3:  3 63
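
Alternatively, the NA-filling step can be done by reference with setnafill (an assumption here: this requires a newer data.table release that provides setnafill); an untested sketch:

res <- lsr[eksd <= as.Date("2013-02-01") & ENDDATE > as.Date("2013-02-01"),
           sum(as.numeric(ENDDATE - as.Date("2013-02-01"))), keyby = ID][
             .(ID = uid), on = "ID"]
setnafill(res, fill = 0, cols = "V1") # replace the NAs of the completed IDs with 0
res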
Uwe