R - reshape start-end time intervals to identify overlapping times

Question

Have medication time series data with repeated assessments for multiple subjects of a structure like this (using data.table):

library(data.table)
dt1 = setDT(structure(list(id = c("G", "G", "G", "G", "M", "M", "M", "M", 
  "M", "M", "M"), med = c("mult", "R", "mult", "R", "A", "mult", 
  "A", "C", "A", "Q", "A"), strt = c(19059L, 19061L, 19065L, 19066L, 
  19136L, 19138L, 19142L, 19142L, 19155L, 19246L, 19257L), end = c(19061L, 
  19065L, 19066L, 19101L, 19138L, 19139L, 19172L, 19172L, 19255L, 
  19276L, 19287L)), .Names = c("id", "med", "strt", "end"), row.names = c(NA, 
  -11L), class = "data.frame"))

producing data.table dt1:

    id  med  strt   end
 1:  G mult 19059 19061
 2:  G    R 19061 19065
 3:  G mult 19065 19066
 4:  G    R 19066 19101
 5:  M    A 19136 19138
 6:  M mult 19138 19139
 7:  M    A 19142 19172
 8:  M    C 19142 19172
 9:  M    A 19155 19255
10:  M    Q 19246 19276
11:  M    A 19257 19287

I am trying to reorganize data such that, for each subject, any day on which the patient was on >1 med is recoded as 'mult', and consecutive days of a given medication regimen are represented as a single row.

Thus, the desired result is dt2:

    id  med  strt   end
 1:  G mult 19059 19061
 2:  G    R 19062 19064
 3:  G mult 19065 19066
 4:  G    R 19067 19101
 5:  M    A 19136 19137
 6:  M mult 19138 19139
 7:  M mult 19142 19172
 8:  M    A 19173 19245
 9:  M mult 19246 19255
10:  M    Q 19256 19256
11:  M mult 19257 19276
12:  M    A 19277 19287

I have written the following code that does this, but it is slow and verbose. Could someone help me improve this?

dt2 = dt1[, list(id, med, day=seq(strt,end)), by=1:nrow(dt1)]
setkey(dt2,'id','day')
dt2[, med := ifelse(length(unique(med))>1, 'mult', med), by=list(id,day)]
dt2 = unique(dt2)
medrun <- function(y,z){
  cnt = grp = 1L
  lx = length(y)
  ne = y[-lx] != y[-1L]
  n1 = z[-lx] - z[-1L] != -1
  for(i in seq_along(ne)){if(ne[i] | n1[i])cnt=cnt+1; grp[i+1]=cnt}
  grp
}
dt2[,grp := as.numeric(medrun(med,day)), by=id]
setkey(dt2,'id','grp')
dt2[,strt := min(day), by=list(id,grp)]
dt2[,end := max(day), by=list(id,grp)]
dt2 = unique(dt2)
dt2 = subset(dt2, select = c('id','med','strt','end'))

Dataset is large (>3M rows), so solution needs to be memory efficient and fast. Ideally, would prefer not to expand the intervals into 1 obs/day.

for one thing, dont use `ifelse` (See http://stackoverflow.com/q/16275149/1492421 ) -- you dont need it in `data.table` Also, `setkey` on `dt2` earlier. Lastly, using `subset` on a `data.table` kills much of the point of `data.table` in the first place. — Ricardo Saporta, Sep 28 '14 at 03:59
@RicardoSaporta thanks. i edited a bit and will do more shortly. i kind of dashed out my solution and i know it has various inefficiencies. and while i do appreciate the general pointers (eg, didn't know that about `ifelse` in `data.table`), i am less interested in incremental improvement than in finding a fundamentally different and more efficient approach that doesn't require blowing out the intervals. — dnaiel, Sep 28 '14 at 04:12

Ricardo Saporta · Accepted Answer · 2014-09-29T04:17:43.603

1

I would expand by day, set key, then tally accordingly

DT_meds <- DT1[, list(day = if (.N ==1) seq(from=strt, to=end) else unlist(mapply(seq, strt, end))), keyby=list(id, med)]
setkey(DT_meds, id, day, med)
DT_meds[, med := if (length(unique(med)) > 1) "mult" else med, by=list(id, day)]
DT_meds[, grp := cumsum (c(FALSE, diff(day) > 1)), by=list(id, med) ]

DT_results <- DT_meds[, list(str=day[1L], end=day[.N]), by=list(id, med, grp)]
DT_results[, grp := NULL]

DT_results
#     id  med   str   end
#  1:  G mult 19059 19061
#  2:  G    R 19062 19064
#  3:  G mult 19065 19066
#  4:  G    R 19067 19101
#  5:  M    A 19136 19137
#  6:  M mult 19138 19139
#  7:  M mult 19142 19172
#  8:  M    A 19173 19245
#  9:  M mult 19246 19255
# 10:  M    Q 19256 19256
# 11:  M mult 19257 19276
# 12:  M    A 19277 19287

edited Sep 29 '14 at 04:17

answered Sep 28 '14 at 09:56

Ricardo Saporta

54,400
17
144
178

doesn't run. i think `kCols` needs to be defined? – dnaiel Sep 29 '14 at 01:20
after changing `kCols` to `list(id,med)` it runs correctly, so i edited your answer to fix this. i upvoted, but will keep question open for now in hope of an answer that doesn't require expansion by day. – dnaiel Sep 29 '14 at 01:34

R - reshape start-end time intervals to identify overlapping times

1 Answers1