Find the duration of overlap between pairs of timestamps using data.table

Question

Similar to this question, I'd like to find the duration of overlap between pairs of timestamps using data.table.

Here's my current code:

library(data.table)

DT <- fread(
  "stage,ID,date1,date2
  1,A,2018-04-17 00:00:00,2018-04-17 01:00:00
  1,B,2018-04-17 00:00:00,2018-04-17 00:20:00
  1,C,2018-04-17 00:15:00,2018-04-17 01:00:00
  2,B,2018-04-17 00:30:00,2018-04-17 01:10:00
  2,D,2018-04-17 00:30:00,2018-04-17 00:50:00",
  sep = ","
)

cols <- c("date1", "date2")
DT[, (cols) := lapply(.SD, as.POSIXct), .SDcols = cols]

breaks <- DT[, {
  tmp <- unique(sort(c(date1, date2)))
  .(start = head(tmp, -1L), end = tail(tmp, -1L))
}, by = stage]

result <- DT[breaks, on = .(stage, date1 <= start, date2 >= end), paste(ID, collapse = "+"),  
    by = .EACHI, allow.cartesian = T] %>% 
  mutate(lengthinseconds = as.numeric(difftime(date2, date1, units = "secs")))

Which returns:

  stage               date1               date2    V1 lengthinseconds
1     1 2018-04-17 00:00:00 2018-04-17 00:15:00   B+A             900
2     1 2018-04-17 00:15:00 2018-04-17 00:20:00 B+A+C             300
3     1 2018-04-17 00:20:00 2018-04-17 01:00:00   A+C            2400
4     2 2018-04-17 00:30:00 2018-04-17 00:50:00   D+B            1200
5     2 2018-04-17 00:50:00 2018-04-17 01:10:00     B            1200

But I'd like to return only overlaps between user dyads (i.e. no more than two overlapping users). There are several hacky ways I can think of achieve this, such as:

library(dplyr)
library(tidyr)

result %>% 
  filter(nchar(V1)==3) %>% 
  tidyr::separate(V1, c("ID1", "ID2"))

Which returns:

  stage               date1               date2 ID1 ID2 lengthinseconds
1     1 2018-04-17 00:00:00 2018-04-17 00:15:00   B   A             900
2     1 2018-04-17 00:20:00 2018-04-17 01:00:00   A   C            2400
3     2 2018-04-17 00:30:00 2018-04-17 00:50:00   D   B            1200

But this seems inelegant, especially when dealing with longer ID strings and potentially hundreds of IDs per overlap.

Ideally, I'd like to know if there's a way to modify the original data.table code to return this directly.

Btw, your title and first sentence mention duration but it seems unrelated to your output. Also, you might want to sort before pasting the IDs for more predictable output.. — Frank, Apr 17 '18 at 14:35
Good point, I've added the calculation of duration to my question. Yes I've been sorting the two columns `ID1` and `ID2` alphabetically within each row further downstream but seems easier to sort before the `paste` command. — jogall, Apr 17 '18 at 14:46

score 3 · Answer 1 · answered Apr 17 '18 at 14:41

Another possibility:

DT[breaks, on = .(stage, date1 <= start, date2 >= end)
   ][, if (uniqueN(ID) == 2) .SD, by = .(stage, date1, date2)
     ][, dcast(.SD, stage + date1 + date2 ~ rowid(date1, prefix = 'ID'), value.var = 'ID')
       ][, lengthinseconds := as.numeric(difftime(date2, date1, units = "secs"))][]

which gives:

   stage               date1               date2 ID1 ID2 lengthinseconds
1:     1 2018-04-17 00:00:00 2018-04-17 00:15:00   B   A             900
2:     1 2018-04-17 00:20:00 2018-04-17 01:00:00   A   C            2400
3:     2 2018-04-17 00:30:00 2018-04-17 00:50:00   D   B            1200

Thank you @Jaap: it's useful to see other solutions to the same problem, plus an implementation of `dcast`. Seems to be plenty of ways to skin the cat using `data.table`! — jogall, Apr 17 '18 at 14:51

Uwe · Accepted Answer · 2018-04-17T14:53:16.533

At first glance (and neglecting performance considerations), this requires only a minor modification to OP's code:

result <- DT[breaks, on = .(stage, date1 <= start, date2 >= end), 
             if (.N == 2L) paste(ID, collapse = "+"),  
             by = .EACHI, allow.cartesian = TRUE]
result

   stage               date1               date2  V1
1:     1 2018-04-17 00:00:00 2018-04-17 00:15:00 B+A
2:     1 2018-04-17 00:20:00 2018-04-17 01:00:00 A+C
3:     2 2018-04-17 00:30:00 2018-04-17 00:50:00 D+B

Only for those groups, i.e., time ranges, where exactly two users are active a result row will be created.

The OP has requested to show the two IDs in separate columns plus to show the duration of the overlap. In addition, I suggest to have the IDs sorted.

result <- DT[breaks, on = .(stage, date1 <= start, date2 >= end), 
   if (.N == 2L) {
     tmp <- sort(ID)
     .(ID1 = tmp[1], ID2 = tmp[2], dur.in.sec = difftime(end, start, units = "secs"))
     },  
   by = .EACHI, allow.cartesian = TRUE]
result

   stage               date1               date2 ID1 ID2 dur.in.sec
1:     1 2018-04-17 00:00:00 2018-04-17 00:15:00   A   B   900 secs
2:     1 2018-04-17 00:20:00 2018-04-17 01:00:00   A   C  2400 secs
3:     2 2018-04-17 00:30:00 2018-04-17 00:50:00   B   D  1200 secs

Thank @Uwe, this is great! Good motivation for me to learn more `data.table` and stop relying on only `dplyr` :) — jogall, Apr 17 '18 at 14:49

Find the duration of overlap between pairs of timestamps using data.table

2 Answers2

Linked