3

Similar to this question, I'd like to find the duration of overlap between pairs of timestamps using data.table.

Here's my current code:

library(data.table)

DT <- fread(
  "stage,ID,date1,date2
  1,A,2018-04-17 00:00:00,2018-04-17 01:00:00
  1,B,2018-04-17 00:00:00,2018-04-17 00:20:00
  1,C,2018-04-17 00:15:00,2018-04-17 01:00:00
  2,B,2018-04-17 00:30:00,2018-04-17 01:10:00
  2,D,2018-04-17 00:30:00,2018-04-17 00:50:00",
  sep = ","
)

cols <- c("date1", "date2")
DT[, (cols) := lapply(.SD, as.POSIXct), .SDcols = cols]

breaks <- DT[, {
  tmp <- unique(sort(c(date1, date2)))
  .(start = head(tmp, -1L), end = tail(tmp, -1L))
}, by = stage]

result <- DT[breaks, on = .(stage, date1 <= start, date2 >= end), paste(ID, collapse = "+"),  
    by = .EACHI, allow.cartesian = T] %>% 
  mutate(lengthinseconds = as.numeric(difftime(date2, date1, units = "secs")))

Which returns:

  stage               date1               date2    V1 lengthinseconds
1     1 2018-04-17 00:00:00 2018-04-17 00:15:00   B+A             900
2     1 2018-04-17 00:15:00 2018-04-17 00:20:00 B+A+C             300
3     1 2018-04-17 00:20:00 2018-04-17 01:00:00   A+C            2400
4     2 2018-04-17 00:30:00 2018-04-17 00:50:00   D+B            1200
5     2 2018-04-17 00:50:00 2018-04-17 01:10:00     B            1200

But I'd like to return only overlaps between user dyads (i.e. no more than two overlapping users). There are several hacky ways I can think of achieve this, such as:

library(dplyr)
library(tidyr)

result %>% 
  filter(nchar(V1)==3) %>% 
  tidyr::separate(V1, c("ID1", "ID2"))

Which returns:

  stage               date1               date2 ID1 ID2 lengthinseconds
1     1 2018-04-17 00:00:00 2018-04-17 00:15:00   B   A             900
2     1 2018-04-17 00:20:00 2018-04-17 01:00:00   A   C            2400
3     2 2018-04-17 00:30:00 2018-04-17 00:50:00   D   B            1200

But this seems inelegant, especially when dealing with longer ID strings and potentially hundreds of IDs per overlap.

Ideally, I'd like to know if there's a way to modify the original data.table code to return this directly.

jogall
  • 651
  • 6
  • 21
  • 1
    Btw, your title and first sentence mention duration but it seems unrelated to your output. Also, you might want to sort before pasting the IDs for more predictable output.. – Frank Apr 17 '18 at 14:35
  • 1
    Good point, I've added the calculation of duration to my question. Yes I've been sorting the two columns `ID1` and `ID2` alphabetically within each row further downstream but seems easier to sort before the `paste` command. – jogall Apr 17 '18 at 14:46

2 Answers2

3

Another possibility:

DT[breaks, on = .(stage, date1 <= start, date2 >= end)
   ][, if (uniqueN(ID) == 2) .SD, by = .(stage, date1, date2)
     ][, dcast(.SD, stage + date1 + date2 ~ rowid(date1, prefix = 'ID'), value.var = 'ID')
       ][, lengthinseconds := as.numeric(difftime(date2, date1, units = "secs"))][]

which gives:

   stage               date1               date2 ID1 ID2 lengthinseconds
1:     1 2018-04-17 00:00:00 2018-04-17 00:15:00   B   A             900
2:     1 2018-04-17 00:20:00 2018-04-17 01:00:00   A   C            2400
3:     2 2018-04-17 00:30:00 2018-04-17 00:50:00   D   B            1200
Jaap
  • 81,064
  • 34
  • 182
  • 193
  • 1
    Thank you @Jaap: it's useful to see other solutions to the same problem, plus an implementation of `dcast`. Seems to be plenty of ways to skin the cat using `data.table`! – jogall Apr 17 '18 at 14:51
2

At first glance (and neglecting performance considerations), this requires only a minor modification to OP's code:

result <- DT[breaks, on = .(stage, date1 <= start, date2 >= end), 
             if (.N == 2L) paste(ID, collapse = "+"),  
             by = .EACHI, allow.cartesian = TRUE]
result
   stage               date1               date2  V1
1:     1 2018-04-17 00:00:00 2018-04-17 00:15:00 B+A
2:     1 2018-04-17 00:20:00 2018-04-17 01:00:00 A+C
3:     2 2018-04-17 00:30:00 2018-04-17 00:50:00 D+B

Only for those groups, i.e., time ranges, where exactly two users are active a result row will be created.


The OP has requested to show the two IDs in separate columns plus to show the duration of the overlap. In addition, I suggest to have the IDs sorted.

result <- DT[breaks, on = .(stage, date1 <= start, date2 >= end), 
   if (.N == 2L) {
     tmp <- sort(ID)
     .(ID1 = tmp[1], ID2 = tmp[2], dur.in.sec = difftime(end, start, units = "secs"))
     },  
   by = .EACHI, allow.cartesian = TRUE]
result
   stage               date1               date2 ID1 ID2 dur.in.sec
1:     1 2018-04-17 00:00:00 2018-04-17 00:15:00   A   B   900 secs
2:     1 2018-04-17 00:20:00 2018-04-17 01:00:00   A   C  2400 secs
3:     2 2018-04-17 00:30:00 2018-04-17 00:50:00   B   D  1200 secs
Uwe
  • 41,420
  • 11
  • 90
  • 134
  • Thank @Uwe, this is great! Good motivation for me to learn more `data.table` and stop relying on only `dplyr` :) – jogall Apr 17 '18 at 14:49