1

I've tried the various answers so far here:

Some work but are not very performant for very large datasets (8m-12m rows)

Just some sample code of what I've been trying:

library(tidyverse)
library(data.table)

size = 10000

df <- data.frame(
  ID = sample(1:round(size / 5, 0)),
  period = sample(c(5,10,30,45), size, replace = TRUE),
  start = sample(seq(
    as.Date('1999/01/01'), as.Date('2000/01/01'), by = "day"
  ), size, replace = TRUE)
) %>% mutate(end = start + period)


dt <-
  data.table(df, key = c("start", "end"))[, `:=`(row = 1:nrow(df))]

overlapping <-
  unique(foverlaps(dt, dt)[ID == i.ID & row != i.row, ID])
dt[, `:=`(Overlap = FALSE)][ID %in% overlapping, Overlap :=
                                          TRUE][order(ID, start)] %>% 
  distinct(ID,Overlap) %>% 
  count(Overlap) %>% 
  mutate(freq = n/sum(n))

This one works fine but if the dataset gets bigger it's either slow or there is a negative vector error:

Error in foverlaps(dt, dt) : negative length vectors are not allowed

Is there a better way?

SCDCE
  • 1,603
  • 1
  • 15
  • 28

1 Answers1

2

You could directly join by ID in foverlaps and count number of overlaps :

size = 1e5

df <- data.frame(
  ID = sample(1:round(size / 5, 0)),
  period = sample(c(5,10,30,45), size, replace = TRUE),
  start = sample(seq(
    as.Date('1999/01/01'), as.Date('2000/01/01'), by = "day"
  ), size, replace = TRUE)
) %>% mutate(end = start + period)

dt <- data.table(df, key = c("start", "end"))[, `:=`(row = 1:nrow(df))]

setkey(dt,ID,start,end)
foverlaps(dt,dt,by.x=c("ID","start","end"),by.y=c("ID","start","end"))[
         ,.(noverlap=.N),by=.(ID,row)][
         ,.(overlap = max(noverlap>1)),by=ID][
         ,.(n=.N),by=.(overlap)][
         ,pct:=n/sum(n)][]

   Overlap    n   freq
1:   FALSE  547 0.2735
2:    TRUE 1453 0.7265

Performance comparison :

microbenchmark::microbenchmark(old(),new())
Unit: milliseconds
  expr      min       lq      mean   median        uq       max neval
 old() 672.6338 685.8825 788.78851 694.7804 864.95855 1311.9752   100
 new()  16.9942  17.7659  24.66032  18.7095  20.59965   63.3928   100
Waldi
  • 39,242
  • 6
  • 30
  • 78