I've tried the answers to the following questions so far:
- Combining IRanges objects and maintaining mcols
- Find all date ranges for overlapping start and end dates in R
- Find groups of overlapping intervals with data.table
- Finding all overlaps in one iteration of foverlap in R's data.table
- Find dates within a period interval by group
- R Find overlap among time periods
- Detect overlapping dates by group with R
Some of them work, but they don't perform well on very large datasets (8-12 million rows).
Here is some sample code showing what I've been trying:
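
    library(tidyverse)
    library(data.table)

    size <- 10000

    # Sample data: 2,000 IDs, recycled by data.frame() to `size` rows,
    # each row with a random start date and period length in days
    df <- data.frame(
      ID = sample(1:round(size / 5, 0)),
      period = sample(c(5, 10, 30, 45), size, replace = TRUE),
      start = sample(
        seq(as.Date("1999/01/01"), as.Date("2000/01/01"), by = "day"),
        size,
        replace = TRUE
      )
    ) %>%
      mutate(end = start + period)

    # Key on the interval columns (needed by foverlaps) and number the rows
    dt <- data.table(df, key = c("start", "end"))[, row := .I]

    # Self-join on overlapping intervals, keeping only pairs of different
    # rows that share an ID
    overlapping <- unique(foverlaps(dt, dt)[ID == i.ID & row != i.row, ID])

    # Flag IDs with at least one overlap, then compute the share of IDs
    dt[, Overlap := FALSE][ID %in% overlapping, Overlap := TRUE][order(ID, start)] %>%
      distinct(ID, Overlap) %>%
      count(Overlap) %>%
      mutate(freq = n / sum(n))
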
This works fine at this size, but as the dataset grows it either becomes slow or fails with a negative length vector error:

    Error in foverlaps(dt, dt) : negative length vectors are not allowed

Is there a better way to flag IDs with overlapping periods on a dataset of this size?
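
For comparison, here is a minimal sketch of a different technique from the `foverlaps()` self-join above: a per-ID sort followed by a running-maximum scan. The self-join returns every overlapping pair across all IDs before the `ID == i.ID` filter is applied, whereas this scan only compares each row with the latest end date seen so far within its own ID. The helper name `flag_overlaps` and the `toy` data are hypothetical and only for illustration. The sketch assumes the `ID`, `start` and `end` columns from the sample code, treats touching endpoints as overlapping (like the default closed intervals in `foverlaps()`), and is untested at the 8-12 million row scale.

```r
library(data.table)

# Hypothetical helper (not from the code above): returns one row per ID
# with a logical Overlap flag.
flag_overlaps <- function(dt) {
  x <- copy(dt)                 # work on a copy so the caller's table is untouched
  setorder(x, ID, start, end)   # sort each ID's periods by start date
  # Latest end date among all earlier rows of the same ID (NA for the first row)
  x[, prev_end := shift(cummax(as.numeric(end))), by = ID]
  # A row overlaps an earlier one if it starts on or before that latest end
  x[, .(Overlap = any(as.numeric(start) <= prev_end, na.rm = TRUE)), by = ID]
}

# Toy data: ID 1 has two overlapping periods, ID 2 has two disjoint ones
toy <- data.table(
  ID    = c(1, 1, 2, 2),
  start = as.Date(c("1999-01-01", "1999-01-05", "1999-01-01", "1999-03-01")),
  end   = as.Date(c("1999-01-10", "1999-01-20", "1999-01-10", "1999-03-10"))
)

res <- flag_overlaps(toy)
print(res)
```

On the toy data this should flag ID 1 as overlapping and ID 2 as not. The result already has one row per ID, so the share of overlapping IDs could then be taken with `res[, mean(Overlap)]`.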