Benchmarks along with an even faster data.table solution
First, I echo @enmyj and @zach: the solution in the accepted answer gives erroneous results when one range lies completely inside another.
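For concreteness, here is a minimal input of that shape (a made-up illustration, not data from the question): the second range for ID 1 lies entirely inside the first and should simply be absorbed by it.

library(data.table)

# illustrative data: ID 1's second range is nested inside its first
nested <- data.table(
  ID    = c(1L, 1L, 1L, 2L),
  start = as.Date(c("2020-01-01", "2020-01-10", "2020-02-01", "2020-01-05")),
  end   = as.Date(c("2020-01-31", "2020-01-20", "2020-02-15", "2020-01-08"))
)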
Here is a faster approach, reminiscent of the one proposed in the accepted answer:
- Sort by ID and then by all dates (start and end combined).
- Take the cumulative count of start dates minus the cumulative count of end dates, i.e., a running sum of +1 for each start and -1 for each end.
- Find the indices where this running sum is 0. The dates on those rows are the end dates of each collection of overlapping date ranges, and the dates on the following rows are the start dates of the next collection. The same indices also make it easy to perform roll-up calculations on other columns (see the walkthrough below and the aggregation example at the end).
This involves just a few vectorized calls and no grouping operations, so it is very performant.
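To make the three steps concrete, here is a small sketch on the nested table defined above (the balance column exists only for this illustration; the function below does the same thing with which(!cumsum(V3))):

# long format: one row per date, +1 for a start, -1 for an end
long <- nested[, rbindlist(.(.(ID, start, 1L), .(ID, end, -1L)))]
setorder(long, V1, V2)          # step 1: sort by ID, then by date
long[, balance := cumsum(V3)]   # step 2: running count of open ranges
long[balance == 0L]             # step 3: these rows close each merged range
# -> 2020-01-31 and 2020-02-15 for ID 1, 2020-01-08 for ID 2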
As a function:
flatten <- function(dt) {
  # long format: +1 for each start date, -1 for each end date, ordered by ID then date
  setorder(dt[, rbindlist(.(.(ID, start, 1L), .(ID, end, -1L)))], V1, V2)[
    , .(
      # rows where the running sum hits 0 close a merged range
      ID = V1[i <- which(!cumsum(V3))],
      start = V2[c(1L, i[-length(i)] + 1L)],  # the first date and each date right after a closing row
      end = V2[i]
    )
  ]
}
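Applied to the small nested example from above, the fully contained range is absorbed as intended (the expected result; exact print formatting may vary by data.table version):

flatten(nested)
#>    ID      start        end
#> 1:  1 2020-01-01 2020-01-31
#> 2:  1 2020-02-01 2020-02-15
#> 3:  2 2020-01-05 2020-01-08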
Benchmarks
The benchmarking uses a large-ish data.table (100,000 rows across 1,000 IDs).
library(data.table)
library(dplyr)
library(ivs)
data <- data.table(
  ID = sample(1e3, 1e5, replace = TRUE),
  start = as.Date(sample(1e4:2e4, 1e5, replace = TRUE), origin = "1970-01-01")
)[, end := start + sample(100)]  # durations of 1 to 100 days, recycled across rows
fCum <- function(dt) {
# adapted from https://stackoverflow.com/a/47337684/9463489
dt %>%
arrange(ID, start) %>%
group_by(ID) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
cummax(as.numeric(end)))[-n()])) %>%
group_by(ID, indx) %>%
reframe(start = min(start), end = max(end)) %>%
select(-indx)
}
fivs <- function(dt) {
# adapted from https://stackoverflow.com/a/71754454/9463489
dt %>%
mutate(interval = iv(start, end), .keep = "unused") %>%
group_by(ID) %>%
reframe(interval = iv_groups(interval)) %>%
mutate(start = iv_start(interval), end = iv_end(interval)) %>%
select(-interval)
}
squish <- function(dt) {
# adapted from https://stackoverflow.com/a/53890653/9463489
setkey(dt, ID, start, end)
dt[,.(START_DT = start,
END_DT = end,
indx = c(0, cumsum(as.numeric(lead(start)) > cummax(as.numeric(end)))[-.N])),
keyby=ID
][,.(start=min(START_DT),
end = max(END_DT)),
by=c("ID","indx")
][, indx := NULL]
}
Timings:
microbenchmark::microbenchmark(
flatten = flatten(dt),
fCum = setDT(fCum(dt)),
fivs = setDT(fivs(dt)),
squish = squish(dt),
times = 10,
check = "equal",
setup = {dt <- copy(data)}
)
#> Unit: milliseconds
#>     expr       min        lq       mean     median        uq       max neval
#>  flatten   11.4732   11.8141   13.86760   12.36580   15.9228   19.1775    10
#>     fCum 1827.1197 1876.7701 1898.24285 1908.88640 1926.6548 1939.2919    10
#>     fivs  160.2568  163.9617  173.31783  173.32095  177.3789  192.7755    10
#>   squish   62.5197   64.9126   66.26047   65.08515   67.1685   70.9916    10
Aggregating other columns
The approach used by flatten also makes it easy to aggregate other columns in the data.table.
data[, v := runif(1e5)]

# attach v to the end-date rows (V4 is 0 on start rows and v on end rows), then sum it
# per flattened range by differencing the cumulative sum at each range-closing index
setorder(data[, rbindlist(.(.(ID, start, 1L, 0), .(ID, end, -1L, v)))], V1, V2)[
  , .(
    ID = V1[i <- which(!cumsum(V3))],
    start = V2[c(1L, i[-length(i)] + 1L)],
    end = V2[i],
    v = diff(c(0, cumsum(V4)[i]))
  )
]
#>            ID      start        end          v
#>     1:    1 1997-09-25 1997-09-27 0.40898255
#>     2:    1 1997-11-09 1997-11-30 0.44067634
#>     3:    1 1998-04-27 1998-07-17 1.73142460
#>     4:    1 1999-08-05 1999-11-05 0.41103832
#>     5:    1 1999-12-09 2000-01-26 0.90639735
#>    ---
#> 60286: 1000 2023-01-06 2023-03-28 0.54727106
#> 60287: 1000 2023-07-20 2023-10-16 1.74270130
#> 60288: 1000 2024-03-24 2024-06-23 0.07110824
#> 60289: 1000 2024-07-13 2024-07-31 0.63888263
#> 60290: 1000 2024-10-02 2024-10-19 0.22872167
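As a quick sanity check on the roll-up (my addition, with flat_v as an arbitrary name for the aggregated result above): the per-range sums of v must add back up to the grand total, because each row's v is attached to its end marker and every end marker falls inside exactly one flattened range.

# flat_v holds the aggregated result printed above
flat_v <- setorder(data[, rbindlist(.(.(ID, start, 1L, 0), .(ID, end, -1L, v)))], V1, V2)[
  , .(
    ID = V1[i <- which(!cumsum(V3))],
    start = V2[c(1L, i[-length(i)] + 1L)],
    end = V2[i],
    v = diff(c(0, cumsum(V4)[i]))
  )
]
all.equal(flat_v[, sum(v)], data[, sum(v)])
#> [1] TRUE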