0

In R, I have a set of time intervals, some of which overlap each other, and those overlaps can form chains of overlaps (interval A overlaps B, B overlaps C, but A does not overlap C). I want to find the minimum set of intervals that covers this set of intervals.

I have a solution using lubridate intervals, but it uses older paradigms, such as pushing and popping a stack of intervals. That solution is below. I am wondering if I am missing a simpler functional solution or package that should be doing this for me (I am worried that my code is fragile, and would rather use a tried and tested solution).


suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(lubridate))

# In this table of intervals, rows 3,4, and 5 form a chain of intervals.  They should be rolled into 1 interval.
# And note rows 3 and 5 do not themselves overlap, but they are chained together by having overlap with row 4.
dat <- read.csv(text="
start,end
2017-09-01 00:00,2017-09-01 00:01
2017-09-01 00:02,2017-09-01 00:03
2017-09-01 00:04,2017-09-01 00:08
2017-09-01 00:07,2017-09-01 00:15
2017-09-01 00:09,2017-09-01 00:16
2017-09-01 00:20,2017-09-01 00:22") %>%
  transmute(
    gtStart = ymd_hm(start)
    , gtEnd = ymd_hm(end))


iv_clean <- list()
iv_process <- interval(dat$gtStart, dat$gtEnd)

while(length(iv_process) > 0) {
  e <- iv_process[1]
  iv_process <- iv_process[-1]

  ## If e is last item in iv_process, add it to iv_clean and stop processing
  if (!length(iv_process)) {
    if (!length(iv_clean)) {
      iv_clean <- e
    } else {
      iv_clean <- c(e, iv_clean)
    }
    break
  }

  ## For every remaining interval that overlaps e, union it with e
  ## And trip a flag that says that we found an overlapping interval
  e_nonoverlapping <- TRUE
  for (i in 1:length(iv_process)) {
    if (int_overlaps(e, iv_process[i])) {
      e_nonoverlapping <- FALSE
      iv_process[i] <- union(e, iv_process[i])
    }
  }

  ## If e did not overlap with any interval, then add it to iv_clean
  ## Otherwise, don't, and continue processing iv_process
  if (e_nonoverlapping) {
    if (!length(iv_clean)) {
      iv_clean <- e
    } else {
      iv_clean <- c(e, iv_clean)
    }
  }
}


## Print result
print(iv_clean)
#> [1] 2017-09-01 00:20:00 UTC--2017-09-01 00:22:00 UTC
#> [2] 2017-09-01 00:04:00 UTC--2017-09-01 00:16:00 UTC
#> [3] 2017-09-01 00:02:00 UTC--2017-09-01 00:03:00 UTC
#> [4] 2017-09-01 00:00:00 UTC--2017-09-01 00:01:00 UTC
mpettis
  • 3,222
  • 4
  • 28
  • 35

1 Answers1

1

I would do this recursively/quasi-functionally:

#finds the overlap end points
get_overlap<-function(start, end, dat){
  #which ones start before the base case ends?
  overlap<- which(dat$gtStart < end)

  if(length(overlap) == 1){
    return(list(start = start, end = end ))
  }

  else{

    #if we have more than 1 event in our overlap, find the new end point
    #drop the first row and recurse until we find the end of the interval.
    end<-max(dat[overlap,]$gtEnd)
    return(get_overlap(start, end, dat[-1,]))
  }
}

#walks through the df and find the intervals. assumes the df is sorted as your example.
recur<-function(dat, intervals){
  #base case
  if(nrow(dat) == 0){
    return(intervals)
  }

  start <-dat[1,]$gtStart
  end<- dat[1,]$gtEnd

  indices<-get_overlap(start, end, dat)

  end_row<-which(dat$gtEnd == indices$end)

  intervals[[length(intervals)+1]]<-list(
    start = dat[1,]$gtStart,
    end = indices$end,
    n_events = nrow(dat[1:end_row,]),
    dat = dat[1:end_row,])

  #remove the events from the last interval and recurse
  return(recur(dat[-(1:end_row),], intervals))
}

intervals<-recur(dat, list())

If you have massive data, doing something like this in R is hit or miss. There is a recursion limit, which I believe is defaulted to 5000. If there is something amiss in the code, if will hit that pretty quickly. I think pythons stack depth is 1000, for reference.

You can mess with the recursion limit with options(expressions = <some number>). But be careful here, these things can chew through memory pretty quickly.

Bryan Goggin
  • 2,449
  • 15
  • 17
  • You might even be able to get away with `return(get_overlap(start, end, dat[dat$gtEnd< end,]))` in `get_overlap()`. Should be more efficient. – Bryan Goggin Oct 13 '17 at 01:49
  • Thanks! I'll give it a try. One thing about my version is that there is no recursion, so no chance to blow the stack. But that doesn't mean that recursion isn't appropriate, as the overlaps may be few enough. I was thinking though that there might be packages solving my exact problem, but I'm not googling for them correctly. – mpettis Oct 13 '17 at 15:53