
I am looking for an implementation of union for time intervals which is capable of dealing with unions that are not themselves intervals.

I have noticed that lubridate includes a union function for time intervals, but it always returns a single interval, even when the union is not itself an interval (i.e. it returns the interval running from the earlier start date to the later end date, ignoring any intervening period covered by neither interval):

library(lubridate)
int1 <- new_interval(ymd("2001-01-01"), ymd("2002-01-01"))
int2 <- new_interval(ymd("2003-06-01"), ymd("2004-01-01"))
union(int1, int2)
# Union includes intervening time between intervals.
# [1] 2001-01-01 UTC--2004-01-01 UTC

I have also looked at the interval package, but its documentation makes no reference to union.

My end goal is to use the complex union with %within%:

my_int %within% Reduce(union, list_of_intervals)

So if we consider a concrete example, suppose the list_of_intervals is:

[[1]] 2000-01-01 -- 2001-01-02 
[[2]] 2001-01-01 -- 2004-01-02 
[[3]] 2005-01-01 -- 2006-01-02 

Then my_int <- 2001-01-01 -- 2004-01-01 is %within% the union of list_of_intervals, so it should return TRUE, whereas my_int <- 2003-01-01 -- 2006-01-01 is not (it spans the gap between 2004-01-02 and 2005-01-01), so it should return FALSE.

However, I suspect the complex union has more uses than this.

orizon
  • What is your desired output using the example you gave? – RJ- Feb 15 '13 at 07:48
  • @RJ I have added a concrete example with expected output. – orizon Feb 15 '13 at 08:21
  • @orizon I'm a bit confused by your example: why should you get `TRUE` for the first `my_int` but not for the second one? Isn't the first one outside every interval in your list? – juba Feb 15 '13 at 08:42
  • @juba you are right I shouldn't. I made a mistake in the example. I have edited. The difference between the cases is that the first two intervals overlap but the second two do not. – orizon Feb 15 '13 at 08:47
  • Ultimately I used the [IRanges](http://www.bioconductor.org/packages/release/bioc/html/IRanges.html) package from Bioconductor. This did require some fiddling. – orizon Apr 02 '14 at 07:04

2 Answers


If I understand your question correctly, you'd like to start with a set of potentially overlapping intervals and obtain a list of intervals that represents the UNION of the input set, rather than just the single interval spanning the minimum and maximum of the input set. This is the same question I had.

A similar question was asked at: Union of intervals

... but the accepted response fails with overlapping intervals. However, hosolmaz (I am new to SO, so don't know how to link to this user) posted a modification (in Python) that fixes the issue, which I then converted to R as follows:

library(dplyr) # for %>%, arrange, bind_rows

interval_union <- function(input) {
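  # Nothing to merge for zero or one row (also avoids 2:nrow() counting down).
  if (nrow(input) <= 1) {
    return(input)
  }
  # Sort by start so each interval only needs comparing with the last merged one.
  input <- input %>% arrange(start)
  output <- input[1, ]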
  for (i in 2:nrow(input)) {
    x <- input[i, ]
    if (output$stop[nrow(output)] < x$start) {
      output <- bind_rows(output, x)
    } else if (output$stop[nrow(output)] == x$start) {
      output$stop[nrow(output)] <- x$stop
    }
    if (x$stop > output$stop[nrow(output)]) {
      output$stop[nrow(output)] <- x$stop
    }
  }
  return(output)
}

With your example with overlapping and non-contiguous intervals:

d <- as.data.frame(list(
  start = c('2005-01-01', '2000-01-01', '2001-01-01'),
  stop = c('2006-01-02', '2001-01-02', '2004-01-02')),
  stringsAsFactors = FALSE)

This produces:

> d
       start       stop
1 2005-01-01 2006-01-02
2 2000-01-01 2001-01-02
3 2001-01-01 2004-01-02

> interval_union(d)
       start       stop
1 2000-01-01 2004-01-02
2 2005-01-01 2006-01-02

I am a relative novice to R programming, so if anyone could convert the interval_union() function above to accept as parameters not only the input data frame, but also the names of the 'start' and 'stop' columns to use so the function could be more easily re-usable, that'd be great.
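As a rough sketch of what such a version could look like: the function below takes the column names as strings and uses base R only, so it does not need dplyr. The name `interval_union_cols` and the arguments `start_col` and `stop_col` are made up for illustration, and it assumes the two columns hold values that sort and compare correctly (Date objects, or ISO-formatted date strings as above). Any other columns in a merged row are simply taken from the earliest-starting interval.

```r
# Hypothetical generalisation of interval_union(); the function name and the
# arguments start_col / stop_col are assumptions made for this sketch.
interval_union_cols <- function(input, start_col = "start", stop_col = "stop") {
  if (nrow(input) <= 1) {
    return(input)
  }
  # Sort by the start column (base R, so dplyr is not required here).
  input <- input[order(input[[start_col]]), , drop = FALSE]
  output <- input[1, , drop = FALSE]
  for (i in 2:nrow(input)) {
    x <- input[i, , drop = FALSE]
    last <- nrow(output)
    if (output[[stop_col]][last] < x[[start_col]]) {
      # Gap before x: start a new merged interval.
      output <- rbind(output, x)
    } else if (x[[stop_col]] > output[[stop_col]][last]) {
      # x overlaps or touches the last merged interval: extend it.
      output[[stop_col]][last] <- x[[stop_col]]
    }
  }
  rownames(output) <- NULL
  output
}

# Same intervals as above, but stored as Dates under different column names.
d2 <- data.frame(
  from = as.Date(c("2005-01-01", "2000-01-01", "2001-01-01")),
  to   = as.Date(c("2006-01-02", "2001-01-02", "2004-01-02"))
)
merged <- interval_union_cols(d2, start_col = "from", stop_col = "to")
```

Here `merged` should contain the same two merged intervals as the `interval_union(d)` output above, under the `from` and `to` column names.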

jhchou

Well, in the example you provided, the union of int1 and int2 could be seen simply as a vector containing the two intervals:

int1 <- new_interval(ymd("2001-01-01"), ymd("2002-01-01"))
int2 <- new_interval(ymd("2003-06-01"), ymd("2004-01-01"))
ints <- c(int1,int2)

%within% works on vectors, so you can do something like this:

my_int <- new_interval(ymd("2001-01-01"), ymd("2004-01-01"))
my_int %within% ints
# [1]  TRUE FALSE

So you can check whether your interval is in at least one of the intervals of your list with any:

any(my_int %within% ints)
# [1] TRUE

Your comment is right: the results given by %within% don't seem consistent with the documentation, which says:

If a is an interval, both its start and end dates must fall within b to return TRUE.

Looking at the source code of %within% for the case where a and b are both intervals, it seems to be the following:

setMethod("%within%", signature(a = "Interval", b = "Interval"), function(a,b){
    as.numeric(a@start) - as.numeric(b@start) <= b@.Data & as.numeric(a@start) - as.numeric(b@start) >= 0
})

So it seems that only the starting point of a is tested against b, which is consistent with the results above. Maybe this should be considered a bug and reported?
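For comparison, here is a minimal sketch of a check that tests both endpoints, which is what the documentation describes. It is not lubridate's implementation: `within_both_ends` is a hypothetical helper that works on plain Date vectors, and it treats the intervals as closed at both ends.

```r
# Hypothetical helper, not lubridate's code: TRUE where [a_start, a_end]
# lies entirely inside [b_start, b_end]. b_start and b_end may be vectors,
# giving one result per candidate interval.
within_both_ends <- function(a_start, a_end, b_start, b_end) {
  a_start >= b_start & a_end <= b_end
}

# int1 and int2 from above, as plain start/end Date vectors.
int_start <- as.Date(c("2001-01-01", "2003-06-01"))
int_end   <- as.Date(c("2002-01-01", "2004-01-01"))

# my_int from above: 2001-01-01 -- 2004-01-01.
res <- within_both_ends(as.Date("2001-01-01"), as.Date("2004-01-01"),
                        int_start, int_end)
res
# [1] FALSE FALSE
```

With both endpoints tested, my_int is no longer reported as being within int1, since it ends two years after int1 does.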

juba
  • This does not work when `my_int` is only a subinterval of the union of two or more intervals, and is not a subinterval of any one interval. I am also surprised that `my_int %within% int1` is TRUE, as it is not a subinterval; however, I have verified that behaviour. – orizon Feb 15 '13 at 08:35
  • I'm moderately sure this is a bug in `%within%`, do you want to report it, or should I? – orizon Feb 15 '13 at 08:52
  • @orizon I just updated my answer. I'm not certain this is a bug either, but it looks like one. If you want to report it, don't hesitate! – juba Feb 15 '13 at 08:55
  • @orizon Thanks! And I discovered that, as you mentioned me, I've been automatically subscribed to the issue. Handy :) – juba Feb 15 '13 at 09:09