4

I'm trying to merge a bunch of overlapping time periods in R using a data.table. I've got a call to foverlap of the table on itself, which is efficient enough.

My problem is such: say period A overlaps period B, and B overlaps period C, but A does not overlap C. In that case A is not grouped with C, and they will eventually have to be merged.

Currently I've got a while loop finding overlaps and merging until no more merges occur, but that's not exactly scalable. One solution I can see is applying the indices of the groups recursively to itself until it stabilises, but that still looks to need a loop, and I want a completely vectorised solution.

dt = data.table(start = c(1,2,4,6,8,10),
                end   = c(2,3,6,8,10,12))
setkeyv(dt,c("start","end"))

f = foverlaps(dt,
              dt,
              type="any",
              mult="first",
              which="TRUE")

#Needs to return [1,1,3,3,3,3]
print(f)
#1 1 3 3 4 5
print(f[f])
#1 1 3 3 3 4
print(f[f][f])
#1 1 3 3 3 3

Can anyone help me with some ideas on vectorising this procedure?

Edit with IDs:

dt = data.table(id = c('A','A','A','A','A','B','B','B'),
                eventStart = c(1,2,4,6,8,10,11,15),
                eventEnd   = c(2,3,6,8,10,12,14,16))
setkeyv(dt,c("id","eventStart","eventEnd"))

f = foverlaps(dt,
              dt,
              type="any",
              mult="first",
              which="TRUE")

#Needs to return [1 1 3 3 3 6 6 8] or similar
Matt
  • 518
  • 2
  • 5
  • 19
  • Relevant, though rather complex - https://stackoverflow.com/questions/27574775/is-it-possible-to-use-the-r-data-table-function-foverlaps-to-find-the-intersecti or maybe even https://stackoverflow.com/questions/35006269/how-to-combine-intervals-data-into-fewer-intervals-in-r – thelatemail Aug 08 '17 at 04:00
  • 1
    "#Needs to return [1,1,3,3,3]" -- shouldn't it have a length of six? Anyway, if your intervals are structured like this example where start >= lag(end), then `dt[, cumsum(start-shift(end, fill=0) > 0)]` would seem to work. – Frank Aug 08 '17 at 05:58
  • @Frank thanks for that, I had inadvertently copied an earlier version. – Matt Aug 08 '17 at 08:04
  • 1
    @Frank thanks that is a fantastic solution, so short! Unfortunately my dataset is just a tiny bit freakier than the example, and some intervals are completely subsumed by others, and this seems to fall over a bit under those conditions :/ – Matt Aug 08 '17 at 23:41

1 Answers1

6

The IRanges package on Bioconductor from which data.table's foverlaps() was inspired has some handy functions for questions like this.

Perhaps, reduce() might be the function you are looking for to merge all overlapping periods:

library(data.table)
dt = data.table(start = c(1,2,4,6,8,10),
                end   = c(2,3,6,8,10,12))

library(IRanges)
ir <- IRanges(dt$start, dt$end)

ir
IRanges object with 6 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         1         2         2
  [2]         2         3         2
  [3]         4         6         3
  [4]         6         8         3
  [5]         8        10         3
  [6]        10        12         3
reduce(ir, min.gapwidth = 0L)
IRanges object with 2 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         1         3         3
  [2]         4        12         9
as.data.table(reduce(ir, min.gapwidth = 0L))
   start end width
1:     1   3     3
2:     4  12     9

On Bioconductor, there is a comprehensive Introduction to IRanges available.


Edit: The OP has supplied a second sample data set which includes an id column and is asking if IRanges has support for joining intervals by id.

Adding data to IRanges seems to specialize quickly into the area of genome research which is terra incognita to me. However, I've found the following approach using IRanges:

Grouping with IRanges

library(data.table)
# 2nd sample data set provided by the OP
dt = data.table(id = c('A','A','A','A','A','B','B','B'),
                eventStart = c(1,2,4,6,8,10,11,15),
                eventEnd   = c(2,3,6,8,10,12,14,16))

library(IRanges)
# set names when constructing IRanges object
ir <- IRanges(dt$eventStart, dt$eventEnd, names = dt$id)

lapply(split(ir, names(ir)), reduce, min.gapwidth = 0L)
$A
IRanges object with 2 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         1         3         3
  [2]         4        10         7

$B
IRanges object with 2 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]        10        14         5
  [2]        15        16         2

Converting this back to data.table leads to a rather clumsy piece of code:

ir <- IRanges(dt$eventStart, dt$eventEnd, names = dt$id)
rbindlist(lapply(split(ir, names(ir)), 
                 function(x) as.data.table(reduce(x, min.gapwidth = 0L))), 
          idcol = "id")
   id start end width
1:  A     1   3     3
2:  A     4  10     7
3:  B    10  14     5
4:  B    15  16     2

Grouping within data.table

We can get the same result with a less convoluted code if we group within data.table and apply reduce() on the single chunks:

dt[, as.data.table(reduce(IRanges(eventStart, eventEnd), min.gapwidth = 0L)), id]
Community
  • 1
  • 1
Uwe
  • 41,420
  • 11
  • 90
  • 134
  • Thanks Uwe, that's very helpful. Do you know if IRanges has support for joining intervals by ID (see edit)? – Matt Aug 08 '17 at 23:26