Finding all overlaps in one iteration of foverlap in R's data.table

Question

I'm trying to merge a bunch of overlapping time periods in R using a data.table. I've got a call to foverlap of the table on itself, which is efficient enough.

My problem is such: say period A overlaps period B, and B overlaps period C, but A does not overlap C. In that case A is not grouped with C, and they will eventually have to be merged.

Currently I've got a while loop finding overlaps and merging until no more merges occur, but that's not exactly scalable. One solution I can see is applying the indices of the groups recursively to itself until it stabilises, but that still looks to need a loop, and I want a completely vectorised solution.

dt = data.table(start = c(1,2,4,6,8,10),
                end   = c(2,3,6,8,10,12))
setkeyv(dt,c("start","end"))

f = foverlaps(dt,
              dt,
              type="any",
              mult="first",
              which="TRUE")

#Needs to return [1,1,3,3,3,3]
print(f)
#1 1 3 3 4 5
print(f[f])
#1 1 3 3 3 4
print(f[f][f])
#1 1 3 3 3 3

Can anyone help me with some ideas on vectorising this procedure?

Edit with IDs:

dt = data.table(id = c('A','A','A','A','A','B','B','B'),
                eventStart = c(1,2,4,6,8,10,11,15),
                eventEnd   = c(2,3,6,8,10,12,14,16))
setkeyv(dt,c("id","eventStart","eventEnd"))

f = foverlaps(dt,
              dt,
              type="any",
              mult="first",
              which="TRUE")

#Needs to return [1 1 3 3 3 6 6 8] or similar

Relevant, though rather complex - https://stackoverflow.com/questions/27574775/is-it-possible-to-use-the-r-data-table-function-foverlaps-to-find-the-intersecti or maybe even https://stackoverflow.com/questions/35006269/how-to-combine-intervals-data-into-fewer-intervals-in-r — thelatemail, Aug 08 '17 at 04:00
"#Needs to return [1,1,3,3,3]" -- shouldn't it have a length of six? Anyway, if your intervals are structured like this example where start >= lag(end), then `dt[, cumsum(start-shift(end, fill=0) > 0)]` would seem to work. — Frank, Aug 08 '17 at 05:58
@Frank thanks for that, I had inadvertently copied an earlier version. — Matt, Aug 08 '17 at 08:04
@Frank thanks that is a fantastic solution, so short! Unfortunately my dataset is just a tiny bit freakier than the example, and some intervals are completely subsumed by others, and this seems to fall over a bit under those conditions :/ — Matt, Aug 08 '17 at 23:41

score 6 · Accepted Answer · edited Jun 20 '20 at 09:12

The IRanges package on Bioconductor from which data.table's foverlaps() was inspired has some handy functions for questions like this.

Perhaps, reduce() might be the function you are looking for to merge all overlapping periods:

library(data.table)
dt = data.table(start = c(1,2,4,6,8,10),
                end   = c(2,3,6,8,10,12))

library(IRanges)
ir <- IRanges(dt$start, dt$end)

ir

IRanges object with 6 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         1         2         2
  [2]         2         3         2
  [3]         4         6         3
  [4]         6         8         3
  [5]         8        10         3
  [6]        10        12         3

reduce(ir, min.gapwidth = 0L)

IRanges object with 2 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         1         3         3
  [2]         4        12         9

as.data.table(reduce(ir, min.gapwidth = 0L))

   start end width
1:     1   3     3
2:     4  12     9

On Bioconductor, there is a comprehensive Introduction to IRanges available.

Edit: The OP has supplied a second sample data set which includes an id column and is asking if IRanges has support for joining intervals by id.

Adding data to IRanges seems to specialize quickly into the area of genome research which is terra incognita to me. However, I've found the following approach using IRanges:

Grouping with `IRanges`

library(data.table)
# 2nd sample data set provided by the OP
dt = data.table(id = c('A','A','A','A','A','B','B','B'),
                eventStart = c(1,2,4,6,8,10,11,15),
                eventEnd   = c(2,3,6,8,10,12,14,16))

library(IRanges)
# set names when constructing IRanges object
ir <- IRanges(dt$eventStart, dt$eventEnd, names = dt$id)

lapply(split(ir, names(ir)), reduce, min.gapwidth = 0L)

$A
IRanges object with 2 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         1         3         3
  [2]         4        10         7

$B
IRanges object with 2 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]        10        14         5
  [2]        15        16         2

Converting this back to data.table leads to a rather clumsy piece of code:

ir <- IRanges(dt$eventStart, dt$eventEnd, names = dt$id)
rbindlist(lapply(split(ir, names(ir)), 
                 function(x) as.data.table(reduce(x, min.gapwidth = 0L))), 
          idcol = "id")

   id start end width
1:  A     1   3     3
2:  A     4  10     7
3:  B    10  14     5
4:  B    15  16     2

Grouping within `data.table`

We can get the same result with a less convoluted code if we group within data.table and apply reduce() on the single chunks:

dt[, as.data.table(reduce(IRanges(eventStart, eventEnd), min.gapwidth = 0L)), id]

Thanks Uwe, that's very helpful. Do you know if IRanges has support for joining intervals by ID (see edit)? — Matt, Aug 08 '17 at 23:26

Finding all overlaps in one iteration of foverlap in R's data.table

1 Answers1

Grouping with `IRanges`

Grouping within `data.table`

Linked

Finding all overlaps in one iteration of foverlap in R's data.table

1 Answers1

Grouping with IRanges

Grouping within data.table

Linked

Grouping with `IRanges`

Grouping within `data.table`