15

I would like to use foverlaps to find the intersecting ranges of two bed files, and collapse any rows containing overlapping ranges into a single row. In the example below I have two tables with genomic ranges. The tables are called "bed" files that have zero-based start coordinates and one-based ending positions of features in chromosomes. For example, START=9, STOP=20 is interpreted to span bases 10 through 20, inclusive. These bed files can contain millions of rows. The solution would need to give the same result, regardless of the order in which the two files to be intersected are provided.

First Table

> table1
   CHROMOSOME START STOP
1:          1     1   10
2:          1    20   50
3:          1    70  130
4:          X     1   20
5:          Y     5  200

Second Table

> table2
   CHROMOSOME START STOP
1:          1     5   12
2:          1    15   55
3:          1    60   65
4:          1   100  110
5:          1   130  131
6:          X    60   80
7:          Y     1   15
8:          Y    10   50

I was thinking that the new foverlaps function could be a very fast way to find the intersecting ranges in these two table to produce a table that would look like:

Result Table:

> resultTable
   CHROMOSOME START STOP
1:          1     5   10
2:          1    20   50
3:          1   100  110
4:          Y     5   50  

Is that possible, or is there a better way to do that in data.table?

I'd also like to first confirm that in one table, for any given CHROMOSOME, the STOP coordinate does not overlap with the start coordinate of the next row. For example, CHROMOSOME Y:1-15 and CHROMOSOME Y:10-50 would need to be collapsed to CHROMOSOME Y:1-50 (see Second Table Rows 7 and 8). This should not be the case, but the function should probably check for that. A real life example of how potential overlaps should be collapsed is below:

   CHROM  START   STOP
1:     1 721281 721619
2:     1 721430 721906
3:     1 721751 722042

Desired output:

   CHROM  START   STOP
1:     1 721281 722042

Functions to create example tables are as follows:

table1 <- data.table(
   CHROMOSOME = as.character(c("1","1","1","X","Y")) ,
   START = c(1,20,70,1,5) ,
   STOP = c(10,50,130,20,200)
)

table2 <- data.table(
   CHROMOSOME = as.character(c("1","1","1","1","1","X","Y","Y")) ,
   START = c(5,15,60,100,130,60,1,10) ,
   STOP = c(12,55,65,110,131,80,15,50)
 )
zx8754
  • 52,746
  • 12
  • 114
  • 209
Pete
  • 323
  • 2
  • 12
  • 1
    @Arun the real data set has ~4 million rows in table 1 and ~230 thousand in table 2. – Pete Dec 20 '14 at 10:51

5 Answers5

10

@Seth provided the fastest way to solve the problem of intersection overlaps using the data.table foverlaps function. However, this solution did not take into account the fact that the input bed files may have overlapping ranges that needed to be reduced into single regions. @Martin Morgan solved that with his solution using the GenomicRanges package, that did both the intersecting and range reducing. However, Martin's solution didn't use the foverlaps function. @Arun pointed out that the overlapping ranges in different rows within a table was not currently possible using foverlaps. Thanks to the answers provided, and some additional research on stackoverflow, I came up with this hybrid solution.

Create example BED files without overlapping regions within each file.

chr <- c(1:22,"X","Y","MT")

#bedA contains 5 million rows
bedA <- data.table(
    CHROM = as.vector(sapply(chr, function(x) rep(x,200000))),
    START = rep(as.integer(seq(1,200000000,1000)),25),
    STOP = rep(as.integer(seq(500,200000000,1000)),25),
    key = c("CHROM","START","STOP")
    )

#bedB contains 500 thousand rows
bedB <- data.table(
  CHROM = as.vector(sapply(chr, function(x) rep(x,20000))),
  START = rep(as.integer(seq(200,200000000,10000)),25),
  STOP = rep(as.integer(seq(600,200000000,10000)),25),
  key = c("CHROM","START","STOP")
)

Now create a new bed file containing the intersecting regions in bedA and bedB.

#This solution uses foverlaps
system.time(tmpA <- intersectBedFiles.foverlaps(bedA,bedB))

user  system elapsed 
1.25    0.02    1.37 

#This solution uses GenomicRanges
system.time(tmpB <- intersectBedFiles.GR(bedA,bedB))

user  system elapsed 
12.95    0.06   13.04 

identical(tmpA,tmpB)
[1] TRUE

Now, modify bedA and bedB such that they contain overlapping regions:

#Create overlapping ranges
makeOverlaps <-  as.integer(c(0,0,600,0,0,0,600,0,0,0))
bedC <- bedA[, STOP := STOP + makeOverlaps, by=CHROM]
bedD <- bedB[, STOP := STOP + makeOverlaps, by=CHROM]

Test time to intersect bed files with overlapping ranges using either the foverlaps or GenomicRanges fucntions.

#This solution uses foverlaps to find the intersection and then run GenomicRanges on the result
system.time(tmpC <- intersectBedFiles.foverlaps(bedC,bedD))

user  system elapsed 
1.83    0.05    1.89 

#This solution uses GenomicRanges
system.time(tmpD <- intersectBedFiles.GR(bedC,bedD))

user  system elapsed 
12.95    0.04   12.99 

identical(tmpC,tmpD)
[1] TRUE

The winner: foverlaps!

FUNCTIONS USED

This is the function based upon foverlaps, and will only call the GenomicRanges function (reduceBed.GenomicRanges) if there are overlapping ranges (which are checked for using the rowShift function).

intersectBedFiles.foverlaps <- function(bed1,bed2) {
  require(data.table)
  bedKey <- c("CHROM","START","STOP")
  if(nrow(bed1)>nrow(bed2)) {
    bed <- foverlaps(bed1, bed2, nomatch = 0)
  } else {
    bed <- foverlaps(bed2, bed1, nomatch = 0)
  }
  bed[, START := pmax(START, i.START)]
  bed[, STOP := pmin(STOP, i.STOP)]
  bed[, `:=`(i.START = NULL, i.STOP = NULL)]
  if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
  if(any(bed[, STOP+1 >= rowShift(START), by=CHROM][,V1], na.rm = T)) {
    bed <- reduceBed.GenomicRanges(bed)
  }
  return(bed)
}

rowShift <- function(x, shiftLen = 1L) {
  #Note this function was described in this thread:
  #http://stackoverflow.com/questions/14689424/use-a-value-from-the-previous-row-in-an-r-data-table-calculation
  r <- (1L + shiftLen):(length(x) + shiftLen)
  r[r<1] <- NA
  return(x[r])
}

reduceBed.GenomicRanges <- function(bed) {
  setnames(bed,colnames(bed),bedKey)
  if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
  grBed <- makeGRangesFromDataFrame(bed,
    seqnames.field = "CHROM",start.field="START",end.field="STOP")
  grBed <- reduce(grBed)
  grBed <- data.table(
    CHROM=as.character(seqnames(grBed)),
    START=start(grBed),
    STOP=end(grBed),
    key = c("CHROM","START","STOP"))
  return(grBed)
}

This function strictly used the GenomicRanges package, produces the same result, but is about 10 fold slower that the foverlaps funciton.

intersectBedFiles.GR <- function(bed1,bed2) {
  require(data.table)
  require(GenomicRanges)
  bed1 <- makeGRangesFromDataFrame(bed1,
    seqnames.field = "CHROM",start.field="START",end.field="STOP")
  bed2 <- makeGRangesFromDataFrame(bed2,
    seqnames.field = "CHROM",start.field="START",end.field="STOP")
  grMerge <- suppressWarnings(intersect(bed1,bed2))
  resultTable <- data.table(
    CHROM=as.character(seqnames(grMerge)),
    START=start(grMerge),
    STOP=end(grMerge),
    key = c("CHROM","START","STOP"))
  return(resultTable)
}

An additional comparison using IRanges

I found a solution to collapse overlapping regions using IRanges but it is more than 10 fold slower than GenomicRanges.

reduceBed.IRanges <- function(bed) {
  bed.tmp <- bed
  bed.tmp[,group := { 
      ir <-  IRanges(START, STOP);
      subjectHits(findOverlaps(ir, reduce(ir)))
    }, by=CHROM]
  bed.tmp <- bed.tmp[, list(CHROM=unique(CHROM), 
              START=min(START), 
              STOP=max(STOP)),
       by=list(group,CHROM)]
  setkeyv(bed.tmp,bedKey)
  bed[,group := NULL]
  return(bed.tmp[, -(1:2)])
}


system.time(bedC.reduced <- reduceBed.GenomicRanges(bedC))

user  system elapsed 
10.86    0.01   10.89 

system.time(bedD.reduced <- reduceBed.IRanges(bedC))

user  system elapsed 
137.12    0.14  137.58 

identical(bedC.reduced,bedD.reduced)
[1] TRUE
MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
Pete
  • 323
  • 2
  • 12
3

foverlaps() will do nicely.

First set the keys for both of the tables:

setkey(table1, CHROMOSOME, START, STOP)
setkey(table2, CHROMOSOME, START, STOP)

Now join them using foverlaps() with nomatch = 0 to drop unmatched rows in table2.

resultTable <- foverlaps(table1, table2, nomatch = 0)

Next choose the appropriate values for START and STOP, and drop the extra columns.

resultTable[, START := pmax(START, i.START)]
resultTable[, STOP := pmin(STOP, i.STOP)]
resultTable[, `:=`(i.START = NULL, i.STOP = NULL)]

The overlapping STOP to a future START should be a different question. It's actually one that I have, so maybe I'll ask it and come back to it here when I have a good answer.

2

In case you're not stuck on a data.table solution, GenomicRanges

source("http://bioconductor.org/biocLite.R")
biocLite("GenomicRanges")

gives

> library(GenomicRanges)
> intersect(makeGRangesFromDataFrame(table1), makeGRangesFromDataFrame(table2))
GRanges object with 5 ranges and 0 metadata columns:
      seqnames     ranges strand
         <Rle>  <IRanges>  <Rle>
  [1]        1 [  5,  10]      *
  [2]        1 [ 20,  50]      *
  [3]        1 [100, 110]      *
  [4]        1 [130, 130]      *
  [5]        Y [  5,  50]      *
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
Martin Morgan
  • 45,935
  • 7
  • 84
  • 112
  • So this does the intersection exactly as needed AND merges the overlapping ranges. It takes a few additional steps to convert the output back into a data.table. With big bed files though it is taking about 8 seconds to run, the foverlaps suggested answer by @Arun about 0.5 seconds but it doesn't collapse the overlaps. – Pete Dec 20 '14 at 13:29
  • There are recent improvements in the 'devel' version that boost performance and reduce memory consumption. Lots of genomics operations are natural in GenomicRanges & friends, e.g., rtracklayer::import("your.bed") so maybe you'll want to linger... – Martin Morgan Dec 20 '14 at 13:53
2

In most overlapping ranges problems in genomics, we have one large data set x (usually sequenced reads) and another smaller data set y (usually the gene model, exons, introns etc.). We are tasked with finding which intervals in x overlap with which intervals in y or how many intervals in x overlap for each y interval.

In foverlaps(), we don't have to setkey() on the larger data set x - it's quite an expensive operation. But y needs to have it's key set. For your case, from this example it seems like table2 is larger = x, and table1 = y.

require(data.table)
setkey(table1) # key columns = chr, start, end
ans = foverlaps(table2, table1, type="any", nomatch=0L)
ans[, `:=`(i.START = pmax(START, i.START), 
           i.STOP = pmin(STOP, i.STOP))]

ans = ans[, .(i.START[1L], i.STOP[.N]), by=.(CHROMOSOME, START, STOP)]
#    CHROMOSOME START STOP  V1  V2
# 1:          1     1   10   5  10
# 2:          1    20   50  20  50
# 3:          1    70  130 100 130
# 4:          Y     5  200   5  50

But I agree it'd be great to be able to do this in one step. Not sure how yet, but maybe using additional values reduce and intersect for mult= argument.

Arun
  • 116,683
  • 26
  • 284
  • 387
  • I'm not familiar with the use of "." in front of the list (e.g. ".(CHROMOSOME, START, STOP)") - what is it's purpose. The function does appear to aggregate the overlapping regions on CHROMOSOME Y. However, if I swap table1 and table2 I get a different answer. I'd like it to be able to work either way - either table could be longer. – Pete Dec 20 '14 at 11:49
  • @Pete, `foverlaps()` is designed to find / merge by overlapping ranges. What you're asking for, while can be obtained using `foverlaps()`, won't be straightforward (yet). We've to figure out how best to do this - either provide functionality in `foverlaps()` for obtaining intersecting ranges in case of multiple overlaps or provide another function like GenomicRanges does. I've not given much thought yet, and I won't be able to work on it anytime soon. – Arun Dec 20 '14 at 14:12
  • @Pete - similarly curious, I searched the data.table-package and learned obliquely that '.() is an alias for list()'. This might be worth a line or two of its own, Arun... – malcook Aug 24 '16 at 14:47
  • @arun Any updates on whether there's a way to reduce intervals natively in data.table? I came up with one solution but it doesn't seem to be that efficient (see answer below) – Michael Mar 26 '19 at 23:36
0

Here's a solution entirely in data.table based on Pete's answer. It's actually slower than his solution that uses GenomicRanges and data.table, but still faster than the solution that uses only GenomicRanges.

intersectBedFiles.foverlaps2 <- function(bed1,bed2) {
  require(data.table)
  bedKey <- c("CHROM","START","STOP")
  if(nrow(bed1)>nrow(bed2)) {
    if(!identical(key(bed2),bedKey)) setkeyv(bed2,bedKey)
    bed <- foverlaps(bed1, bed2, nomatch = 0)
  } else {
    if(!identical(key(bed1),bedKey)) setkeyv(bed1,bedKey)
    bed <- foverlaps(bed2, bed1, nomatch = 0)
  }
  bed[,row_id:=1:nrow(bed)]
  bed[, START := pmax(START, i.START)]
  bed[, STOP := pmin(STOP, i.STOP)]
  bed[, `:=`(i.START = NULL, i.STOP = NULL)]

  setkeyv(bed,bedKey)
  temp <- foverlaps(bed,bed)

  temp[, `:=`(c("START","STOP"),list(min(START,i.START),max(STOP,i.STOP))),by=row_id]
  temp[, `:=`(c("START","STOP"),list(min(START,i.START),max(STOP,i.STOP))),by=i.row_id]
  out <- unique(temp[,.(CHROM,START,STOP)])
  setkeyv(out,bedKey)
  out
}
Michael
  • 5,808
  • 4
  • 30
  • 39