13

I have two data.frames each with three columns: chrom, start & stop, let's call them rangesA and rangesB. For each row of rangesA, I'm looking to find which (if any) row in rangesB fully contains the rangesA row - by which I mean rangesAChrom == rangesBChrom, rangesAStart >= rangesBStart and rangesAStop <= rangesBStop.

Right now I'm doing the following, which I just don't like very much. Note that I'm looping over the rows of rangesA for other reasons, but none of those reasons are likely to be a big deal, it just ends up making things more readable given this particular solution.

rangesA:

chrom   start   stop
 5       100     105
 1       200     250
 9       275     300

rangesB:

chrom    start    stop
  1       200      265
  5       99       106
  9       275      290

for each row in rangesA:

matches <- which((rangesB[,'chrom']  == rangesA[row,'chrom']) &&
                 (rangesB[,'start'] <= rangesA[row, 'start']) &&
                 (rangesB[,'stop'] >= rangesA[row, 'stop']))

I figure there's got to be a better (and by better, I mean faster over large instances of rangesA and rangesB) way to do this than looping over this construct. Any ideas?

zx8754
  • 52,746
  • 12
  • 114
  • 209
geoffjentry
  • 4,674
  • 3
  • 31
  • 37

6 Answers6

21

Use the IRanges/GenomicRanges packages from Bioconductor, which is made for dealing with these exact problems (and scales massively)

source("http://bioconductor.org/biocLite.R")
biocLite("IRanges")

There are a few appropriate containers for ranges on different chromosomes, one is RangesList

library(IRanges)
rangesA <- split(IRanges(rangesA$start, rangesA$stop), rangesA$chrom)
rangesB <- split(IRanges(rangesB$start, rangesB$stop), rangesB$chrom)
#which rangesB wholly contain at least one rangesA?
ov <- countOverlaps(rangesB, rangesA, type="within")>0
Aaron Statham
  • 2,048
  • 1
  • 15
  • 16
  • 1
    Good pointer on IRanges, forgot about that. I didn't end up going with this as it didn't fit my own situation for a variety of reasons, but still good to know. My toy example left out a couple of key bits (my own fault) that made working w/ IRanges difficult for me to figure out, and the merge() solution provided a massive speedup as it was. Also, while it scales massively we also see lots of very small cases and what I'm assuming was the overhead of the S4 was slowing it down in those cases. – geoffjentry Oct 15 '10 at 20:07
  • 1
    is there anyway to count also partially overlaps? – Cina Dec 29 '14 at 05:30
14

This would be a lot easier / faster if you can merge the two objects first.

ranges <- merge(rangesA,rangesB,by="chrom",suffixes=c("A","B"))
ranges[with(ranges, startB <= startA & stopB >= stopA),]
#  chrom startA stopA startB stopB
#1     1    200   250    200   265
#2     5    100   105     99   106
Joshua Ulrich
  • 173,410
  • 32
  • 338
  • 418
12

The data.table package has a function foverlaps() which is capable of merging over interval ranges since v1.9.4:

require(data.table)
setDT(rangesA)
setDT(rangesB)

setkey(rangesB)
foverlaps(rangesA, rangesB, type="within", nomatch=0L)
#    chrom start stop i.start i.stop
# 1:     5    99  106     100    105
# 2:     1   200  265     200    250
  • setDT() converts data.frame to data.table by reference

  • setkey() sorts the data.table by the columns provided (in this case all columns, since we did not provide any), and marks those columns as sorted, which we'll use later to perform the join on.

  • foverlaps() does the overlapping join efficiently. See this answer for a detailed explanation and comparison to other approaches.

Community
  • 1
  • 1
Arun
  • 116,683
  • 26
  • 284
  • 387
5

I add the dplyr solution.

library(dplyr)
inner_join(rangesA, rangesB, by="chrom") %>% 
  filter(start.y < start.x | stop.y > stop.x)

Output:

  chrom start.x stop.x start.y stop.y
1     5     100    105      99    106
2     1     200    250     200    265
Joe
  • 8,073
  • 1
  • 52
  • 58
  • Imagine a situation in which rangesA have 20k rows, and rangesB have 300k rows -> insanse huge join -> imposible to fit in R RAM memory. Solutions with IRanges are better – Marcin Dec 05 '16 at 13:46
4

RangesA and RangesB are clearly BED syntax, this can be done outside R in the command line with BEDtools, extremely fast and flexible with a dozen other options to work with genomic intervals. Then put the results back again into R.

https://code.google.com/p/bedtools/

animuson
  • 53,861
  • 28
  • 137
  • 147
mikyatope
  • 338
  • 1
  • 8
  • speaking from experience, if it's a dataset size that's reasonable to run in R then it's more reproducible to stay in one programming environment than hopping in and out of tools. – lmrta Jul 14 '23 at 08:34
2

For your example data:

rangesA <- data.frame(
    chrom = c(5, 1, 9),
    start = c(100, 200, 275),
    stop = c(105, 250, 300)
)
rangesB <- data.frame(
    chrom = c(1, 5, 9),
    start = c(200, 99, 275),
    stop = c(265, 106, 290)
)

This will do it with sapply, such that each column is one row in rangesA and each row is corresponding row in rangesB:

> sapply(rangesA$stop, '>=', rangesB$start) & sapply(rangesA$start, '<=', rangesB$stop)
      [,1]  [,2]  [,3]
[1,] FALSE  TRUE FALSE
[2,]  TRUE FALSE FALSE
[3,] FALSE FALSE  TRUE
eyjo
  • 1,180
  • 6
  • 8