4

I am wondering what would be the best way to define all ranges which are not covered by given set of ranges. For example, if I have a set of genes with known coordinates:

dtGenes <- fread(
  "id,start,end
 1,1000,1300
 2,1200,1500
 3,1600,2600
 4,3000,4000
")

Let's say I know that total length of the chromosome (for simplicity, assume they are all on the same chromosome) is 10000. So, finally I expect to have the following list of intergenic regions:

"startR,endR
    0,1000
 1500,1600
 2600,3000
 4000,10000
"

can Bioconductor's IRange be useful here? or there is some other good way to solve this?

Vasily A
  • 8,256
  • 10
  • 42
  • 76

2 Answers2

4

Using the Bioconductor package GenomicRanges, transform your original data to a GRanges

library(GenomicRanges)
gr <- with(dtGenes, GRanges("chr1", IRanges(start, end, names=id),
                            seqlengths=c(chr1=10000)))

Then find the gaps between your genes

gaps <- gaps(gr)

GRanges knows about strand. You didn't specify a strand in the GRanges constructor, so strand was assigned *. There are therefore 'gaps' on the +, -, and * strands, and you're only interested in those on the * strand

> gaps[strand(gaps) == "*"]
GRanges with 4 ranges and 0 metadata columns:
      seqnames        ranges strand
         <Rle>     <IRanges>  <Rle>
  [1]     chr1 [   1,   999]      *
  [2]     chr1 [1501,  1599]      *
  [3]     chr1 [2601,  2999]      *
  [4]     chr1 [4001, 10000]      *
  ---
  seqlengths:
    chr1
   10000

Note the Bioconductor convention that chromosomes start at 1, and that the ranges are closed -- the start and end coordinates are included in the range. Use shift and narrow on gr to make your ranges consistent with Bioconductor conventions. GRanges operations are efficient on 10's of millions of ranges.

Martin Morgan
  • 45,935
  • 7
  • 84
  • 112
1

You can use reduce from IRanges package

reduce first orders the ranges in x from left to right, then merges the overlapping or adjacent ones.

library(IRanges)
dat <- read.csv(text="id,start,end
 1,1000,1300
 2,1200,1500
 3,1600,2600
 4,3000,4000
")

ir <- IRanges(dat$start,dat$end)
rir <- reduce(ir)
IRanges of length 3
    start  end width
[1]  1000 1500   501
[2]  1600 2600  1001
[3]  3000 4000  1001
agstudy
  • 119,832
  • 17
  • 199
  • 261
  • thanks for pointing! `reduce` doesn't solve completely the task of finding the outside regions, but it does important first step of it; it will be definitely useful for me in some other cases, but for this given task I like very much the solution with `gaps` which does all job by single operation. – Vasily A Nov 28 '13 at 20:20
  • @VasilyA I don't use a lot iranges package. So of course MartinMorgan solution is the way to go here! – agstudy Nov 28 '13 at 20:49