2

I have a data table in R which I want to count the number of overlapping windows for. This is essentially a pileup command and seems to be something that could be done using bedtools, but I can't figure out how to do so without leaving R.

Is there an R function to do this already, or any suggestions about what may be an efficient approach?

In case it's helpful, here is a mini example of what I'm trying to do. Thanks in advance!

input:

chrom    start     end
1        1         100
1        50        150

returns:

chrom      start    end    count
1        1          49      1
1        50         100     2
1        101        150     1
StupidWolf
  • 45,075
  • 17
  • 40
  • 72
Jautis
  • 459
  • 6
  • 13
  • Check out the answers here: https://www.biostars.org/p/173502/ – MrFlick Jun 25 '20 at 21:35
  • 2
    Possible duplicate: https://stackoverflow.com/questions/55679894/counting-overlaps-as-expected-with-r-data-table-foverlaps-or-iranges – MrFlick Jun 25 '20 at 21:36
  • 2
    Related: [Overlap join with start and end positions](https://stackoverflow.com/questions/24480031/overlap-join-with-start-and-end-positions), with nice `data.table` answers. – Henrik Jun 25 '20 at 21:41

2 Answers2

3

While most linked answers use findOverlaps or similar in GRanges context, the easiest approach for me would be to use coverage. It is easy to switch back and forth, to pick the framework that is most convenient or efficient.

library(GenomicRanges)

gr <- GRanges(1, IRanges(c(1,50), c(100,150)))
gr.cov <- as(coverage(gr), "GRanges")
gr.cov
#> GRanges object with 3 ranges and 1 metadata column:
#>       seqnames    ranges strand |     score
#>          <Rle> <IRanges>  <Rle> | <integer>
#>   [1]        1      1-49      * |         1
#>   [2]        1    50-100      * |         2
#>   [3]        1   101-150      * |         1
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome

# if you want to have a data.table
DT <- data.table::as.data.table(gr.cov)
DT
#>    seqnames start end width strand score
#> 1:        1     1  49    49      *     1
#> 2:        1    50 100    51      *     2
#> 3:        1   101 150    50      *     1

# convert back to GRanges:
with(DT, GRanges(seqnames, IRanges(start, end), score=score))
#> GRanges object with 3 ranges and 1 metadata column:
#>       seqnames    ranges strand |     score
#>          <Rle> <IRanges>  <Rle> | <integer>
#>   [1]        1      1-49      * |         1
#>   [2]        1    50-100      * |         2
#>   [3]        1   101-150      * |         1
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Created on 2020-06-25 by the reprex package (v0.3.0)

user12728748
  • 8,106
  • 2
  • 9
  • 14
1

You can use GenomicRanges, first we create the GRanges object:

library(GenomicRanges)
gr = GRanges(seqnames=c(1,1),IRanges(start=c(1,50),end=c(100,150)))

Now you calculate the coverage which returns you an rle object:

COV = coverage(gr)
class(COV)

[1] "SimpleRleList"
attr(,"package")
[1] "IRanges"

What this tells you is for "1", there are 49 runs of 1, 51 runs of 2 and 50 runs of 1. which is more or less what you need, except you need to have it in a data.frame formt.

COV
RleList of length 1
$`1`
integer-Rle of length 150 with 3 runs
  Lengths: 49 51 50
  Values :  1  2  1

To manipulate this, it's better we write a function:

COV2bg = function(cov_obj){

  allchr = lapply(names(cov_obj),function(i){
              ends = cumsum(cov_obj[[i]]@lengths)
              GRanges(
                  seqnames=i,
                  IRanges(start=c(1,ends[-length(ends)]),end=ends),
                  value = cov_obj[[i]]@values
                   )
              })

 Reduce(c,allchr)
}

Then it's a matter of applying it onto the rle object:

as.data.frame(COV2bg(COV))
  seqnames start end width strand value
1        1     1  49    49      *     1
2        1    49 100    52      *     2
3        1   100 150    51      *     1
StupidWolf
  • 45,075
  • 17
  • 40
  • 72