Merge overlapping ranges per group

Question

I'm trying to merge intersecting ranges of values within each of my groups (n = 147). For example:

my.df <- data.frame(chrom=c('0F','0F','4F','4F','4F','4F'), start=as.numeric(c(1405,1700,1420,2500,19116,20070)), stop=as.numeric(c(1700,2038,2527,3401,20070,20730)), strand = c('-','-','-','+','+','+'))
my.df

  chrom start  stop strand
1    0F  1405  1700      -
2    0F  1700  2038      -
3    4F  1420  2527      -
4    4F  2500  3401      +
5    4F 19116 20070      +
6    4F 20070 20730      +

I want to merge all of the overlapping ranges within each group while preserving the 'chrom' column, and only merge ranges that have the same strand. The expected result is:

  chrom start  stop strand
1    0F  1405  2038      -
2    4F  1420  2527      -
3    4F  2500  3401      +
4    4F 19116 20730      +

I've found a few methods for determining the presence of overlaps within each group (e.g., plyranges::count_overlaps), but no way to collapse those intersecting ranges together.

I've tried the method below from a previous question, but it ignores the groupings I require, so the ranges from all of my groups collapse into a single, continuous range whether or not they actually overlap. I've also tried the answers from this question, but none of them worked.

library(dplyr)
my.df %>%
  arrange(start) %>%
  group_by(g = cumsum(cummax(lag(stop, default = first(stop))) < start)) %>%
  summarise(start = first(start), stop = max(stop))

     start      end
1     1405    20730
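
The pipeline above sorts and merges across the whole data frame, because nothing in it groups by chrom or strand. As a rough sketch only, the same running-maximum idea can be applied separately to each chrom/strand group in base R. `merge_group` is a hypothetical helper name (not from any package), and the sketch assumes that ranges sharing an endpoint (e.g. 1405-1700 and 1700-2038) count as overlapping.

```r
# Example data from the question
my.df <- data.frame(
  chrom  = c("0F", "0F", "4F", "4F", "4F", "4F"),
  start  = c(1405, 1700, 1420, 2500, 19116, 20070),
  stop   = c(1700, 2038, 2527, 3401, 20070, 20730),
  strand = c("-", "-", "-", "+", "+", "+")
)

# Hypothetical helper: merge the ranges of ONE chrom/strand group.
merge_group <- function(d) {
  d <- d[order(d$start), ]
  # Largest stop seen among all earlier rows (-Inf for the first row)
  prev_max <- c(-Inf, head(cummax(d$stop), -1))
  # A new run starts whenever start lies beyond every earlier stop
  run <- cumsum(d$start > prev_max)
  data.frame(
    chrom  = d$chrom[1],
    start  = as.numeric(tapply(d$start, run, min)),
    stop   = as.numeric(tapply(d$stop, run, max)),
    strand = d$strand[1]
  )
}

# Split by chrom and strand, merge within each piece, then recombine
groups <- split(my.df, list(my.df$chrom, my.df$strand), drop = TRUE)
merged <- do.call(rbind, lapply(groups, merge_group))
merged <- merged[order(merged$chrom, merged$start), ]
rownames(merged) <- NULL
merged
```

On the example data this should reproduce the four expected rows. Unlike `GenomicRanges::reduce()`, whose default `min.gapwidth = 1` also joins directly adjacent integer ranges such as 1-5 and 6-10, this sketch only merges ranges whose start is less than or equal to an earlier stop.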

`dplyr` is not well suited for merging based on ranges, look into `fuzzyjoin`, `data.table`, or `sqldf` for robust solutions. — r2evans, Mar 09 '22 at 19:30
For examples, see https://stackoverflow.com/a/64543600/3358272 and https://stackoverflow.com/a/70585103/3358272 and https://stackoverflow.com/a/64284142/3358272. — r2evans, Mar 09 '22 at 19:30

Martin Morgan · Accepted Answer · 2022-03-09T19:46:42.193

I used the Bioconductor GenomicRanges package, which seems highly appropriate to your domain.


> ## install.packages("BiocManager")
> ## BiocManager::install("GenomicRanges")
> library(GenomicRanges)
> my.df |> as("GRanges") |> reduce()
GRanges object with 5 ranges and 0 metadata columns:
      seqnames      ranges strand
         <Rle>   <IRanges>  <Rle>
  [1]       4F   2500-3401      +
  [2]       4F 19116-20730      +
  [3]       4F   1420-2527      -
  [4]       0F   1405-1700      -
  [5]       0F   1727-2038      -
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

which differs from your expectation because the two 0F ranges (1405-1700 and 1727-2038) do not overlap?
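
If the result is needed as a data frame with the question's column names, one possible follow-up is sketched below. It assumes that `as.data.frame()` on a `GRanges` returns the columns `seqnames`, `start`, `end`, `width` and `strand`, and it uses the corrected example data from the question (second 0F range starting at 1700). The final sort is only there to make the row order predictable.

```r
library(GenomicRanges)

# Corrected example data from the question
my.df <- data.frame(
  chrom  = c("0F", "0F", "4F", "4F", "4F", "4F"),
  start  = c(1405, 1700, 1420, 2500, 19116, 20070),
  stop   = c(1700, 2038, 2527, 3401, 20070, 20730),
  strand = c("-", "-", "-", "+", "+", "+")
)

# Strand-aware merge, then back to a plain data frame
reduced <- my.df |>
  as("GRanges") |>
  reduce() |>
  as.data.frame()

# Rename to the question's columns and drop the width column
merged <- data.frame(
  chrom  = as.character(reduced$seqnames),
  start  = reduced$start,
  stop   = reduced$end,
  strand = as.character(reduced$strand)
)
merged <- merged[order(merged$chrom, merged$start), ]
rownames(merged) <- NULL
merged
```

Note that `reduce()` drops metadata columns, so any extra columns in the real data would have to be re-attached separately.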

edited Mar 09 '22 at 19:46

answered Mar 09 '22 at 19:40

Martin Morgan


Ah, that's exactly what I was looking for! I must have missed it when I was looking through the GenomicRanges docs. The non-overlapping 0F ranges were due to a typo (corrected in the original question). The reduce function works perfectly on my real data. Thank you! – cbg Mar 09 '22 at 21:23
