I'm trying to merge intersecting ranges of values within each of my groups (n = 147). For example:

my.df <- data.frame(
  chrom  = c('0F','0F','4F','4F','4F','4F'),
  start  = as.numeric(c(1405,1700,1420,2500,19116,20070)),
  stop   = as.numeric(c(1700,2038,2527,3401,20070,20730)),
  strand = c('-','-','-','+','+','+')
)
my.df

  chrom start  stop strand
1    0F  1405  1700      -
2    0F  1700  2038      -
3    4F  1420  2527      -
4    4F  2500  3401      +
5    4F 19116 20070      +
6    4F 20070 20730      +

I want to merge all overlapping ranges within each group while preserving the 'chrom' column, and only merge ranges that are on the same strand. The expected result is:

  chrom start  stop strand
1    0F  1405  2038      -
2    4F  1420  2527      -
3    4F  2500  3401      +
4    4F 19116 20730      +

I've found a few methods for determining the presence of overlaps within each group (e.g., plyranges::count_overlaps), but no way to collapse those intersecting ranges together.

I've tried the method below from a previous question, but it ignores the groupings I require, so the ranges from all groups collapse into a single continuous range regardless of whether they actually overlap. I've also tried the answers from this question, but none of them worked.

library(dplyr)

# Note: nothing here groups by chrom or strand
my.df %>%
  arrange(start) %>%
  group_by(g = cumsum(cummax(lag(stop, default = first(stop))) < start)) %>%
  summarise(start = first(start), stop = max(stop))

     start      end
1     1405    20730 
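
The collapse to a single range happens because the pipeline never groups by chrom or strand, so the running maximum of stop carries over from one group into the next. A minimal sketch of a group-aware variant of the same idea follows. It assumes dplyr >= 1.0 (for `.groups`), and `merged` is just an illustrative name. Ranges that share an endpoint are treated as overlapping here:

```r
library(dplyr)

my.df <- data.frame(
  chrom  = c("0F", "0F", "4F", "4F", "4F", "4F"),
  start  = c(1405, 1700, 1420, 2500, 19116, 20070),
  stop   = c(1700, 2038, 2527, 3401, 20070, 20730),
  strand = c("-", "-", "-", "+", "+", "+")
)

merged <- my.df %>%
  # Work within each chrom/strand combination
  group_by(chrom, strand) %>%
  arrange(start, .by_group = TRUE) %>%
  # Start a new cluster whenever a range begins after the running max of stop
  mutate(g = cumsum(cummax(lag(stop, default = first(stop))) < start)) %>%
  # Collapse each cluster into a single range
  group_by(chrom, strand, g) %>%
  summarise(start = first(start), stop = max(stop), .groups = "drop") %>%
  select(chrom, start, stop, strand)

merged
```

On the example data this should give the four ranges listed as the expected result, though the row order may differ.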
cbg
  • `dplyr` is not well suited for merging based on ranges, look into `fuzzyjoin`, `data.table`, or `sqldf` for robust solutions. – r2evans Mar 09 '22 at 19:30
  • For examples, see https://stackoverflow.com/a/64543600/3358272 and https://stackoverflow.com/a/70585103/3358272 and https://stackoverflow.com/a/64284142/3358272. – r2evans Mar 09 '22 at 19:30

1 Answer

I used the Bioconductor GenomicRanges package, which seems highly appropriate to your domain.


> ## install.packages("BiocManager")
> ## BiocManager::install("GenomicRanges")
> library(GenomicRanges)
> my.df |> as("GRanges") |> reduce()
GRanges object with 5 ranges and 0 metadata columns:
      seqnames      ranges strand
         <Rle>   <IRanges>  <Rle>
  [1]       4F   2500-3401      +
  [2]       4F 19116-20730      +
  [3]       4F   1420-2527      -
  [4]       0F   1405-1700      -
  [5]       0F   1727-2038      -
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

This differs from your expected output, presumably because the two 0F ranges do not overlap?
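
If a plain data frame is needed afterwards, the reduced GRanges can be converted back. Here is a minimal sketch, assuming the 0F start of 1700 shown in the question's data (`gr` and `merged_df` are illustrative names). Note that `reduce()` by default also merges ranges that are directly adjacent, not only those that overlap:

```r
library(GenomicRanges)

my.df <- data.frame(
  chrom  = c("0F", "0F", "4F", "4F", "4F", "4F"),
  start  = c(1405, 1700, 1420, 2500, 19116, 20070),
  stop   = c(1700, 2038, 2527, 3401, 20070, 20730),
  strand = c("-", "-", "-", "+", "+", "+")
)

# Same coercion as above: chrom/start/stop/strand are recognised automatically
gr <- as(my.df, "GRanges")

# reduce() merges ranges per sequence and per strand
merged_df <- as.data.frame(reduce(gr))

# Columns are seqnames, start, end, width, strand
merged_df
```

The seqnames and end columns can be renamed to chrom and stop if the original column names matter downstream.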

Martin Morgan
  • Ah, that's exactly what I was looking for! Must've missed it when I was looking through the GenomicRanges docs. The non-overlapping 0F ranges were due to a typo (corrected in the original question). The reduce function works perfectly on my real data. Thank you! – cbg Mar 09 '22 at 21:23