16

I have a data.frame with start and end time:

ranges<- data.frame(start = c(65.72000,65.72187, 65.94312,73.75625,89.61625),stop = c(79.72187,79.72375,79.94312,87.75625,104.94062))

> ranges
     start      stop
1 65.72000  79.72187
2 65.72187  79.72375
3 65.94312  79.94312
4 73.75625  87.75625
5 89.61625 104.94062

In this example, the ranges in row 2 and 3 are entirely within the range between 'start' on row 1 and stop on row 4. Thus, the overlapping ranges 1-4 should be collapsed to one range:

> ranges
     start      stop
1 65.72000  87.75625
5 89.61625 104.94062

I tried this:

mdat <- outer(ranges$start, ranges$stop, function(x,y) y > x)
mdat[upper.tri(mdat)|col(mdat)==row(mdat)] <- NA
mdat

And now I just need to figure out how to combine all the true ones, but not sure if it's the best way to go

Henrik
  • 65,555
  • 14
  • 143
  • 159
Liza
  • 1,066
  • 2
  • 16
  • 26
  • 1
    does not look too easy. Have you had a look on dplyr, already? Would be my first try to solve. – Rockbar Jan 19 '17 at 17:15
  • I tried this: `mdat <- outer(ranges$start, ranges$stop, function(x,y) y > x) mdat[upper.tri(mdat)|col(mdat)==row(mdat)] <- NA mdat` And now I just need to figure out how to combine all the true ones, but not sure if it's the best way to go – Liza Jan 19 '17 at 17:22
  • what are you defining as being overlapped? – MJH Jan 19 '17 at 17:25
  • Related: [collapse intersecting regions in R](http://stackoverflow.com/questions/16957293/collapse-intersecting-regions-in-r), [Merge Overlapping Ranges into unique groups](http://stackoverflow.com/questions/15235821/merge-overlapping-ranges-into-unique-groups). – Henrik Jan 19 '17 at 17:26
  • I guess if the start of the next row falls within the range of the previous row, then I want to keep the "stop" of that next row. So if the start in the next row is 73.75625 (with the stop at 87.75625) ant the range in the previous row is 65.94312 to 79.94312, then I want to have a range as a combination of the two, basically the range which includes both: 65.94312 to 87.75625 – Liza Jan 19 '17 at 17:30

5 Answers5

21

You can try this:

library(dplyr)
ranges %>% 
       arrange(start) %>% 
       group_by(g = cumsum(cummax(lag(stop, default = first(stop))) < start)) %>% 
       summarise(start = first(start), stop = max(stop))

# A tibble: 2 × 3
#      g    start      stop
#  <int>    <dbl>     <dbl>
#1     0 65.72000  87.75625
#2     1 89.61625 104.94062
Psidom
  • 209,562
  • 33
  • 339
  • 356
  • 4
    This solves my problem perfectly but I don't quite understand how yet. Kudos to your fiddle-skills while I go through the code again trying to discern its magic. – Jemus42 Aug 28 '18 at 06:03
  • 2
    @Jemus42 the code first sorts the rows by `start` column. `lag(stop)` adds an imaginary column that contains the previous `stop` values. `cummax` will keep the highest of previous `stop` values so that you can compare with the `start` column. If the `start` value is bigger then the maximum of the preceding `stop` values you are in a new group. `cumsum` will accumulate the `TRUE`s so that you have an identifier of each group (`g`). `summarise` will aggregate by group id and take the begging and the end of the interval. Very smart solution @Psidom :) – Emer Apr 22 '19 at 09:54
  • This method helped me find all genes with any amount of overlap between their start and end locations. Thank you! – Anurag N. Sharma Mar 05 '23 at 12:06
7

Here is a data.table solution

library(data.table)
setDT(ranges)
ranges[, .(start=min(start), stop=max(stop)),
       by=.(group=cumsum(c(1, tail(start, -1) > head(stop, -1))))]
   group    start      stop
1:     1 65.72000  87.75625
2:     2 89.61625 104.94062

Here, groups are constructed by checking if the previous start is greater than stop and then using cumsum. within each group, minimum of start and maximum of stop are calculated.

lmo
  • 37,904
  • 9
  • 56
  • 69
5

With base R and melt / unstack, let's add a few more dates to make the problem more interesting and generic:

ranges<- data.frame(start = c(65.72000,65.72187, 65.94312,73.75625,89.61625,105.1,104.99),stop = c(79.72187,79.72375,79.94312,87.75625,104.94062,110.22,108.01))
ranges
#      start      stop
#1  65.72000  79.72187
#2  65.72187  79.72375
#3  65.94312  79.94312
#4  73.75625  87.75625
#5  89.61625 104.94062
#6 105.10000 110.22000
#7 104.99000 108.01000

library(reshape2)
ranges <- melt(ranges)
ranges <- ranges[order(ranges$value),]
ranges
#   variable     value
#1     start  65.72000
#2     start  65.72187
#3     start  65.94312
#4     start  73.75625
#8      stop  79.72187
#9      stop  79.72375
#10     stop  79.94312
#11     stop  87.75625
#5     start  89.61625
#12     stop 104.94062
#7     start 104.99000
#6     start 105.10000
#14     stop 108.01000
#13     stop 110.22000

Now as can be seen from above, (with one reasonable assumption that we have a start value that is smallest of all the values and a stop value that is the largest of all the values), the problem reduces to finding the pattern stop followed by a start in consecutive rows and that will be the only points of interest for us (to find the overlapping ranges) apart from the first and the last row. The following code achieves that:

indices <- intersect(which(ranges$variable=='start')-1, which(ranges$variable=='stop'))
unstack(ranges[c(1, sort(c(indices, indices+1)), nrow(ranges)),], value~variable)
#      start      stop
#1  65.72000  87.75625
#2  89.61625 104.94062
#3 104.99000 110.22000
Sandipan Dey
  • 21,482
  • 2
  • 51
  • 63
0

A tidy solution with package ivs:

library(dplyr)
library(ivs)

g <- iv_groups(iv(ranges$start, ranges$stop))
data.frame(start = iv_start(g),
           end = iv_end(g))
     start       end
1 65.72000  87.75625
2 89.61625 104.94062

also works:

ranges %>% 
  group_by(gp = iv_identify_group(iv(start, stop))) %>% 
  summarise(start = min(iv_start(gp)),
            end = max(iv_end(gp))) %>% 
  select(-gp)
Maël
  • 45,206
  • 3
  • 29
  • 67
0

lmo's otherwise good data.table answer needs one or two adjustments to work reliably. Since that edit queue is full, I'm posting this as a new answer:

library(data.table)
setDT(ranges)
ranges[
  order(start),
  .(start = start[1], stop = max(stop)),
  by = .(group = cumsum(start > cummax(shift(stop, fill = -Inf))))
]
#>    group    start      stop
#> 1:     1 65.72000  87.75625
#> 2:     2 89.61625 104.94062

The order(start) ensures that it also works if the input is not already sorted by start. Skip this if you know that the input is already sorted.

Like in Psidom's dplyr answer, I've added a cummax for the stops. This is important when a new range does not overlap the one immediately before it, but it does overlap a (longer) range before that. Here is another example that illustrates this case:

library(data.table)

# Example data.table with the following intervals:
#   12–13
#           22––24
#               24–––27
#                  26–––29
#                          30–––––––34
#                          30––32
#                                 33––35
#                                          41–––––48
ranges <- data.table(
  start = c(12, 22, 24, 26, 30, 30, 33, 41),
  stop  = c(13, 24, 27, 29, 34, 32, 35, 48)
)

ranges[
  order(start),
  .(start = start[1], stop = max(stop)),
  by = .(group = cumsum(start > cummax(shift(stop, fill = -Inf))))
]
#>    group start stop
#> 1:     1    12   13
#> 2:     2    22   29
#> 3:     3    30   35
#> 4:     4    41   48

Solutions that only consider the immediately preceding row, will list the 33–35 interval separately.

tinygreen
  • 1
  • 2