3

I am trying to collapse a series of intervals into fewer, equally meaningful intervals.

Consider for example this list of intervals

Intervals = list(
  c(23,34),
  c(45,48),
  c(31,35),
  c(7,16),
  c(5,9),
  c(56,57),
  c(55,58)
)

Because the intervals overlap, the same intervals can be described with few vectors. Plotting these intervals make obvious that a list of 4 vectors would be enough

plot(1,1,type="n",xlim=range(unlist(Intervals)),ylim=c(0.9,1.1))
segments(
    x0=sapply(Intervals,"[",1),
    x1=sapply(Intervals,"[",2),
    y0=rep(1,length(Intervals)),
    y1=rep(1,length(Intervals)),
    lwd=10
    )

enter image description here

How can I reduce my Intervals list to carry the same info than the one displayed on the plot? (performance matter)

The desired outputs for the above example is

Intervals = list(
  c(5,16)
  c(23,35),
  c(45,48),
  c(55,58)
)
zx8754
  • 52,746
  • 12
  • 114
  • 209
Remi.b
  • 17,389
  • 28
  • 87
  • 168
  • Not the same, but _possibly_ some relevant pieces in [this Q&A](http://stackoverflow.com/questions/27574775/is-it-possible-to-use-the-r-data-table-funcion-foverlaps-to-find-the-intersectio) – Henrik Jan 26 '16 at 08:00
  • @zx8754 how does this merit the `bioinformatics` tag? – s_baldur Jun 26 '19 at 10:07
  • 1
    @sindri_baldur Feel free to vote to rollback, but Remi's field is genetics, and accepted solution from Ven (also genetics) is from bioconductor IRranges package. Plus I came across to this post form other bioinformatics related post. – zx8754 Jun 26 '19 at 10:22

4 Answers4

4

What you need is the reduce function in the IRanges package.

In.df <- do.call(rbind, Intervals)
library(IRanges)

In.ir <- IRanges(In.df[, 1], In.df[,2])

out.ir <- reduce(In.ir)
out.ir
# IRanges of length 4
#     start end width
# [1]     5  16    12
# [2]    23  35    13
# [3]    45  48     4
# [4]    55  58     4
Ven Yao
  • 3,680
  • 2
  • 27
  • 42
0

One option with base R:

First I put your list in a data.frame:

ints <- as.data.frame(do.call(rbind, Intervals))
names(ints) <- c('start', 'stop')

so it looks like

  start stop
1    23   34
2    45   48
3    31   35
4     7   16
5     5    9
6    56   57
7    55   58

Now, two for loops compare with between, and expand an interval when a crossover is found:

for(x in 1:nrow(ints)){
  for(y in 1:nrow(ints)){
    if(between(ints$start[x], ints$start[y], ints$stop[y])){
      ints$start[x] <- ints$start[y]
      if(ints$stop[y] > ints$stop[x]){
        ints$stop[x] <- ints$stop[y]
      } else {
        ints$stop[y] <- ints$stop[x]
      }
    }
  }
}

which alters ints to

> ints
  start stop
1    23   35
2    45   48
3    23   35
4     5   16
5     5   16
6    55   58
7    55   58

Simplify to unique cases:

ints <- unique(ints, margin = 1)

and put them in order

ints <- ints[order(ints$start),]

which leaves you with

> ints
  start stop
4     5   16
1    23   35
2    45   48
6    55   58

If you want it back in a list like the original,

Intervals <- lapply(1:nrow(ints), function(x)c(ints[x,1], ints[x,2]))

(Note: You can certainly do this with *apply instead of for, Booleans instead of between, and the original list instead of a data.frame, but, well, this is readable. Rewrite/optimize as you like.)

alistaire
  • 42,459
  • 4
  • 77
  • 117
  • Side note: `stop` is probably a bad name for a variable, but I got too far in to change it. Sorry! – alistaire Jan 26 '16 at 04:10
  • Your comment is totally unclear. `stop` is not a variable, it is the name of a column. –  Jan 26 '16 at 04:26
  • 1
    From `?data.frame`: "A data frame is a list of variables of the same number of rows with unique row names, given class "data.frame"." That said, `stop` is a bad name regardless. – alistaire Jan 26 '16 at 04:35
  • Try to access `stop` typing `stop` in your R environment. –  Jan 26 '16 at 06:03
  • `between` is a dplyr function btw – dash2 Nov 05 '21 at 11:04
0

With ivs and iv_groups() for merging overlapping intervals within a single interval vector

library(ivs)
library(purrr)

x = list(
  c(23,34),
  c(45,48),
  c(31,35),
  c(7,16),
  c(5,9),
  c(56,57),
  c(55,58)
)

x <- list_transpose(x)
x
#> [[1]]
#> [1] 23 45 31  7  5 56 55
#> 
#> [[2]]
#> [1] 34 48 35 16  9 57 58

x <- iv(x[[1]], x[[2]])
x
#> <iv<double>[7]>
#> [1] [23, 34) [45, 48) [31, 35) [7, 16)  [5, 9)   [56, 57) [55, 58)

iv_groups(x)
#> <iv<double>[4]>
#> [1] [5, 16)  [23, 35) [45, 48) [55, 58)

The benefit of this over the great IRanges package is that it also works natively on any vector type supported in the tidyverse, like dates.

Davis Vaughan
  • 2,780
  • 9
  • 19
0

I use following function to merge intervals

merge_intervals <- function(start, end) {
    order_idx <- order(start, end)
    start <- start[order_idx]
    end <- end[order_idx]
    len <- length(start)
    if (len >= 2L) {
        groups <- cumsum(c(0L, end[1:(len - 1L)] < start[-1L]))
        if (anyDuplicated(groups)) {
            groups <- factor(groups)
            start <- vapply(split(start, groups), min,
                numeric(1L),
                USE.NAMES = FALSE
            )
            end <- vapply(split(end, groups), max,
                numeric(1L),
                USE.NAMES = FALSE
            )
            Recall(start, end)
        } else {
            list(start, end)
        }
    } else {
        list(start, end)
    }
}

The output corresponding the start and end of the intervals

    x <- list(
        c(23, 34),
        c(45, 48),
        c(31, 35),
        c(7, 16),
        c(5, 9),
        c(56, 57),
        c(55, 58)
    )
    x <- data.table::transpose(x)
    merge_intervals(x[[1L]], x[[2L]])
#> [[1]]
#> [1]  5 23 45 55
#> 
#> [[2]]
#> [1] 16 35 48 58
Yun
  • 1
  • 1