1

This is a followup of a question I asked yesterday, now extended to include more than 2 inputs. I was able to find two related answers on SO, but none of them gave quite enough information for me to solve this in a performant way.

I would like to combine a list of IRanges into a single IRanges. Here's an example input:

[[1]]
IRanges object with 2 ranges and 1 metadata column:
          start       end     width | on_betalac
      <integer> <integer> <integer> |  <logical>
  [1]         1        21        21 |      FALSE
  [2]        22        22         1 |       TRUE

[[2]]
IRanges object with 2 ranges and 1 metadata column:
          start       end     width |  on_other
      <integer> <integer> <integer> | <logical>
  [1]         1        21        21 |     FALSE
  [2]        22        22         1 |      TRUE

[[3]]
IRanges object with 1 range and 1 metadata column:
          start       end     width |    on_pen
      <integer> <integer> <integer> | <logical>
  [1]         1        22        22 |     FALSE

[[4]]
IRanges object with 3 ranges and 1 metadata column:
          start       end     width |   on_quin
      <integer> <integer> <integer> | <logical>
  [1]         1         3         3 |     FALSE
  [2]         4        13        10 |      TRUE
  [3]        14        22         9 |     FALSE

For ease of replication, the dput of this list is at the end of my post.

And my desired output is:

IRanges object with 4 ranges and 4 metadata columns:
          start       end     width | on_betalac  on_other    on_pen   on_quin
      <integer> <integer> <integer> |  <logical> <logical> <logical> <logical>
  [1]         1         3         3 |      FALSE     FALSE     FALSE     FALSE
  [2]         4        13        10 |      FALSE     FALSE     FALSE      TRUE
  [3]        14        21         8 |      FALSE     FALSE     FALSE     FALSE
  [4]        22        22         1 |       TRUE      TRUE     FALSE     FALSE

You can see that the output is kind of like the disjoin of the input, but with the mcols propagated through, so that each output row has the mcols of the input row that "gave rise" to it.

Here's my solution, which works, but is quite slow.

combine_exposures <- function(exposures) {

  cd <- do.call(what = c, args = exposures)
  mc <- mcols(cd)
  dj <- disjoin(x = cd, with.revmap = TRUE)
  r <- mcols(dj)$revmap

  d <- as.data.frame(matrix(nrow = length(dj), ncol = ncol(mc)))
  names(d) <- names(mc)

  for (i in 1:length(dj)) {
    d[i,] <- sapply(X = 1:ncol(mc), FUN = function(j) { mc[r[[i]][j], j] })
  }

  mcols(dj) <- d

  return(dj)
}

And here's the dput of the sample input:

list(new("IRanges", start = c(1L, 22L), width = c(21L, 1L), NAMES = NULL, 
    elementType = "ANY", elementMetadata = new("DataFrame", rownames = NULL, 
        nrows = 2L, listData = list(on_betalac = c(FALSE, TRUE
        )), elementType = "ANY", elementMetadata = NULL, metadata = list()), 
    metadata = list()), new("IRanges", start = c(1L, 22L), width = c(21L, 
1L), NAMES = NULL, elementType = "ANY", elementMetadata = new("DataFrame", 
    rownames = NULL, nrows = 2L, listData = list(on_other = c(FALSE, 
    TRUE)), elementType = "ANY", elementMetadata = NULL, metadata = list()), 
    metadata = list()), new("IRanges", start = 1L, width = 22L, 
    NAMES = NULL, elementType = "ANY", elementMetadata = new("DataFrame", 
        rownames = NULL, nrows = 1L, listData = list(on_pen = FALSE), 
        elementType = "ANY", elementMetadata = NULL, metadata = list()), 
    metadata = list()), new("IRanges", start = c(1L, 4L, 14L), 
    width = c(3L, 10L, 9L), NAMES = NULL, elementType = "ANY", 
    elementMetadata = new("DataFrame", rownames = NULL, nrows = 3L, 
        listData = list(on_quin = c(FALSE, TRUE, FALSE)), elementType = "ANY", 
        elementMetadata = NULL, metadata = list()), metadata = list()))
rcorty
  • 1,140
  • 1
  • 10
  • 28

1 Answers1

0

I've come up with a more efficient version, but still suspect it could be much faster.

new_combine <- function(exposures) {

  cd <- do.call(what = c, args = exposures)
  mc <- mcols(cd)
  dj <- disjoin(x = cd, with.revmap = TRUE)
  r <- mcols(dj)$revmap

  m <- as.matrix(mc)[cbind(unlist(r),
                           rep(1:length(dj), times = ncol(mc)))]


  mcols(dj) <- setNames(as.data.frame(matrix(m, nrow = length(dj), byrow = TRUE)),
                        nm = names(mc))

  return(dj)
}

I ran bench::mark and found that this version is about 3x faster. This is probably good enough for my application, but I get the sense I'm not using IRanges quite right.

expression    min   mean median     max `itr/sec` mem_alloc  n_gc n_itr total_time
  <chr>      <bch:> <bch:> <bch:> <bch:t>     <dbl> <bch:byt> <dbl> <int>   <bch:tm>
1 old        77.9ms 83.9ms 81.3ms 138.1ms      11.9    35.6KB    74    40      3.36s
2 new        27.6ms 29.1ms 28.9ms  34.2ms      34.4    10.6KB    73   252      7.32s
rcorty
  • 1,140
  • 1
  • 10
  • 28