
I'll start with an example, and then describe the logic I'm trying to use.

I have two normal IRanges objects that span the same total range, but may do so in a different number of ranges. Each IRanges has one mcol, but that mcol is different across IRanges.

a
#IRanges object with 1 range and 1 metadata column:
#          start       end     width | on_betalac
#      <integer> <integer> <integer> |  <logical>
#  [1]         1       167       167 |      FALSE
b
#IRanges object with 3 ranges and 1 metadata column:
#          start       end     width |  on_other
#      <integer> <integer> <integer> | <logical>
#  [1]         1       107       107 |     FALSE
#  [2]       108       112         5 |      TRUE
#  [3]       113       167        55 |     FALSE

You can see both of these IRanges span 1 to 167, but a has one range and b has three. I would like to combine them to get output like this:

my_great_function(a, b)
#IRanges object with 3 ranges and 2 metadata columns:
#          start       end     width | on_betalac  on_other
#      <integer> <integer> <integer> |  <logical> <logical>
#  [1]         1       107       107 |     FALSE     FALSE
#  [2]       108       112         5 |     FALSE      TRUE
#  [3]       113       167        55 |     FALSE     FALSE

The output is like a disjoin of the inputs, but it keeps the mcols and spreads them, so that each output range has the same mcol value as the input range it came from.

Maurits Evers
rcorty

2 Answers


Option 1: Using IRanges::findOverlaps

m <- findOverlaps(b, a)
c <- b[queryHits(m)]
mcols(c) <- cbind(mcols(c), mcols(a[subjectHits(m)]))
c
#IRanges object with 3 ranges and 2 metadata columns:
#          start       end     width |  on_other on_betacalc
#      <integer> <integer> <integer> | <logical>   <logical>
#  [1]         1       107       107 |     FALSE       FALSE
#  [2]       108       112         5 |      TRUE       FALSE
#  [3]       113       167        55 |     FALSE       FALSE

The resulting object c is an IRanges object with two metadata columns.

Option 2: Using IRanges::mergeByOverlaps

c <- mergeByOverlaps(b, a)
c
#DataFrame with 3 rows and 4 columns
#          b  on_other         a on_betacalc
#  <IRanges> <logical> <IRanges>   <logical>
#1     1-107     FALSE     1-167       FALSE
#2   108-112      TRUE     1-167       FALSE
#3   113-167     FALSE     1-167       FALSE

The resulting output object is a DataFrame with IRanges columns and original metadata columns as additional columns.
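
If an IRanges is needed rather than a DataFrame, the pieces can probably be put back together by hand. This is only a minimal sketch: it rebuilds the sample data so that it runs on its own, it assumes the first column is named `b` as in the output above, and `merged` and `res` are illustrative names.

```r
library(IRanges)

# Sample data, as in the "Sample data" section of this answer
a <- IRanges(1, 167)
mcols(a)$on_betacalc <- FALSE
b <- IRanges(c(1, 108, 113), c(107, 112, 167))
mcols(b)$on_other <- c(FALSE, TRUE, FALSE)

merged <- mergeByOverlaps(b, a)

# Take the ranges of b from the merged table ...
res <- merged$b
# ... and attach only the two metadata columns, dropping the a ranges
mcols(res) <- merged[, c("on_other", "on_betacalc")]
res
```

Like Option 1, this keeps the ranges of b unchanged, so it only matches the requested output when every range of b lies inside a range of a.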

Option 3: Using data.table::foverlaps

library(data.table)
a.dt <- as.data.table(cbind.data.frame(a, mcols(a)))[, width := NULL]
b.dt <- as.data.table(cbind.data.frame(b, mcols(b)))[, width := NULL]

setkey(b.dt, start, end)
foverlaps(a.dt, b.dt, type = "any")[, `:=`(i.start = NULL, i.end = NULL)][]
#   start end on_other on_betacalc
#1:     1 107    FALSE       FALSE
#2:   108 112     TRUE       FALSE
#3:   113 167    FALSE       FALSE

The resulting object is a data.table.

Option 4: Using fuzzyjoin::interval_left_join

library(fuzzyjoin)
a.df <- cbind.data.frame(a, mcols(a))
b.df <- cbind.data.frame(b, mcols(b))
interval_left_join(b.df, a.df, by = c("start", "end"))
#  start.x end.x width.x on_other start.y end.y width.y on_betacalc
#1       1   107     107    FALSE       1   167     167       FALSE
#2     108   112       5     TRUE       1   167     167       FALSE
#3     113   167      55    FALSE       1   167     167       FALSE

The resulting object is a data.frame.


Sample data

library(IRanges)
a <- IRanges(1, 167)
mcols(a)$on_betacalc <- FALSE

b <- IRanges(c(1, 108, 113), c(107, 112, 167))
mcols(b)$on_other <- c(FALSE, TRUE, FALSE)
Maurits Evers
  • Thank you, this seems very promising, especially option 1. I am now considering a situation with more than two IRanges as input. Would that be a simple extension of option #1 or should I make a new SO question? – rcorty Apr 09 '19 at 01:50
  • You're very welcome @rcorty; yes, merging more than 2 `IRanges` is a bit more challenging; I remember answering a similar question here on SO a while back, perhaps take some time searching for related questions here (and on Biostars/SeqAnswers). The order of the merge will matter in the case of 3 `IRanges`; obviously you're welcome to ask a new question, best to include reproducible sample data with `dput` (which means that we don't have to manually construct sample data like I did here). – Maurits Evers Apr 09 '19 at 01:54
  • [continued] As to the different options I give above: They all give you the same result, so there's really no major advantage of one solution over the other (mind you I haven't benchmarked the different options in terms of performance). – Maurits Evers Apr 09 '19 at 01:56
  • Thank you, I will have a look through related questions. I am very novice with IRanges. The only reason #1 appeals to me is that the output is in precisely the format I want, so don't have to do any wrangling. Can you comment on my proposed solution (below)? – rcorty Apr 09 '19 at 01:59
  • @rcorty I guess you mean the solution you posted yourself? There seems to be a lot more data wrangling going on than in the options I posted. Simply `cbind`ing the metadata columns (as in option 1) should work for *any* number of metadata columns. I think the `for` loop plus `sapply` is not necessary and might make things very slow for larger `IRanges` objects. I imagine the `IRanges` options to be very fast; ditto for the `data.table` solution; the `fuzzyjoin::interval_join` is actually based on `IRanges::findOverlaps` so should be similarly performant. – Maurits Evers Apr 09 '19 at 02:24
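
For the case of more than two inputs raised in the comments, one possible extension of Option 1 is to disjoin all inputs first and then look up each input's metadata for every disjoint piece. This is a rough sketch rather than a tested recipe: `combine_mcols` is a hypothetical helper (not part of IRanges), and it assumes that every input is a normal IRanges covering the same total range, so each piece overlaps exactly one range of each input.

```r
library(IRanges)

# Hypothetical helper: combine any number of IRanges onto their common
# disjoint pieces, keeping the metadata columns of every input.
combine_mcols <- function(...) {
  inputs <- unname(list(...))

  # Drop metadata before concatenating so differing mcols cannot clash
  bare <- lapply(inputs, function(x) { mcols(x) <- NULL; x })
  pieces <- disjoin(do.call(c, bare))

  # For each input, find the range covering each piece and take its mcols
  cols <- lapply(inputs, function(x) {
    idx <- findOverlaps(pieces, x, select = "first")
    # Assumption: all inputs span the same total range, so nothing is missing
    stopifnot(!anyNA(idx))
    mcols(x)[idx, , drop = FALSE]
  })

  mcols(pieces) <- do.call(cbind, cols)
  pieces
}

# Usage with the two ranges from the question
a <- IRanges(1, 167)
mcols(a)$on_betalac <- FALSE
b <- IRanges(c(1, 108, 113), c(107, 112, 167))
mcols(b)$on_other <- c(FALSE, TRUE, FALSE)

res <- combine_mcols(a, b)
res
```

Because the pieces come from `disjoin`, the order of the inputs only affects the column order, not the ranges.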

Here's what I've been able to come up with. It is not as elegant as Maurits Evers's answer, but it may be useful to others in some way.

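combine_exposures <- function(...) {

  # Concatenate all inputs; the dput output below shows the combined mcols,
  # with NA wherever an input lacks a column
  cd <- c(...)
  mc <- mcols(cd)
  # Split into disjoint pieces; revmap lists the input ranges covering each piece
  dj <- disjoin(x = cd, with.revmap = TRUE)
  r <- mcols(dj)$revmap

  d <- as.data.frame(matrix(nrow = length(dj), ncol = ncol(mc)))
  names(d) <- names(mc)

  # Assumes the j-th contributing range of each piece carries the j-th column,
  # i.e. one column per input, with inputs passed in column order
  for (i in seq_along(dj)) {
    d[i, ] <- sapply(X = seq_len(ncol(mc)), FUN = function(j) { mc[r[[i]][j], j] })
  }

  mcols(dj) <- d
  return(dj)
}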

Here is dput(c(e1, e2, e3, e4)), where e1, e2, e3, and e4 are some example IRanges that all span 1 to 167:

new("IRanges", start = c(1L, 1L, 108L, 113L, 1L, 1L), width = c(167L, 
107L, 5L, 55L, 167L, 167L), NAMES = NULL, elementType = "ANY", 
    elementMetadata = new("DataFrame", rownames = NULL, nrows = 6L, 
        listData = list(on_betalac = c(FALSE, NA, NA, NA, NA, 
        NA), on_other = c(NA, FALSE, TRUE, FALSE, NA, NA), on_pen = c(NA, 
        NA, NA, NA, FALSE, NA), on_quin = c(NA, NA, NA, NA, NA, 
        FALSE)), elementType = "ANY", elementMetadata = NULL, 
        metadata = list()), metadata = list())
rcorty