7

I have a Run length encoded vector representing some value at every position on the genome, in order. As a toy example suppose I had just one chromosome of length 10, then I would have a vector looking like

library(GenomicRanges)

set.seed(1)
toyData = Rle(sample(1:3,10,replace=TRUE))

I would like to coerce this into a GRanges object. The best I can come up with is

gr = GRanges('toyChr',IRanges(cumsum(c(0,runLength(toyData)[-nrun(toyData)])),
                              width=runLength(toyData)),
             toyData = runValue(toyData))

which works, but is quite slow. Is there a faster way to construct the same object?

The Unfun Cat
  • 29,987
  • 31
  • 114
  • 156
user1356855
  • 565
  • 2
  • 5
  • 11
  • 1
    you can use `start(toyData)-1` to get the starts of the interval but it doesn't improve speed. – NicE Sep 13 '16 at 14:46
  • @NicE Thanks for tip, even if it's not faster it is much clearer and concise. – user1356855 Sep 15 '16 at 13:59
  • 1
  • @user1356855, what are some typical chromosome lengths you would encounter? Also, would 3 be enough variation in your real-world application (e.g. could you have `sample(1:15,10^8,replace=TRUE)`)? – Joseph Wood Sep 19 '16 at 08:56
  • @JosephWood Yeah, the values stored with genomic data are often real numbers so 3 would not be enough... But I would accept almost any answer. Longest genome: 247,249,719 for example? Chr1 humans... – The Unfun Cat Sep 19 '16 at 09:03
  • Could it be, that the problem is not transforming the `Rle` object. It is the `IRanges`/`GRanges` function which is in general the slow part? – Roman Sep 19 '16 at 10:05
  • @Jimbou, that is exactly what I'm thinking. I'm wondering how the folks at [https://bioconductor.org/](https://bioconductor.org/) would handle this. – Joseph Wood Sep 19 '16 at 12:10

1 Answers1

4

As @TheUnfunCat pointed out, the OP's solution is pretty solid. The solution below is only moderately faster than the original solution. I tried almost every combination of base R and couldn't beat the efficiency Rle from the S4Vectors package, thus I resorted to Rcpp. Here is the main function :

GenomeRcpp <- function(v) {
    x <- WhichDiffZero(v)
    m <- v[c(1L,x+1L)]
    s <- c(0L,x)
    e <- c(x,length(v))-1L
    GRanges('toyChr',IRanges(start = s, end = e), toyData = m)
}

The WhichDiffZero is the Rcpp function that pretty much does the exact same thing as which(diff(v) != 0) in base R. Much credit goes to @G.Grothendieck.

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector WhichDiffZero(IntegerVector x) {
    int nx = x.size()-1;
    std::vector<int> y;
    y.reserve(nx);
    for(int i = 0; i < nx; i++) {
        if (x[i] != x[i+1]) y.push_back(i+1);
    }
    return wrap(y);
}

Below are some benchmarks:

set.seed(437)
testData <- do.call(c,lapply(1:10^5, function(x) rep(sample(1:50, 1), sample(1:30, 1))))

microbenchmark(GenomeRcpp(testData), GenomeOrig(testData))
Unit: milliseconds
                expr      min       lq     mean   median       uq      max neval cld
GenomeRcpp(testData) 20.30118 22.45121 26.59644 24.62041 27.28459 198.9773   100   a
GenomeOrig(testData) 25.11047 27.12811 31.73180 28.96914 32.16538 225.1727   100   a

identical(GenomeRcpp(testData), GenomeOrig(testData))
[1] TRUE

I've been working on this off and on for the past few days and I'm definitely not satisfied. I'm hoping that someone will take what I've done (as it is a different approach) and create something much better.

Community
  • 1
  • 1
Joseph Wood
  • 7,077
  • 2
  • 30
  • 65
  • This might mean the OPs metadata contained non-vectorized data? Having objects in "vectors" is possible in pandas, dunno about R... – The Unfun Cat Sep 19 '16 at 09:52
  • I have to admit, I'm not completely satisfied either. It seems like the GRanges object is sufficiently similar to a Rle vector (for the one chromosome) that the construction step should be basically instant. Instead it's the slowest part of my code. Clearly I don't understand the internals well enough to know why this is wrong/how to make it faster. The Rcpp alternative is neat though and does give some extra speed. Thanks! – user1356855 Sep 19 '16 at 15:01