10

I am working with a large vector of integers in R (around 10 million integers), and I need to find every distinct pair of integers from this vector that differ by 500 or less and make a histogram of their differences (i.e. for each pair, the second one minus the first one).

Here's the totally-unvectorized code for doing what I want extremely slowly:

# Generate some random example data
V <- round(rnorm(100) * 1000)

# Prepare the histogram
my.hist <- rep(0, 500)
names(my.hist) <- as.character(seq(1,500))
for (x1 in V) {
    for (x2 in V) {
        difference = x2 - x1
        if (difference > 0 && difference <= 500) {
            my.hist[difference] = my.hist[difference] + 1
        }
    }
}

(Assume that every integer is unique, so that difference > 0 bit is OK. This is allowed because I actually don't care about any cases where the difference is zero.)

Here's some code that vectorizes the inner loop:

my.hist2 <- rep(0, 500)
names(my.hist2) <- as.character(seq(1,500))
for (x1 in V) {
    differences <- V[V > x1 & V <= x1+500] - x1
    difftable <- table(differences)
    my.hist2[names(difftable)] = my.hist2[names(difftable)] + difftable
}

This is certainly faster than the first version. However even this variant is already too slow when V contains only 500000 elements (half a million).

I can do this without any explicit loops as follows:

X <- combn(V, 2)
# X is a matrix with two rows where each column represents a pair
diffs <- abs(X[2,] - X[1,])
my.hist3 <- table(diffs[diffs <= 500])

However the matrix X will contain 10e6 * (10e6 - 1) / 2, or about 50,000,000,000,000 columns, which might be a problem.

So is there a way to do this without explicit looping (too slow) or expanding a matrix of all the pairs (too big)?

If you're wondering why I need to do this, I'm implementing this: http://biowhat.ucsd.edu/homer/chipseq/qc.html#Sequencing_Fragment_Length_Estimation

Ryan C. Thompson
  • 40,856
  • 28
  • 97
  • 159
  • do i understand this correctly that you have 50,000,000,000,000 columns with 10,000,000 elements each? I'd love to see this solved :) The magnitude makes it an awesome problem. – ma cılay Mar 02 '12 at 02:38
  • @user306848: I think Ryan means that `combn` will return a 2 x 50E12 matrix of pairs (all possible combinations). – jbaums Mar 02 '12 at 02:45
  • Yes, @jbaums is correct. – Ryan C. Thompson Mar 02 '12 at 03:44
  • What does the range/distribution of your vector look like? How many unique integers are there? – frankc Mar 02 '12 at 18:49
  • Assume no two integers are equal, and they range from, let's say, 0 to 100,000,000. The distribution within that range is expected to be highly non-uniform, with many clusters of similar values. (The clusters represent coordinates in the human genome where a particular protein attaches itself.) – Ryan C. Thompson Mar 02 '12 at 19:11

1 Answers1

16

One possible improvement is to sort the data: the pairs (i,j) for which the distance is under 500 will then be close to the diagonal, and you do not have to explore all the values.

The code would look like this (it is still very slow).

n <- 1e5
k <- 500
V <- round(rnorm(n) * n * 10)
V <- as.integer(V)
V <- sort(V)
h <- rep(0,k)

for(i in 1:(n-1)) {
  for(j in (i+1):n) {
    d <- V[j] - V[i]
    if( d > k ) break
    if( d > 0 ) h[d] <- h[d]+1
  }
}

Edit: If you want something much faster, you can write the loop in C. It takes 1s for your 10 million elements.

n <- 10e6
k <- 500
V <- round(rnorm(n) * n * 10)
V <- as.integer(V)
V <- sort(V)
h <- rep(0,k)

library(inline)
sig <- signature(n="integer", v="integer", k="integer", h="integer")
code <- "
  for( int i = 0; i < (*n) - 1; i++ ) {
    for( int j = i + 1; j < *n; j++ ) {
      int d = v[j] - v[i];
      if( d > *k ) break;
      if( d > 0 ) h[d-1]++;
    }
  }
"
f <- cfunction( sig, code, convention=".C" )
h <- f(n,V,k,h)$h
Ryan C. Thompson
  • 40,856
  • 28
  • 97
  • 159
Vincent Zoonekynd
  • 31,893
  • 5
  • 69
  • 78
  • Wish I could upvote more. I had the same strategy in mind but your implementation is _much_ better. – IRTFM Mar 02 '12 at 02:59
  • Does the C inlining functionality not support function arguments and return statements? – Ryan C. Thompson Mar 02 '12 at 04:56
  • Actually, I'm a little confused as to how things are getting passed from R to C. In the C code, are the variables n, v, k, and h pointers to C ints and arrays of ints, or to R vectors? Or is the in-memory representation of an R integer vector compatible with a C array of ints, so it's actually both? – Ryan C. Thompson Mar 02 '12 at 05:06
  • 1
    There are two conventions to exchange data between R and C: `.Call`, which uses R objects (of type SEXPR in C), and `.C`, which uses pointers (to C arrays). If you just have arrays of numbers, the `.C` convention is easier to use, but you also have to pass the size of the arrays (this is usually done in a wrapper function, so that the end-user never sees it). In most cases, it is easier to create the objects containing the results in R, and populate them in C. There is more information in the [Writing R extensions](http://cran.r-project.org/doc/manuals/R-exts.pdf) manual. – Vincent Zoonekynd Mar 02 '12 at 05:41
  • Ok, after reading that manual, I think I understand. R vectors are converted to C arrays when passed to a `cfunction` using the `.C` convention. What about the return value? From a toy example that I created, it looks like it returns a list of the variables in the signature, converted back to R vectors. Is this correct? – Ryan C. Thompson Mar 02 '12 at 18:18
  • Yay, after fixing a small typo (changing i to j in the second loop), the inline C code works fantastically. Thanks! – Ryan C. Thompson Mar 02 '12 at 19:44