I am working with a large vector of integers in R (around 10 million integers), and I need to find every distinct pair of integers from this vector that differ by 500 or less and make a histogram of their differences (i.e. for each pair, the second one minus the first one).
Here's the totally-unvectorized code for doing what I want extremely slowly:
# Generate some random example data
V <- round(rnorm(100) * 1000)
# Prepare the histogram
my.hist <- rep(0, 500)
names(my.hist) <- as.character(seq(1,500))
for (x1 in V) {
for (x2 in V) {
difference = x2 - x1
if (difference > 0 && difference <= 500) {
my.hist[difference] = my.hist[difference] + 1
}
}
}
(Assume that every integer is unique, so that difference > 0
bit is OK. This is allowed because I actually don't care about any cases where the difference is zero.)
Here's some code that vectorizes the inner loop:
my.hist2 <- rep(0, 500)
names(my.hist2) <- as.character(seq(1,500))
for (x1 in V) {
differences <- V[V > x1 & V <= x1+500] - x1
difftable <- table(differences)
my.hist2[names(difftable)] = my.hist2[names(difftable)] + difftable
}
This is certainly faster than the first version. However even this variant is already too slow when V
contains only 500000 elements (half a million).
I can do this without any explicit loops as follows:
X <- combn(V, 2)
# X is a matrix with two rows where each column represents a pair
diffs <- abs(X[2,] - X[1,])
my.hist3 <- table(diffs[diffs <= 500])
However the matrix X will contain 10e6 * (10e6 - 1) / 2
, or about 50,000,000,000,000 columns, which might be a problem.
So is there a way to do this without explicit looping (too slow) or expanding a matrix of all the pairs (too big)?
If you're wondering why I need to do this, I'm implementing this: http://biowhat.ucsd.edu/homer/chipseq/qc.html#Sequencing_Fragment_Length_Estimation