Apologies for the long-winded answer, but there's a lot going on here. First, some housekeeping.
At each iteration of the outer loop, you seem to want to calculate the mean of all the pairwise distances across two random samples of size 4000 and 5000 (so, the mean of 20 million distances). But your code does not do that. distfunc(...) returns a vector of length 5000. When you try to assign that to distvector[i], you are discarding all but the first distance (this is why you get all those warnings). So at each step you are calculating 5000 distances, discarding 4999 of them, and repeating that process 4000 times. I can only assume this is not your aim, so in the code below I changed that.
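You can see the mechanism in a toy example (nothing to do with your data): assigning a vector of length > 1 to a single element of a numeric vector silently keeps only the first value and issues exactly that warning.

v <- numeric(3)
v[1] <- c(10, 20, 30)   # only 10 is stored
# Warning message:
#   number of items to replace is not a multiple of replacement length
v
# [1] 10  0  0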
Also, your distfunc(...) looks like an implementation of the Haversine distance formula, but it isn't. The Haversine formula requires long/lat in radians, but you seem to be passing long/lat in degrees. One upshot of this is that your results and the results you would get using R's built-in distHaversine(...) (in the geosphere package) or spDistsN1(...) (in the sp package) are not comparable. It wasn't clear to me what exactly you are trying to accomplish, so I did not change your distfunc(...), but you might have to.
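For reference, here is a minimal sketch of a vectorized Haversine distance that does the degrees-to-radians conversion first; the function name and the 6371 km earth radius are my choices, not anything from your code, so treat it only as a checking aid (distHaversine(...) uses a slightly different default radius, so the numbers will differ a little).

# vectorized Haversine distance in km; inputs in degrees
haversine.km <- function(lon1, lat1, lon2, lat2, R=6371) {
  lon1 <- lon1*pi/180; lat1 <- lat1*pi/180
  lon2 <- lon2*pi/180; lat2 <- lat2*pi/180
  a <- sin((lat2-lat1)/2)^2 + cos(lat1)*cos(lat2)*sin((lon2-lon1)/2)^2
  2*R*asin(pmin(1, sqrt(a)))
}

# compare with geosphere (takes degrees, returns metres)
library(geosphere)
haversine.km(-76.0, 37.0, -76.5, 37.5)
distHaversine(c(-76.0, 37.0), c(-76.5, 37.5))/1000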
Now to the question of how to speed this up.
In this kind of situation it is almost always useful to profile the code. You can do this with Rprof() and summaryRprof().
Rprof()
nsampling <- 10 # just ten simulations...
distvector <- numeric(nsample1)
meandistv <- numeric(nsampling)
for (n in seq_len(nsampling)) { # loop for simulation
sample1 <- pop[sample(npop,nsample1),]
sample2 <- pop[sample(npop,nsample2),]
for (i in seq_len(nsample1)) {
# note change: mean(distfunc(...)), not distfunc(...)
distvector[i] <- mean(distfunc(sample1[i,1],sample1[i,2],sample2[,1],sample2[,2]))
}
meandistv[n] <- mean(distvector)
}
Rprof(NULL)   # stop profiling
summaryRprof()
# ...
# $by.total
# total.time total.pct self.time self.pct
# "mean" 36.46 99.95 0.50 1.37
# "distfunc" 35.68 97.81 3.88 10.64
# "sin" 11.84 32.46 11.84 32.46
# "cos" 7.44 20.39 7.44 20.39
# "pmin" 7.18 19.68 5.00 13.71
So the loop is spending 98% of its time inside distfunc(...). Does this make sense? Unfortunately, yes: distfunc(...) calls either sin(.) or cos(.) a total of 4 times for each distance calculation, so for each n that is 4000*5000*4 = 80 million calls.
X <- rnorm(4000*5000*4)
system.time(sin(X))
# user system elapsed
# 4.25 0.09 4.35
So for the 100 iterations in your example, just calculating the sin/cos values will take about 430 sec. These functions are fully vectorized and written in C, so it is not likely that you will be able to improve much on that in R itself. One option in principle would be to write a version of distfunc(...) that calculates all 20 million distances in compiled C/C++ code, and then call that using the Rcpp package. In fact, one of the answers to your earlier question did exactly that. Did you try it?
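To give a flavour of that approach, here is a minimal sketch (this is not your distfunc(...); the function name meanHaversine and the 6371 km radius are assumptions of mine) of a small C++ kernel, compiled with Rcpp::cppFunction(...), that returns the mean Haversine distance from one point to a whole sample in a single call:

library(Rcpp)

cppFunction('
double meanHaversine(double lon1, double lat1,
                     NumericVector lon2, NumericVector lat2) {
  const double R = 6371.0;                       // earth radius in km (assumed)
  const double d2r = 3.14159265358979323846/180.0;
  double total = 0.0;
  int n = lon2.size();
  for (int j = 0; j < n; j++) {
    double dlat = (lat2[j] - lat1)*d2r/2.0;
    double dlon = (lon2[j] - lon1)*d2r/2.0;
    double a = sin(dlat)*sin(dlat) +
               cos(lat1*d2r)*cos(lat2[j]*d2r)*sin(dlon)*sin(dlon);
    total += 2.0*R*asin(sqrt(a));
  }
  return total/n;                                // mean of the n distances
}')

# the inner loop would then collapse to something like
# distvector[i] <- meanHaversine(sample1[i,1], sample1[i,2], sample2[,1], sample2[,2])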
Another option is parallel processing. So here is a way to implement your algorithm using multiple cores. It appears that you have a dual core system so this may not help you all that much.
# your solution (slightly modified)
system.time({
distvector <- numeric(nsample1)
meandistv <- numeric(nsampling)
for (n in seq_len(nsampling)) { # loop for simulation
sample1 <- pop[sample(npop,nsample1),]
sample2 <- pop[sample(npop,nsample2),]
for (i in seq_len(nsample1)) {
distvector[i] <- mean(distfunc(sample2[,1],sample2[,2],sample1[i,1],sample1[i,2]))
}
meandistv[n] <- mean(distvector)
}
})
# user system elapsed
# 551.06 0.13 554.70
# parallel processing solution
library(foreach) # for foreach(...)
library(snow) # for makeCluster(...)
library(doSNOW) # for registerDoSNOW(...)
cl <- makeCluster(8,type="SOCK") # create cluster
registerDoSNOW(cl) # register the cluster
system.time({
meandistv <- foreach(n=seq(nsampling), .inorder=FALSE, .packages=c("foreach","iterators"), .combine=c) %dopar% {
sample1 <- pop[sample(nrow(pop),nsample1),]
sample2 <- pop[sample(nrow(pop),nsample2),]
dists <- foreach(row=iter(sample1,by="row")) %do% {
mean(distfunc(sample2[,1],sample2[,2],row[1],row[2]))
}
mean(unlist(dists))
}
})
# user system elapsed
# 0.30 0.06 224.94
stopCluster(cl)
So you can see that with a 4-core processor (8 threads), I was able to improve execution time by about a factor of 2.5.
Finally, looking at your simulated data, it appears that you are interested in geocodes within about +/- 1/2 degree of (-76,37) (Norfolk, VA??). If your data is really in that narrow a range, you could find a planar transformation appropriate for the centerpoint (EPSG 32147 comes pretty close), transform all the points to that CRS, and then use Euclidean distance on the transformed data. That would likely be much faster than anything described here.
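For example, here is a minimal sketch of that last idea using sp and rgdal (it assumes pop is a two-column matrix of long/lat in degrees on WGS84, and that EPSG 32147 is acceptable for your purpose, which you would need to verify):

library(sp)
library(rgdal)   # supplies the spTransform(...) methods for sp objects

pts      <- SpatialPoints(pop, proj4string=CRS("+proj=longlat +datum=WGS84"))
pts.proj <- spTransform(pts, CRS("+init=epsg:32147"))
xy       <- coordinates(pts.proj)        # planar x/y coordinates

# Euclidean distance, in the units of the projected CRS, between two points
euclid <- function(p, q) sqrt((p[1]-q[1])^2 + (p[2]-q[2])^2)
euclid(xy[1,], xy[2,])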