
This question is a follow-up to Double for-loop operation in R (with an example).

I obtain two random samples, calculate pair-wise distances, and get the mean of them. Then, I repeat this many times, and store the means into a vector.

(I thank docendo discimus for the "semi-vectorization" method for calculating distances.)

ptm <- proc.time()

############################################
## (1) Define (vectorized) distance function 
############################################

distfunc <- function(lon1,lat1,lon2,lat2) {
  a <- sin((lat2-lat1)/2)^2+cos(lat1)*cos(lat2)*sin((lon2-lon1)/2)^2
  c <- 2 * asin(pmin(1,sqrt(a)))
  return(6.371*c)
}

#########################################
## (2) Pseudo Population data
#########################################
npop <- 200000
pop <- cbind(runif(npop, min=0, max=1)-76,runif(npop, min=0, max=1)+37)

#########################################
## (3) The number of simulations
#########################################
nsampling <- 100

########################
## (4) Initiate vectors
########################
nsample1 <- 4000
nsample2 <- 5000
distvector <- NULL
meandistv <- NULL

########################################################
## (5) Loop for simulation (# = nsampling)
########################################################
for (n in seq_len(nsampling)) { # loop for simulation

  # Pseudo sample data sets
  sample1 <- pop[sample(npop,nsample1),]
  sample2 <- pop[sample(npop,nsample2),]

  for (i in seq_len(nsample1)) {
    distvector[i] = distfunc(sample1[i,1],sample1[i,2],sample2[,1],sample2[,2])
  }

  meandistv[n] <- mean(distvector)

}

proc.time() - ptm

The computation took roughly 10 minutes:

> proc.time() - ptm
   user  system elapsed 
 629.74    0.14  632.35

Question: The number of simulations here is just 100. I actually need to raise the number of simulations to 1000; 2000 would be better. Is there any room for speeding this up? I am currently researching some parallel computing techniques. By the way, I am worried that apply-type approaches consume a significant amount of RAM and are probably not feasible for big data with a large number of simulations.

Bill TP
  • Without looking into the details: Pre-allocate your result vectors. E.g., `distvector <- numeric(nsample1)` and `meandistv <- numeric(nsampling)` . Most of the time is spent growing those vectors. – Roland Dec 31 '14 at 19:13
  • Right. Thanks, Roland. – Bill TP Dec 31 '14 at 19:28
  • Have you tried replacing those loops with a data.table cross-join? You could then use the := operator to calculate the distance in a 5th column rowwise. – Serban Tanasa Dec 31 '14 at 19:49
  • Too bad this was marked as duplicate. I had a look at the other answer and it doesn't really help at all in your case. It looks like you are calculating Haversine distance. There are packages for that in R that might be faster and certainly can avoid all these loops. Have a look at `distHaversine(...)` in the `geosphere` package. Then, something like `mean(apply(sample1,1,function(row)distHaversine(row,sample2)))` would replace the inner loop. I could give you a better answer if the question wasn't embargoed. – jlhoward Jan 01 '15 at 01:11
  • Thank you, jlhoward. I will try the `apply` family, and come back here again. – Bill TP Jan 01 '15 at 05:22
  • 1
    @jlhoward I've reopened it. Feel free to post an answer. – Roland Jan 01 '15 at 15:21

1 Answer


Apologies for the long-winded answer, but there's a lot going on here. First, some housekeeping.

At each iteration of the outer loop, you seem to want to calculate the mean of all the pairwise distances across two random samples of size 4000 and 5000 (so, the mean of 20 million distances). But your code does not do that. distfunc(...) returns a vector of length 5000. When you assign that to distvector[i] you are discarding all but the first distance (this is why you get all those warnings). So at each step you are calculating 5000 distances, discarding 4999 of them, and repeating that process 4000 times. I can only assume this is not your aim, so in the code below I changed that.
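
To see the discarding in miniature (a toy illustration of the assignment behaviour, nothing to do with your data):

x <- numeric(3)
x[1] <- c(10, 20, 30)   # only the first value (10) is kept
# Warning message:
# In x[1] <- c(10, 20, 30) :
#   number of items to replace is not a multiple of replacement length
x
# [1] 10  0  0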

Also, your distfunc(...) looks like an implementation of the Haversine distance formula, but it isn't. The Haversine formula requires long/lat in radians, but you (seem to be...) passing long/lat in degrees. One upshot of this is that your results are not comparable with the results you would get from distHaversine(...) (in the geosphere package) or spDistsN1(...) (in the sp package). I wasn't clear on what exactly you are trying to accomplish, so I did not change your distfunc(...), but you might have to.
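
For reference, here is a sketch of what a radian-aware version might look like, assuming your inputs are in degrees and that the 6.371 in your function is meant to be the Earth radius in thousands of km; I've used 6371 km and called it distfunc_deg just to keep it separate from your function:

# Sketch only: same Haversine formula, but converting degrees to radians first
# and returning kilometres (Earth radius ~6371 km)
distfunc_deg <- function(lon1, lat1, lon2, lat2) {
  rad  <- pi / 180
  lon1 <- lon1 * rad; lat1 <- lat1 * rad
  lon2 <- lon2 * rad; lat2 <- lat2 * rad
  a <- sin((lat2 - lat1)/2)^2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1)/2)^2
  2 * 6371 * asin(pmin(1, sqrt(a)))
}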

Now to the question of how to speed this up.

In this kind of situation it is almost always useful to profile the code. You can do this with Rprof() and summaryRprof().

Rprof()
nsampling <- 10   # just ten simulations...
distvector <- numeric(nsample1)
meandistv  <- numeric(nsampling)
for (n in seq_len(nsampling)) { # loop for simulation
  sample1 <- pop[sample(npop,nsample1),]
  sample2 <- pop[sample(npop,nsample2),]
  for (i in seq_len(nsample1)) {
    # note change: mean(distfunc(...)), not distfunc(...)
    distvector[i] <- mean(distfunc(sample1[i,1],sample1[i,2],sample2[,1],sample2[,2]))
  }
  meandistv[n] <- mean(distvector)
}  
summaryRprof()
# ...
# $by.total
#                    total.time total.pct self.time self.pct
# "mean"                  36.46     99.95      0.50     1.37
# "distfunc"              35.68     97.81      3.88    10.64
# "sin"                   11.84     32.46     11.84    32.46
# "cos"                    7.44     20.39      7.44    20.39
# "pmin"                   7.18     19.68      5.00    13.71

So the loop is spending 98% of its time inside distfunc(...). Does this make sense? Unfortunately, yes. distfunc(...) calls sin(.) or cos(.) a total of 4 times for each distance calculation. So, for each n, that is 4000*5000*4 = 80 million trig calls.

X <- rnorm(4000*5000*4)
system.time(sin(X))
#    user  system elapsed 
#    4.25    0.09    4.35 

So for the 100 iterations in your example, just calculating the sin/cos values will take about 430 sec. Of course these functions are fully vectorized and written in C, so it is not likely that you will be able to improve much on that. One option in principle would be to write a version of distfunc(...) to calculate all 20 million distances in C, and then call that using the Rcpp package. In fact one of the answers to your earlier question did exactly that. Did you try it??
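
To give you an idea, here is a rough sketch of that kind of function using Rcpp::cppFunction (my own sketch, not the code from the other answer; meanDist is just a name I made up, and it keeps your formula and the 6.371 factor as-is):

library(Rcpp)   # needs a working C++ compiler

cppFunction('
double meanDist(NumericVector lon1, NumericVector lat1,
                NumericVector lon2, NumericVector lat2) {
  double total = 0.0;
  int n1 = lon1.size(), n2 = lon2.size();
  for (int i = 0; i < n1; i++) {
    for (int j = 0; j < n2; j++) {
      double a = pow(sin((lat2[j] - lat1[i]) / 2), 2) +
                 cos(lat1[i]) * cos(lat2[j]) *
                 pow(sin((lon2[j] - lon1[i]) / 2), 2);
      double s = sqrt(a);
      if (s > 1.0) s = 1.0;            // same guard as pmin(1, sqrt(a))
      total += 6.371 * 2.0 * asin(s);  // accumulate this pairwise distance
    }
  }
  return total / ((double)n1 * n2);    // mean of all n1 * n2 distances
}')

# then the whole inner loop collapses to:
# meandistv[n] <- meanDist(sample1[,1], sample1[,2], sample2[,1], sample2[,2])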

Another option is parallel processing. So here is a way to implement your algorithm using multiple cores. It appears that you have a dual core system so this may not help you all that much.

# your solution (slightly modified)
system.time({
  distvector <- numeric(nsample1)
  meandistv  <- numeric(nsampling)
  for (n in seq_len(nsampling)) { # loop for simulation
    sample1 <- pop[sample(npop,nsample1),]
    sample2 <- pop[sample(npop,nsample2),]
    for (i in seq_len(nsample1)) {
      distvector[i] <- mean(distfunc(sample2[,1],sample2[,2],sample1[i,1],sample1[i,2]))
    }
    meandistv[n] <- mean(distvector)
  }
})
#    user  system elapsed 
#  551.06    0.13  554.70 

# parallel processing solution
library(foreach)   # for foreach(...)
library(snow)      # for makeCluster(...)
library(doSNOW)    # for registerDoSNOW(...)

cl <- makeCluster(8,type="SOCK")  # create cluster
registerDoSNOW(cl)                # register the cluster

system.time({
  meandistv <- foreach(n=seq(nsampling), .inorder=FALSE, .packages=c("foreach","iterators"), .combine=c) %dopar% {
    sample1 <- pop[sample(nrow(pop),nsample1),]
    sample2 <- pop[sample(nrow(pop),nsample2),]
    dists   <- foreach(row=iter(sample1,by="row")) %do% {
      mean(distfunc(sample2[,1],sample2[,2],row[1],row[2]))
    }
    mean(unlist(dists))
  }
})
#    user  system elapsed 
#    0.30    0.06  224.94 
stopCluster(cl)

So you can see that with a 4 core processor (8 threads), I was able to improve execution time by about a factor of 2.5.

Finally, looking at your simulated data, it appears that you are interested in geocodes within about +/- 1/2 degree of (-76,37) (Norfolk, VA??). If your data is really in that narrow a range, you could find a planar transformation appropriate for the centerpoint (EPSG 32147 comes pretty close), transform all the points to that CRS, and then use Euclidean distance on the transformed data. That would likely be much faster than anything described here.
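
Here is a rough, untested sketch of that idea using the sp and rgdal packages (treating your simulated long/lat as WGS84 and projecting to EPSG:32147; shown for a single draw, so you would still wrap the sampling part in your simulation loop):

library(sp)      # SpatialPoints, CRS, coordinates
library(rgdal)   # spTransform with EPSG support

pts <- SpatialPoints(pop, proj4string = CRS("+init=epsg:4326"))  # long/lat, WGS84
xy  <- coordinates(spTransform(pts, CRS("+init=epsg:32147")))    # planar, metres

s1 <- xy[sample(npop, nsample1), ]
s2 <- xy[sample(npop, nsample2), ]

# plain Euclidean distances -- no trig at all
distvector <- numeric(nsample1)
for (i in seq_len(nsample1)) {
  distvector[i] <- mean(sqrt((s2[,1] - s1[i,1])^2 + (s2[,2] - s1[i,2])^2))
}
mean(distvector)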

jlhoward