"Is there a faster alternative for this for loop, where I need to multiply each row once with the other rows?"

Question

This for loop takes too long to run. Is there a faster alternative?

for (i in 1:nrow(petrolStations)) {
  k <- i + 1
  if (k <= nrow(petrolStations)) {
    for (j in k:nrow(petrolStations)) {
      # distance in km between stations i and j (distHaversine returns metres)
      distancesToStation[i, j] <-
        as.data.frame(as.numeric(distm(petrolStations[i, c("lon", "lat")],
                                       petrolStations[j, c("lon", "lat")],
                                       fun = distHaversine) / 1000))
    }
  }
}

Welcome to SO. You will get an answer to your question more quickly if you provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). So, provide `petrolStations` (a subset of it or a few lines of artificial data) and correct the code you posted. — alko989, May 24 '19 at 16:00
More to the point, please provide your data in an *unambiguous* format such as `dput(head(petrolStations))`. (How data is shown on the console is typically different from how it is stored internally, and some of those differences change how things are done on the code side.) It's also strongly encouraged to list all non-standard packages being used; I'm guessing that you are using `geosphere`, so please add that to your question when you add your data. — r2evans, May 24 '19 at 16:09

score 0 · Accepted Answer · answered May 24 '19 at 16:37

I'll use my own sample data:

set.seed(2)
y <- data.frame(lon = rnorm(10, mean = -114.4069597, sd = 0.0001),
                lat = rnorm(10, mean = 43.660648, sd = 0.0002) )

I'm guessing your reason for the double loop is to avoid calculating each distance twice. Base R's dist function works the same way: it returns only the lower triangle and never computes the upper triangle. The method below mimics this behavior.

nr <- nrow(y)
out <- sapply(seq_len(nr), function(i) {
  if (i == nr) return(c(rep(NA_real_, i - 1), 0))
  c(rep(NA_real_, i - 1), 0,
    geosphere::distHaversine(y[i,,drop = FALSE],
                             y[(i+1):nr,,drop = FALSE]))
})
out
#         [,1]   [,2]  [,3]  [,4]  [,5]  [,6]  [,7]   [,8]  [,9] [,10]
#  [1,]  0.000     NA    NA    NA    NA    NA    NA     NA    NA    NA
#  [2,] 15.285  0.000    NA    NA    NA    NA    NA     NA    NA    NA
#  [3,] 26.943 32.620  0.00    NA    NA    NA    NA     NA    NA    NA
#  [4,] 32.500 46.234 26.20  0.00    NA    NA    NA     NA    NA    NA
#  [5,] 31.085 17.949 50.25 63.39  0.00    NA    NA     NA    NA    NA
#  [6,] 61.315 73.312 44.29 30.08 91.15  0.00    NA     NA    NA    NA
#  [7,] 16.503  4.798 29.18 45.20 21.10 71.17  0.00     NA    NA    NA
#  [8,] 10.014 21.336 17.54 25.00 38.90 52.34 20.26  0.000    NA    NA
#  [9,] 26.722 14.509 31.46 52.13 23.87 75.49 10.71 28.178  0.00    NA
# [10,]  6.114 12.508 23.04 33.73 30.06 61.12 12.05  8.864 21.43     0
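
For comparison with the lower-triangle layout above, here is a minimal sketch of how base R's dist stores its result. The four points are made up for illustration, and dist computes Euclidean distance by default, so it is not a substitute for a Haversine calculation on lon/lat data.

```r
# Four made-up points in a plane (illustrative data, not from the question)
pts <- matrix(c(0, 0,
                3, 4,
                6, 8,
                0, 5),
              ncol = 2, byrow = TRUE)

d <- dist(pts)           # Euclidean by default; only the lower triangle is stored
n_pairs <- length(d)     # one value per unordered pair of rows

# Expand to a full symmetric matrix only when it is actually needed
full <- as.matrix(d)
full[2, 1]               # distance between rows 2 and 1
```

Storing only the unordered pairs roughly halves the work and memory, which is the same saving the sapply approach aims for.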

Arbitrary verification:

geosphere::distHaversine(y[8,], y[2,])
# [1] 21.33617

This is faster than your code because it takes advantage of vectorization. geosphere::distHaversine can calculate multiple distances in one call:

between consecutive points in p1 (if its second argument is missing);
between all points in p1 and the corresponding points in p2 (both p1 and p2 have the same number of rows); or
as I'm doing above, between a single point and many points.
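
The three calling patterns can be sketched as follows. This assumes the geosphere package is installed; the three coordinates are made up for illustration and are not the question's data.

```r
library(geosphere)

# Three made-up lon/lat points (illustrative only)
p <- cbind(lon = c(-114.40, -114.41, -114.42),
           lat = c(43.66, 43.67, 43.68))

# 1. Second argument missing: distances between consecutive rows (n - 1 values)
d_seq <- distHaversine(p)

# 2. Equal row counts: row k of the first argument against row k of the second
d_pair <- distHaversine(p[1:2, , drop = FALSE], p[2:3, , drop = FALSE])

# 3. One point against many: the single point is paired with every row
d_one <- distHaversine(p[1, , drop = FALSE], p[2:3, , drop = FALSE])
```

Each call returns a plain numeric vector in metres (with the default radius), so one call per row replaces the whole inner loop.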

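The c(rep(NA_real_, i - 1), 0, ...) pads each column so that the upper triangle is NA and the diagonal is 0. The first conditional (i == nr) is a special case that keeps the matrix square: for the last column there are no rows below the diagonal, and (i+1):nr would count backward instead of being empty, so that column is simply all NA followed by a single 0.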
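
The reason the last column needs its own branch can be seen with a small base-R sketch (nr is set to 3 here purely for illustration):

```r
nr <- 3
i  <- nr                                # the last column

# The colon operator counts downward when the start exceeds the end,
# so this is NOT an empty index vector
backward <- (i + 1):nr

# An offset built from seq_len() is empty in the same situation
safe <- i + seq_len(nr - i)

# What the special case returns for the last column: NAs, then the 0 diagonal
column <- c(rep(NA_real_, i - 1), 0)
```

Indexing y with the backward sequence would request a row past the end of the data, which is why the early return is there.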

If you need the upper-triangle populated as well:

out[upper.tri(out)] <- t(out)[upper.tri(out)]
out
#         [,1]   [,2]  [,3]  [,4]  [,5]  [,6]   [,7]   [,8]  [,9]  [,10]
#  [1,]  0.000 15.285 26.94 32.50 31.08 61.31 16.503 10.014 26.72  6.114
#  [2,] 15.285  0.000 32.62 46.23 17.95 73.31  4.798 21.336 14.51 12.508
#  [3,] 26.943 32.620  0.00 26.20 50.25 44.29 29.178 17.539 31.46 23.037
#  [4,] 32.500 46.234 26.20  0.00 63.39 30.08 45.201 24.996 52.13 33.730
#  [5,] 31.085 17.949 50.25 63.39  0.00 91.15 21.096 38.903 23.87 30.059
#  [6,] 61.315 73.312 44.29 30.08 91.15  0.00 71.166 52.336 75.49 61.116
#  [7,] 16.503  4.798 29.18 45.20 21.10 71.17  0.000 20.257 10.71 12.052
#  [8,] 10.014 21.336 17.54 25.00 38.90 52.34 20.257  0.000 28.18  8.864
#  [9,] 26.722 14.509 31.46 52.13 23.87 75.49 10.706 28.178  0.00 21.435
# [10,]  6.114 12.508 23.04 33.73 30.06 61.12 12.052  8.864 21.43  0.000
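
As a possible cross-check (this is not the method used above), geosphere::distm called with a single set of points should produce the full symmetric matrix in one call. The sketch below assumes geosphere is installed and reuses the sample data from this answer. Note that distHaversine returns metres with its default radius, whereas the question's code divides by 1000 to get kilometres.

```r
library(geosphere)

# Same sample data as in the answer
set.seed(2)
y <- data.frame(lon = rnorm(10, mean = -114.4069597, sd = 0.0001),
                lat = rnorm(10, mean = 43.660648, sd = 0.0002))

# Full 10 x 10 matrix of pairwise Haversine distances, in metres
full_m <- distm(y, fun = distHaversine)

# Convert to kilometres, matching the units the question asks for
full_km <- full_m / 1000

# Spot check one pair against a direct call
pair_8_2 <- distHaversine(y[8, ], y[2, ])
```

If the two approaches agree on your data, either can be used; the sapply version keeps explicit control over which triangle is filled.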
