
I have a large dataset (2.6M rows) with two zip codes per row and their corresponding latitudes and longitudes, and I am trying to compute the distance between them. I am primarily using the geosphere package to calculate the Vincenty ellipsoid distance between the zip codes, but it is taking a massive amount of time for my dataset. What would be a fast way to implement this?

What I tried

library(tidyverse)
library(geosphere)

zipdata <- select(fulldata,originlat,originlong,destlat,destlong)

## Very basic approach
for(i in seq_len(nrow(zipdata))){
  zipdata$dist1[i] <- distm(c(zipdata$originlat[i],zipdata$originlong[i]),
       c(zipdata$destlat[i],zipdata$destlong[i]),
       fun=distVincentyEllipsoid)
}

## Tidyverse approach 
zipdata <- zipdata%>%
 mutate(dist2 = distm(cbind(originlat,originlong), cbind(destlat,destlong), 
   fun = distHaversine))

Both of these methods are extremely slow. I understand that 2.1M rows will never be a "fast" calculation, but I think it can be made faster. I have tried the following approach on a smaller test dataset without any luck:

library(doParallel)
cores <- 15
cl <- makeCluster(cores)
registerDoParallel(cl)

test <- select(head(fulldata,n=1000),originlat,originlong,destlat,destlong)

foreach(i = seq_len(nrow(test))) %dopar% {
  library(geosphere)
  test$dist1[i] <- distm(c(test$originlat[i],test$originlong[i]),
       c(test$destlat[i],test$destlong[i]),
       fun=distVincentyEllipsoid) 
}
stopCluster(cl)

Can anyone help me out with either the correct way to use doParallel with geosphere or a better way to handle this?

Edit: Benchmarks from (some) replies

## benchmark
library(microbenchmark)
library(geodist)
zipsamp <- sample_n(zip, size = 1000000)
microbenchmark(
  dave = {
    # Dave2e
    zipsamp$dist1 <- distHaversine(cbind(zipsamp$patlong, zipsamp$patlat),
                                   cbind(zipsamp$faclong, zipsamp$faclat))
  },
  geohav = {
    zipsamp$dist2 <- geodist(cbind(long = zipsamp$patlong, lat = zipsamp$patlat),
                             cbind(long = zipsamp$faclong, lat = zipsamp$faclat),
                             paired = TRUE, measure = "haversine")
  },
  geovin = {
    zipsamp$dist3 <- geodist(cbind(long = zipsamp$patlong, lat = zipsamp$patlat),
                             cbind(long = zipsamp$faclong, lat = zipsamp$faclat),
                             paired = TRUE, measure = "vincenty")
  },
  geocheap = {
    zipsamp$dist4 <- geodist(cbind(long = zipsamp$patlong, lat = zipsamp$patlat),
                             cbind(long = zipsamp$faclong, lat = zipsamp$faclat),
                             paired = TRUE, measure = "cheap")
  },
  unit = "s", times = 100)

# Unit: seconds
# expr        min         lq       mean     median         uq        max neval  cld
# dave 0.28289613 0.32010753 0.36724810 0.32407858 0.32991396 2.52930556   100    d
# geohav 0.15820531 0.17053853 0.18271300 0.17307864 0.17531687 1.14478521   100  b  
# geovin 0.23401878 0.24261274 0.26612401 0.24572869 0.24800670 1.26936889   100   c 
# geocheap 0.01910599 0.03094614 0.03142404 0.03126502 0.03203542 0.03607961   100 a  

A simple all.equal test showed that, for my dataset, the haversine method gives results equal to the vincenty method, but has a "Mean relative difference: 0.01002573" against the "cheap" method from the geodist package.

FightMilk
    I've done some benchmarks [here](https://gist.github.com/SymbolixAU/0f65a431a53da6bdaf5d60b8ff30eba9) - on pairwise calculations. The summary is, `geosphere` is slow, and you're better off using an `Rcpp` implementation or the `geodist` package. And the [twitter thread](https://twitter.com/bikesRdata/status/1152150759661871104) which inspired it. – SymbolixAU Aug 21 '19 at 00:59
    Maybe you can adapt [this answer](https://stackoverflow.com/a/55666677/5793905) to a similar question. – Alexis Aug 21 '19 at 01:08
  • @Alexis this is great, I will give all three replies a try and post the times in an edit. – FightMilk Aug 21 '19 at 11:33
    You can even browse the code for some of the other distances [on GitHub](https://github.com/cran/geosphere/blob/master/src/dist.c). – Alexis Aug 21 '19 at 11:53
  • @SymbolixAU since I have decided to go with your suggestion of using the `geodist` function, could you post your comment as an answer? – FightMilk Aug 21 '19 at 13:39
  • If you have a working solution feel free to post it as an answer – SymbolixAU Aug 21 '19 at 21:58

3 Answers


R is a vectorized language, so functions operate over all of the elements in a vector at once. Since you are calculating the distance between the origin and destination for each row, the loop is unnecessary. The vectorized approach is approximately 1000x the performance of the loop.
Also, using distVincentyEllipsoid (or distHaversine, etc.) directly and bypassing the distm function should improve the performance further.

Without any sample data this snippet is untested.

library(geosphere)

zipdata <- select(fulldata,originlat,originlong,destlat,destlong)

## Vectorized approach
zipdata$dist1 <- distVincentyEllipsoid(cbind(zipdata$originlong, zipdata$originlat), 
       cbind(zipdata$destlong, zipdata$destlat))

Note: For most of the geosphere functions to work correctly, the proper order is longitude first, then latitude.

The reason the tidyverse approach listed above is slow is that distm is calculating the distance between every origin and every destination, which would result in a 2-million by 2-million element matrix.
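To illustrate the difference, here is a small made-up example (three fabricated coordinate pairs, untested against the asker's data):

```r
library(geosphere)

# Three made-up origin/destination pairs, longitude first then latitude
orig <- cbind(c(-87.6, -122.4, -74.0), c(41.9, 37.8, 40.7))
dest <- cbind(c(-118.2, -77.0, -71.1), c(34.1, 38.9, 42.4))

# distm() computes every origin against every destination: a 3 x 3 matrix.
# At 2M rows that would be a 2M x 2M matrix, which is why it is so slow.
dim(distm(orig, dest, fun = distHaversine))

# The vectorized call computes only the 3 row-wise distances.
length(distHaversine(orig, dest))
```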

Dave2e
  • I will test this and @Symbolix method and post an update. The insight into tidyverse working slowly was very helpful. – FightMilk Aug 21 '19 at 11:30

I used @SymbolixAU's suggestion to use the geodist package to perform the 2.1M distance calculations on my dataset. I found it to be significantly faster than the geosphere package in every test (I have added one of them to my main question). The measure = "cheap" option in geodist uses the cheap ruler method, which has low error rates for distances under 100 km; see the geodist vignette for more information. Since some of my distances exceed 100 km, I settled on using the Vincenty ellipsoid measure.
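For reference, the call I settled on looks like this, sketched here on two made-up rows (my real data uses the patlong/patlat and faclong/faclat columns shown in the benchmark above):

```r
library(geodist)

# Two made-up origin/destination pairs
pat <- cbind(long = c(-87.6, -122.4), lat = c(41.9, 37.8))
fac <- cbind(long = c(-118.2, -77.0), lat = c(34.1, 38.9))

dist_m <- geodist(pat, fac,
                  paired = TRUE,        # row-wise distances, not a cross matrix
                  measure = "vincenty") # accurate beyond 100 km, still fast
dist_m  # distances in metres, one per input row
```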

FightMilk

If you are going to use geosphere, I would use either a fast approximate method like distHaversine, or the still fast and very precise distGeo method. (The distVincenty* functions are mainly implemented for curiosity.)
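For example, on a couple of made-up coordinate pairs (note that longitude comes first):

```r
library(geosphere)

# Made-up origin/destination pairs, longitude first then latitude
p1 <- cbind(c(-87.6, -122.4), c(41.9, 37.8))
p2 <- cbind(c(-118.2, -77.0), c(34.1, 38.9))

distHaversine(p1, p2)  # fast spherical approximation
distGeo(p1, p2)        # still fast, very precise (WGS84 ellipsoid)
```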

Robert Hijmans