I'm doing a SQL-style join of two large, fixed-size (lat, long) coordinate datasets by nearest neighbor search. Currently I'm using dplyr and data.table to do this. How can I optimize and parallelize my code to minimize absolute runtime?
Previous attempts using native Python, pandas, and multiprocessing ended up being very slow. My current solution, using data.table to construct a table of nearest neighbors and dplyr to join based on that table, is the quickest, but it is still too slow.
library(dplyr)
library(data.table)
library(geosphere)
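# dummy data: runif() stands in for the real coordinate files (source is ~89M rows, dest is ~50k rows)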
source <- data.table(lat = runif(1e3), long = runif(1e3)) %>% mutate(nrow = row_number())
dest <- data.table(lat = runif(5e4), long = runif(5e4)) %>% mutate(ind = row_number())
dest_mat <- as.matrix(dest[, c('long', 'lat')])
setDT(source)
# function that returns the index of the closest matching point in dest
mindist_ind <- function(source_lat, source_long) {
  which.min(distHaversine(c(source_long, source_lat), dest_mat))
}
nn_inds <- source[, j = list(ind = mindist_ind(lat, long)), by = 1:nrow(source)] # slowest line, gets index of nearest match in dest
nn_matches <- inner_join(nn_inds, dest, by = 'ind') # join the nearest-neighbor indices back to dest to pull in the matched coordinates
sourcedest_matches <- inner_join(source, nn_matches, by = 'nrow') # join the nearest matches back to source by row index
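For concreteness, here's a minimal sketch of the most naive parallelization I can think of, using parallel::mclapply over the per-row search (n_cores is a placeholder and I haven't benchmarked this; it assumes forking is available on the instance). It does the same amount of distHaversine work, just spread across cores:
library(parallel)
n_cores <- 32 # placeholder, matching the instance's physical core count
# same per-row which.min(distHaversine(...)) work as mindist_ind(), spread across forked workers
nn_ind_list <- mclapply(seq_len(nrow(source)), function(i) {
  which.min(distHaversine(c(source$long[i], source$lat[i]), dest_mat))
}, mc.cores = n_cores)
nn_inds_par <- data.table(nrow = seq_len(nrow(source)), ind = unlist(nn_ind_list))
# nn_inds_par could then be joined exactly like nn_inds above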
The source file is ~89 million rows, and dest is roughly 50k rows. Current timings for various source sizes are as follows:
- 1000 rows -> 46 seconds
- 10000 rows -> 270 seconds
- 100000 rows -> 2580 seconds
- 1000000 rows -> 17172 seconds
While this is the quickest I've been able to get, the scaling looks roughly linear, so the full 89-million-row source file would take an estimated 17-18 days to run (89 × 17172 s ≈ 1.53 million seconds), which is far too long. I'm running this on an r4.16xlarge AWS EC2 instance with 488 GB RAM, 32 cores, and 64 vCPUs. How can I optimize/parallelize this code to run quicker?
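I'd also be open to restructuring the distance computation itself. For example, here is a rough, unbenchmarked sketch of a chunked, vectorized variant that builds a block of the full distance matrix with geosphere::distm and takes row-wise argmins (chunk_size is an arbitrary placeholder, and memory grows as chunk_size x nrow(dest)). I don't know whether this direction or spreading the per-row loop across cores is the better bet:
chunk_size <- 1e4 # placeholder; each block needs a chunk_size x nrow(dest) distance matrix in memory
source_mat <- as.matrix(source[, c('long', 'lat')])
row_chunks <- split(seq_len(nrow(source_mat)), ceiling(seq_len(nrow(source_mat)) / chunk_size))
nn_ind_chunks <- lapply(row_chunks, function(idx) {
  d <- distm(source_mat[idx, , drop = FALSE], dest_mat, fun = distHaversine) # block of the distance matrix
  max.col(-d) # row-wise index of the nearest dest point
})
nn_inds_vec <- data.table(nrow = seq_len(nrow(source_mat)), ind = unlist(nn_ind_chunks))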