I have two matrices of points: one has 200K rows, the other has 20K. For each row (a point) in the first matrix, I am trying to find the row (also a point) in the second matrix that is closest to it. This is the first method that I tried on a sample dataset:
#Test dataset
pixels.latlon=cbind(runif(200000,min=-180, max=-120), runif(200000, min=50, max=85))
grwl.latlon=cbind(runif(20000,min=-180, max=-120), runif(20000, min=50, max=85))
#calculate the distance matrix
library(geosphere)
dist.matrix=distm(pixels.latlon, grwl.latlon, fun=distHaversine)
#Pick out the indices of the minimum distance
rnum=apply(dist.matrix, 1, which.min)
However, when I call the distm function I get this error: Error: cannot allocate vector of size 30.1 Gb
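For context, the full distance matrix would hold 200,000 x 20,000 doubles at 8 bytes each, which is about 32 GB, so the allocation fails on most machines. Since only the row-wise minimum is needed, one possible workaround is to build the matrix a block of rows at a time and keep just the index of the minimum for each row. The following is a minimal sketch of that idea, not a tuned implementation; the helper name nearest_in_chunks and the chunk_size default are made up for illustration. With 1,000 rows per block, each block against 20,000 targets is roughly 160 MB.

```r
library(geosphere)  # distm(), distHaversine()

# Hypothetical helper (not part of geosphere): for each row of `points`,
# return the row index of the closest point in `targets`, computing the
# distance matrix one block of rows at a time to cap memory use.
nearest_in_chunks <- function(points, targets, chunk_size = 1000) {
  n <- nrow(points)
  nearest <- integer(n)
  for (start in seq(1, n, by = chunk_size)) {
    end <- min(start + chunk_size - 1, n)
    # drop = FALSE keeps a single-row block as a matrix
    block <- points[start:end, , drop = FALSE]
    # Distances in metres: one row per point in the block, one column per target
    d <- distm(block, targets, fun = distHaversine)
    nearest[start:end] <- apply(d, 1, which.min)
  }
  nearest
}

# Small usage example with the same coordinate ranges as the test dataset
set.seed(1)
p <- cbind(runif(25, min = -180, max = -120), runif(25, min = 50, max = 85))
q <- cbind(runif(40, min = -180, max = -120), runif(40, min = 50, max = 85))
idx <- nearest_in_chunks(p, q, chunk_size = 7)
```

This gives the same indices as the single distm call, only with a smaller peak memory footprint. It does not reduce the number of distance evaluations, so it will still be slow at 200K x 20K.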
There have been several posts on this topic:
This one uses bigmemory to calculate distances between points in the SAME dataframe, but I'm not sure how to adapt it to calculate distances between points in two different matrices: https://stevemosher.wordpress.com/2012/04/12/nick-stokes-distance-code-now-with-big-memory/
This one also only calculates a distance matrix between points in the SAME matrix: "Efficient (memory-wise) function for repeated distance matrix calculations AND chunking of extra large distance matrices"
And this one is pretty much identical to what I want to do, but they didn't actually come up with a solution that worked for large data: "R: distm with Big Memory". I tried this method, which uses bigmemory, but I get the following error, I think because the dataframe is too large:
Error in CreateFileBackedBigMatrix(as.character(backingfile), as.character(backingpath), :
Problem creating filebacked matrix.
Has anyone come up with a good solution to this problem? I am open to other package ideas!
Updated code which fixed the issue
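library(dplyr)      # tibble(), filter(), %>%
library(geosphere)  # distm(), distHaversine()
pixels.latlon=cbind(runif(200000,min=-180, max=-120), runif(200000, min=50, max=85))
grwl.tibble = tibble(long=runif(20000,min=-180, max=-120), lat=runif(20000, min=50, max=85), id=runif(20000, min=0, max=20000))
#For each pixel, only compare against grwl points inside a +/- 0.3 degree box
rnum <- apply(pixels.latlon, 1, function(x) {
  xlon=x[1]
  xlat=x[2]
  grwl.filt = grwl.tibble %>%
    filter(long < (xlon+0.3) & long > (xlon-0.3) & lat < (xlat+0.3) & lat > (xlat-0.3))
  #No candidate inside the box: return NA so the result stays a plain vector
  if (nrow(grwl.filt) == 0) return(NA_real_)
  grwl.latlon.filt = cbind(grwl.filt$long, grwl.filt$lat)
  #1 x n matrix of distances (in metres) from this pixel to each candidate
  dm <- distm(x, grwl.latlon.filt, fun=distHaversine)
  #id of the closest candidate
  grwl.filt$id[which.min(dm)]
})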
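One caveat with the box filter: it is a heuristic. A box can be empty, and at these latitudes a degree of longitude is much shorter than a degree of latitude, so the closest point overall may lie just outside the box. The sketch below is one way to check how often that happens on a small random subsample, where the exact answer is still cheap to compute. The sample size of 200 and the fixed seed are arbitrary assumptions, and the variable names are made up for this example.

```r
library(geosphere)  # distm(), distHaversine()

set.seed(42)  # assumption: fixed seed only so the check is reproducible
pixels <- cbind(runif(200, min = -180, max = -120), runif(200, min = 50, max = 85))
grwl   <- cbind(runif(20000, min = -180, max = -120), runif(20000, min = 50, max = 85))

# Exact nearest neighbour from the full (small) 200 x 20000 distance matrix
exact <- apply(distm(pixels, grwl, fun = distHaversine), 1, which.min)

# Nearest neighbour restricted to the +/- 0.3 degree box; NA when the box is empty
boxed <- sapply(seq_len(nrow(pixels)), function(i) {
  inside <- which(abs(grwl[, 1] - pixels[i, 1]) < 0.3 &
                  abs(grwl[, 2] - pixels[i, 2]) < 0.3)
  if (length(inside) == 0) return(NA_integer_)
  d <- distHaversine(pixels[i, ], grwl[inside, , drop = FALSE])
  inside[which.min(d)]
})

cat("pixels with an empty box:", sum(is.na(boxed)), "\n")
cat("pixels where the box misses the true nearest point:",
    sum(boxed != exact, na.rm = TRUE), "\n")
```

If either count is too high for the application, widening the box (or widening it only in longitude) trades speed for accuracy.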