0

I have to data frames of points (lat+lon pairs). For each point in the first data frame, I would like to find the closest point in the second data frame. This question and answer gets me almost there (Finding closest coordinates between two large data sets). The only difference, as suggested in a comment, is that I would like to use geosphere::distGeo or raster::pointDistance instead of what is in the answer. I have tried doing this (see code below), but get an error. How can I fix it? And what units is the Distance variable in? I would like it to be in miles.

Code - copied directly from aforementioned question, except (incorrectly) modified to try to get what I want.

df1 <- structure(list(id = c(1L, 2L, 4L, 5L, 
                      6L, 7L, 8L, 9, 10L, 3L), 
               HIGH_PRCN_LAT = c(52.881442267773, 57.8094538200198, 34.0233529, 
                              63.8087900198, 53.6888144440184, 63.4462810678651, 21.6075544376207, 
                              78.324442654172, 66.85532539759495, 51.623544596), HIGH_PRCN_LON = c(-2.87377812157822, 
                                                                                                 -2.23454414781635, -3.0984448341, -2.439163178635, -7.396111601421454, 
                                                                                                 -5.162345043546359, -8.63311254098095, 3.813289888829932, 
                                                                                                 -3.994325961186105, -8.9065532453272409), SRC_ID = c(NA, NA, 
                                                                                                                                                     NA, NA, NA, NA, NA, NA, NA, NA), distance = c(NA, NA, 
                                                                                                                                                                                                  NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, 10L), class = "data.frame")


df2 <- structure(list(SRC_ID = c(55L, 54L, 23L, 11L, 44L, 21L, 76L, 
                          5688L, 440L, 61114L), HIGH_PRCN_LAT = c(68.46506, 50.34127, 61.16432, 
                                                                42.57807, 52.29879, 68.52132, 87.83912, 55.67825, 29.74444, 34.33228
                          ), HIGH_PRCN_LON = c(-5.0584, -5.95506, -5.75546, -5.47801, -3.42062, 
                                             -6.99441, -2.63457, -2.63057, -7.52216, -1.65532)), row.names = c(NA, 
                                                                                                              10L), class = "data.frame")

library(data.table)
library(raster)
library(geosphere)
#Euclidean distance 
mydist <- function(a, b, df1, x, y){
    
    #dt <- data.table(sqrt((df1[[x]]-a)^2 + (df1[[y]]-b)^2))
    #I couldn't figure out how to modify this correcly:
    dt <- data.table(pointDistance(c(a,b), c(df1[[x]],df1[[y]]), lonlat=TRUE))
    #dt <- data.table(distGeo(c(a,b), c(df1[[x]],df1[[y]]), lonlat=TRUE))
    
    return(data.table(Closest.V1  = which.min(dt$V1),
                      Distance    = dt[which.min(dt$V1)]))
}

setDT(df1)[, j = mydist(HIGH_PRCN_LAT, HIGH_PRCN_LON, setDT(df2), 
                        "HIGH_PRCN_LAT", "HIGH_PRCN_LON"), 
           by = list(id, HIGH_PRCN_LAT, HIGH_PRCN_LON)]

This is the error I get:

Error in .pointsToMatrix(p2) : Wrong length for a vector, should be 2
bill999
  • 2,147
  • 8
  • 51
  • 103
  • 1
    Perhaps also update above with your error returned so when we get the same or different error... – Chris Mar 17 '22 at 03:58
  • Good point @Chris - I just updated. – bill999 Mar 17 '22 at 04:00
  • And now, for something completely bogus, as to my confidence in what I'm saying..., perhaps instantiate setDT 1 & 2 prior to passing into `my_dist`, as I think you have an evn problem here with setDT(df2), as within the function env df2 is likely empty and might explain the seeming 'second point not found-ish' to calc a distance from error. Or, as usual, I'm completely wrong, and that would be more usual. – Chris Mar 17 '22 at 05:01
  • Thanks, @Chris. I am not sure either. I do know that the `dt <- data.table(sqrt((df1[[x]]-a)^2 + (df1[[y]]-b)^2))` code works (in case that is relevant). – bill999 Mar 17 '22 at 05:08
  • I'm blind or something. What are `a` and `b` here and if not found in either df1 or 2, are they some sort of constants (perhaps shift) that went wayward? – Chris Mar 17 '22 at 05:26
  • I think they are just arguments to the function. – bill999 Mar 17 '22 at 05:32
  • I'd suggest dropping `my_dist` approach and going with [Hijmans'](https://stackoverflow.com/questions/55752064/finding-closest-coordinates-between-two-large-data-sets?rq=1) approach and answer, he did write `raster` after all and probably has a pretty good sense of these things. Note to note, the column swap there. – Chris Mar 17 '22 at 05:41
  • Above code doesn't search all combinations, but ony makes row wise comparison – Waldi Mar 17 '22 at 08:35
  • Thanks @Chris - for some reason I didn't see that answer, but I should have looked more closely. – bill999 Mar 17 '22 at 13:24

1 Answers1

1

c(df1[[x]],df1[[y]]) should be changed to df1[, c(y,x), with=FALSE] because geosphere::distGeo / raster::pointDistance expects a 2-column matrix of points. Also, it expects the points to be supplied as (longitude, latitude) - hence the (x,y) should be flipped to (y,x).

Using raster::pointDistance:

# Using raster::pointDistance
library(data.table)
library(raster)
library(geosphere)

setDT(df1)
setDT(df2)

mydist <- function(a, b, df1, x, y) {
    dt <- data.table(pointDistance(c(b,a), df1[, c(y,x), with=FALSE], lonlat=TRUE))
    
    return (
        data.table(Closest.V1 = which.min(dt$V1),
        Distance = dt[which.min(dt$V1)])
    )
}

df1[, j = mydist(HIGH_PRCN_LAT, HIGH_PRCN_LON,
    df2, "HIGH_PRCN_LAT", "HIGH_PRCN_LON"),
    by = list(id, HIGH_PRCN_LAT, HIGH_PRCN_LON)]

enter image description here

Using geosphere::distGeo (same result as above):

# Using geosphere::distGeo
mydist <- function(a, b, df1, x, y) {
    dt <- data.table(distGeo(c(b,a), df1[, c(y,x), with=FALSE]))
    
    return (
        data.table(Closest.V1 = which.min(dt$V1),
        Distance = dt[which.min(dt$V1)])
    )
}

df1[, j = mydist(HIGH_PRCN_LAT, HIGH_PRCN_LON,
    df2, "HIGH_PRCN_LAT", "HIGH_PRCN_LON"),
    by = list(id, HIGH_PRCN_LAT, HIGH_PRCN_LON)]