2

I am aiming to identify the nearest entry in dataset 2 to each entry in dataset 1 based on the coordinates in both datasets. Dataset 1 contains 180,000 rows (only 1,800 unique coordinates) and dataset 2 contains contains 4,500 rows (full 4,500 unique coordinates).

I have attempted to replicate the answers from similar questions on stackoverflow. for example:

R - Finding closest neighboring point and number of neighbors within a given radius, coordinates lat-long

Calculating the distance between points in different data frames

However these do not solve the problem in the way I want (they either join the data frames or check the distances within a single dataframe).

The solution in Find the nearest X,Y coordinate using R and related posts are the closest I have found so far.

My issue with the post is that it works out the distance between coordinates within a single dataframe, and I have been unable to understand which parameters to change in RANN::nn2 to do it across two data frames.

Proposed code that doesn't work:

library(RANN)
dataset1[,4]<- nn2(data=dataset1, query=dataset2, k=2)

Notes/Questions:

1) Which dataset should be provided to the query to find the closest value in dataset 2 to a given value in dataset 1?

2) Is there any way to avoid the problem that the datasets seem to need to be the same width (number of columns)?

3) How can the outputs (SRD_ID and distance) be added to the relevant entry in dataset 1?

4) What is the use of eps parameter in the RANN::nn2 function?

The aim is to populate the SRC_ID and distance columns in dataset 1 with the nearest station ID from dataset 2 and the distance between the entry in dataset 1 and the nearest entry in dataset 2.

Below is a table demostrating the expected results. Note: the SRC_ID and distance values are example values I have manually added myself, are almost certainly incorrect and will likely not be replicated by the code.

       id HIGH_PRCN_LAT HIGH_PRCN_LON SRC_ID distance
1 3797987      52.88121     -2.873734     55      350 
2 3798045      53.80945     -2.439163     76     2100

Data:

r details

platform        x86_64-w64-mingw32
version.string  R version 3.5.3 (2019-03-11)

data set 1 input (not narrowed down to unique coordinates)

structure(list(id = c(1L, 2L, 4L, 5L, 
6L, 7L, 8L, 9, 10L, 3L), 
    HIGH_PRCN_LAT = c(52.881442267773, 57.8094538200198, 34.0233529, 
    63.8087900198, 53.6888144440184, 63.4462810678651, 21.6075544376207, 
    78.324442654172, 66.85532539759495, 51.623544596), HIGH_PRCN_LON = c(-2.87377812157822, 
    -2.23454414781635, -3.0984448341, -2.439163178635, -7.396111601421454, 
    -5.162345043546359, -8.63311254098095, 3.813289888829932, 
    -3.994325961186105, -8.9065532453272409), SRC_ID = c(NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA), distance = c(NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, 10L), class = "data.frame")

data set 2 input

structure(list(SRC_ID = c(55L, 54L, 23L, 11L, 44L, 21L, 76L, 
5688L, 440L, 61114L), HIGH_PRCN_LAT = c(68.46506, 50.34127, 61.16432, 
42.57807, 52.29879, 68.52132, 87.83912, 55.67825, 29.74444, 34.33228
), HIGH_PRCN_LON = c(-5.0584, -5.95506, -5.75546, -5.47801, -3.42062, 
-6.99441, -2.63457, -2.63057, -7.52216, -1.65532)), row.names = c(NA, 
10L), class = "data.frame")
M--
  • 25,431
  • 8
  • 61
  • 93
Kickball
  • 143
  • 1
  • 2
  • 12
  • If they are lat and long you need to think about using `geosphere::distHaversine` to get the right distance. But finding the closest points works as suggested in the answers. – M-- Apr 18 '19 at 20:17

2 Answers2

3

I wrote up an answer referencing this thread. The function is modified to take care of reporting the distance and avoid hard-coding. Please note that it calculates Euclidean distance.

library(data.table)
#Euclidean distance 
mydist <- function(a, b, df1, x, y){

          dt <- data.table(sqrt((df1[[x]]-a)^2 + (df1[[y]]-b)^2))

          return(data.table(Closest.V1  = which.min(dt$V1),
                            Distance    = dt[which.min(dt$V1)]))
           }

setDT(df1)[, j = mydist(HIGH_PRCN_LAT, HIGH_PRCN_LON, setDT(df2), 
                        "HIGH_PRCN_LAT", "HIGH_PRCN_LON"), 
                         by = list(id, HIGH_PRCN_LAT, HIGH_PRCN_LON)]
  #     id HIGH_PRCN_LAT HIGH_PRCN_LON Closest.V1 Distance.V1
  # 1:   1      52.88144     -2.873778          5   0.7990743
  # 2:   2      57.80945     -2.234544          8   2.1676868
  # 3:   4      34.02335     -3.098445         10   1.4758202
  # 4:   5      63.80879     -2.439163          3   4.2415854
  # 5:   6      53.68881     -7.396112          2   3.6445416
  # 6:   7      63.44628     -5.162345          3   2.3577811
  # 7:   8      21.60755     -8.633113          9   8.2123762
  # 8:   9      78.32444      3.813290          7  11.4936496
  # 9:  10      66.85533     -3.994326          1   1.9296370
  # 10:  3      51.62354     -8.906553          2   3.2180026

You can use RANN::nn2, but you need to make sure to use the right syntax. Following works!

as.data.frame(RANN::nn2(df2[,c(2,3)],df1[,c(2,3)],k=1))
#    nn.idx   nn.dists
# 1       5  0.7990743
# 2       8  2.1676868
# 3      10  1.4758202
# 4       3  4.2415854
# 5       2  3.6445416
# 6       3  2.3577811
# 7       9  8.2123762
# 8       7 11.4936496
# 9       1  1.9296370
# 10      2  3.2180026
M--
  • 25,431
  • 8
  • 61
  • 93
  • `df1[ , c(4,5)] <- as.data.frame(RANN::nn2(df2[,c(2,3)],df1[,c(2,3)],k=1))` to assign the result to desired columns. – M-- Apr 18 '19 at 21:11
  • Thank you for the solution, the second version that used RANN::nn2 worked very quickly despite several hundred thousand rows. It appears that the snippit provided in your comment will populate column 4 of dataset1 with the row number of the nearest entry in dataset 2. Do you have any suggestions of how this could be adapted to input the SRC_ID value from dataset 2 instead? – Kickball Apr 18 '19 at 22:05
  • @Kickball: `df1[ , c(4,5)] <- as.data.frame(RANN::nn2(df2[,c(2,3)],df1[,c(2,3)],k=1)); df1[,4] <- df2[df1[, 4], 1]` – M-- Apr 18 '19 at 22:14
  • Thank you, that successfully allocated SRC_ID to dataset 1. I should have mentioned it earlier but do you know if there is a similar function to RANN:nn2 that uses ideally Ellipsoidal/Vincenty or Spherical/Haversine distances? This would work better for my research. I haven't had much luck beyond finding geosphere::distVincentyEllipsoid but I am not sure how to adapt the code to the different function. If you have any ideas it would be greatly appreciated! – Kickball Apr 18 '19 at 22:42
  • @Kickball Just left my office. If I were you, I would've posted another question, referencing to this solution and asking how to do it for other methods. Unfortunately, I don't have the answer off the top of my head and now am replying on my phone. Good luck. – M-- Apr 18 '19 at 22:47
  • Thanks I will post a new question and link back to this. – Kickball Apr 18 '19 at 22:49
  • 1
    geosphere::distGeo or the identical raster::pointDistance are better (more precise) options than Haversine etc. – Robert Hijmans Apr 19 '19 at 04:18
  • @RobertHijmans, I understand the relative accuracy of `distVincentyEllipsoid` over `distHaversine`, is there a FAQ or discussion that brings `distGeo` into the comparison of relative accuracy? (I use `geosphere` daily, btw, and have come to trust and appreciate its speed and utility. Thank you for it!) – r2evans Mar 17 '22 at 13:22
  • See the "Background" section and the Karney paper referred to here: https://geographiclib.sourceforge.io/html/js/tutorial-1-geodesics.html – Robert Hijmans Mar 19 '22 at 07:45
2

Data

x = structure(list(id = c(1L, 2L, 4L, 5L, 
6L, 7L, 8L, 9, 10L, 3L), 
    HIGH_PRCN_LAT = c(52.881442267773, 57.8094538200198, 34.0233529, 
    63.8087900198, 53.6888144440184, 63.4462810678651, 21.6075544376207, 
    78.324442654172, 66.85532539759495, 51.623544596), HIGH_PRCN_LON = c(-2.87377812157822, 
    -2.23454414781635, -3.0984448341, -2.439163178635, -7.396111601421454, 
    -5.162345043546359, -8.63311254098095, 3.813289888829932, 
    -3.994325961186105, -8.9065532453272409), SRC_ID = c(NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA), distance = c(NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, 10L), class = "data.frame")

y = structure(list(SRC_ID = c(55L, 54L, 23L, 11L, 44L, 21L, 76L, 
 5688L, 440L, 61114L), HIGH_PRCN_LAT = c(68.46506, 50.34127, 61.16432, 
 42.57807, 52.29879, 68.52132, 87.83912, 55.67825, 29.74444, 34.33228
 ), HIGH_PRCN_LON = c(-5.0584, -5.95506, -5.75546, -5.47801, -3.42062, 
 -6.99441, -2.63457, -2.63057, -7.52216, -1.65532)), row.names = c(NA, 
 10L), class = "data.frame")

Solution. Note the "3:2" to get "longitude / latitude", in that order.

library(raster)

d <- pointDistance(x[,3:2], y[,3:2], lonlat=TRUE, allpairs=T) 
i <- apply(d, 1, which.min)

x$SRC_ID = y$SRC_ID[i]
x$distance = d[cbind(1:nrow(d), i)]
x

#   id HIGH_PRCN_LAT HIGH_PRCN_LON SRC_ID   distance
#1   1      52.88144     -2.873778     44   74680.48
#2   2      57.80945     -2.234544   5688  238553.51
#3   4      34.02335     -3.098445  61114  137385.18
#4   5      63.80879     -2.439163     23  340642.70
#5   6      53.68881     -7.396112     44  308458.73
#6   7      63.44628     -5.162345     23  256176.88
#7   8      21.60755     -8.633113    440  908292.28
#8   9      78.32444      3.813290     76 1064419.47
#9  10      66.85533     -3.994326     55  185119.29
#10  3      51.62354     -8.906553     54  251580.45

Illustrated

plot(x[,3:2], ylim=c(0,90), col="blue", pch=20)
points(y[,3:2], col="red", pch=20)
for (i in 1:nrow(x)) {
    j <- y$SRC_ID==x$SRC_ID[i]
    arrows(x[i,3], x[i,2], y[j,3], y[j,2],length=.1) 
}

text(x[,3:2], labels=x$id, pos=1, cex=.75)
text(y[,3:2], labels=y$SRC_ID, pos=3, cex=.75)

plot

Robert Hijmans
  • 40,301
  • 4
  • 55
  • 63
  • 1
    Thank you for the detailed response include a visualisation! Unfortunately, when I run your code against the full size datasets the following error is thrown at the `i <- apply(d, 1, which.min)` stage, `Error: cannot allocate vector of size 5.7 Gb` due to r consuming more than 11GB of my system's RAM. I could split my unique coordinate pairs into a separate dataframe/set and then use that as dataset 1 but I will try the other solution first. – Kickball Apr 18 '19 at 21:52