6

I have two data sets of different stations. Both are data.frames with coordinates (longitude and latitude). For each station in the first data set, I want to find the nearest station in the other data set (or vice versa). My main problem is that the coordinates are not ordered and the data sets have different lengths: for example, the first one contains 2228 stations and the second one 1782, so I don't know how to handle this. I know about the function rdist.earth and tried to use it. Here is a short sample:

    library(fields)  # provides rdist.earth()

    # First data set of stations
    set1 <- structure(list(lon = c(13.671114, 12.866947, 15.94223, 11.099736,
        12.958342, 14.203892, 11.86389, 16.526674, 16.193064, 17.071392
        ), lat = c(48.39167, 48.148056, 48.721111, 47.189167, 47.054443,
        47.129166, 47.306667, 47.84, 47.304167, 48.109444)), .Names = c("lon",
        "lat"), row.names = c(NA, 10L), class = "data.frame")

    # Second data set
    set2 <- structure(list(lon = structure(c(14.4829998016357, 32.4000015258789,
        -8.66600036621094, 15.4670000076294, 18.9160003662109, 19.0160007476807,
        31.0990009307861, 14.3660001754761, 9.59899997711182, 11.0830001831055
        ), .Dim = 10L), lat = structure(c(35.8499984741211, 34.75, 70.9329986572266,
        78.25, 69.6829986572266, 74.515998840332, 70.3659973144531, 67.265998840332,
        63.6990013122559, 60.1990013122559), .Dim = 10L)), .Names = c("lon",
        "lat"), row.names = c(NA, 10L), class = "data.frame")

    # Compute the distance matrix in kilometres
    dd <- rdist.earth(set1, set2, miles = FALSE)

Now I have the matrix dd with the distances, but I don't know how to find the information for each point. For example, for the first point in data set 1, which is the nearest station in the second data set? Any ideas?

Thanks a lot.

talat
user3231352
  • This sounds similar to this question http://stackoverflow.com/questions/27329276/double-for-loop-operation-in-r-with-an-example/27336678#27336678 – tospig Dec 12 '14 at 11:36

5 Answers

19

Here is another possible solution:

library(rgeos)
set1sp <- SpatialPoints(set1)
set2sp <- SpatialPoints(set2)
set1$nearest_in_set2 <- apply(gDistance(set1sp, set2sp, byid=TRUE), 1, which.min)

head(set1)
       lon      lat nearest_in_set2
## 1 13.67111 48.39167              10
## 2 12.86695 48.14806              10
## 3 15.94223 48.72111              10
## 4 11.09974 47.18917               1
## 5 12.95834 47.05444               1
## 6 14.20389 47.12917               1
johannes
  • I have the same problem but this solution doesn't work for me. I get the error message: Error in `$<-.data.frame`(`*tmp*`, nearest_in_OBS, value = c(`1` = 1L, : replacement has 12375 rows, data has 504 Does the above solution not work for data sets of different lengths? – DJ-AFC Jul 11 '19 at 11:10
5

You can use a series of apply commands to do this. Note that the x and y in the functions refer to rows of set2 and set1 respectively, not to the lat/lon coordinates; the coordinates are passed as p1 and p2. (NOTE: edited to correct the order of set1 and set2 in the calculations. The order determines whether you find the value in set2 closest to each value in set1 or vice versa.)

distp1p2 <- function(p1,p2) {
    dst <- sqrt((p1[1]-p2[1])^2+(p1[2]-p2[2])^2)
    return(dst)
}

dist2 <- function(y) min(apply(set2, 1, function(x) min(distp1p2(x,y))))

apply(set1, 1, dist2)

Or, if you want the index of the nearest station rather than the minimum distance, change the outer min to which.min in dist2():

dist2b <- function(y) which.min(apply(set2, 1, function(x) min(distp1p2(x,y))))
apply(set1, 1, dist2b)

And to get the lat-lon for that station:

set2[apply(set1, 1, dist2b),]
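
Note that distp1p2() above treats degrees as planar units. If distances in kilometres are wanted, one option is to swap in a great-circle formula. The following is only a sketch under assumptions: it uses the haversine formula with a mean Earth radius of 6371 km, and the names haversine_km and nearest_station are made up for illustration, not taken from any package.

```r
# Great-circle (haversine) distance in km from one point to many points.
# lon/lat are in decimal degrees; radius = 6371 km is an assumed mean Earth radius.
haversine_km <- function(lon1, lat1, lon2, lat2, radius = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# For each row of `from`, return the row index of the nearest station in `to`
# and the distance to it. Both data frames need `lon` and `lat` columns;
# they may have different numbers of rows.
nearest_station <- function(from, to) {
  res <- lapply(seq_len(nrow(from)), function(i) {
    d <- haversine_km(from$lon[i], from$lat[i], to$lon, to$lat)
    j <- which.min(d)
    data.frame(nearest = j, dist_km = d[j])
  })
  do.call(rbind, res)
}

# Toy data (made up for illustration)
a <- data.frame(lon = c(0, 10), lat = c(0, 50))
b <- data.frame(lon = c(9, 1, -120), lat = c(51, 0, 10))
out <- nearest_station(a, b)
out
```

With the data from the question this would be called as nearest_station(set1, set2), and set2[out$nearest, ] would then give the coordinates of the matched stations.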
4

If you have extremely large datasets, using a distance command can be cumbersome as it must calculate the distance to all points in the alternative data for each point in the reference data. The 'ann' command from the 'yaImpute' package is a very fast approximate nearest-neighbour routine that is good for large distance calculations. It will return however many "closest" records you want (the value of k) as well as the distance to each of them.

Note: despite being an approximate nearest neighbour, the results are stable on repeated runs of the same data. It doesn't include a random selection of points or anything. See documentation.

FWIW, I'm really not kidding about fast. I've used this to find knn distances for two matrices, each with millions of points. Making a distance matrix for this or doing it iteratively row-by-row is either unfeasible or painfully slow.

Quick example:

# Hypothetical coordinate data
set.seed(2187); foo1 <- round(abs(data.frame(x=runif(5), y=runif(5))*100))
set.seed(2187); foo2 <- round(abs(data.frame(x=runif(10), y=runif(10))*100))
foo1; foo2

# the 'ann' command from the 'yaImpute' package
install.packages("yaImpute")
library(yaImpute)

# Approximate nearest-neighbour search, reporting the 3 nearest points (k=3)
# This command finds the 3 nearest points in foo2 for each point in foo1
# In the output:
#   The first k columns are the row numbers of the points
#   The next k columns (k+1 to 2k) are the *squared* euclidean distances
knn.out <- ann(as.matrix(foo2), as.matrix(foo1), k=3)
knn.out$knnIndexDist

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    5    4  729 1658 2213
[2,]    2    3    7   16  100 1025
[3,]    9    7    5   40   81  740
[4,]    4    1    6   16  580  673
[5,]    5    7    9    0  677  980
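
To make the column layout concrete, here is a small sketch that splits such a matrix into indices and plain (not squared) distances. The matrix is the output printed above, re-entered by hand so the snippet runs without yaImpute. It assumes the neighbours are ordered nearest first, as they are in that output.

```r
k <- 3

# The knnIndexDist matrix printed above, re-entered by hand for illustration
knn <- matrix(c(1, 5, 4, 729, 1658, 2213,
                2, 3, 7,  16,  100, 1025,
                9, 7, 5,  40,   81,  740,
                4, 1, 6,  16,  580,  673,
                5, 7, 9,   0,  677,  980),
              ncol = 2 * k, byrow = TRUE)

# Columns 1..k: row numbers in foo2; columns (k+1)..(2k): squared distances
idx  <- knn[, seq_len(k), drop = FALSE]
dist <- sqrt(knn[, k + seq_len(k), drop = FALSE])

# Closest foo2 row, and its distance, for each foo1 row
nearest_idx  <- idx[, 1]
nearest_dist <- dist[, 1]
```

When selecting the distance columns with a range instead, write (k + 1):(2 * k) with parentheses, because `:` binds more tightly than `+` and `*` in R.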

https://cran.r-project.org/web/packages/yaImpute/index.html

David Roberts
  • System time comparison for a 232 row reference & 14,124 row alternative: 1) apply method = 3.89 sec 2) ann method = 0.02 sec – David Roberts Sep 25 '17 at 22:18
1

The function s2_closest_feature() from the s2 package finds nearest points from different data sets.

For example, with your data:

library(s2)
set1_s2 <- s2_lnglat(set1$lon, set1$lat)
set2_s2 <- s2_lnglat(set2$lon, set2$lat)
set1$closest <- s2_closest_feature(set1_s2, set2_s2)
set1
#>         lon      lat closest
#> 1  13.67111 48.39167      10
#> 2  12.86695 48.14806      10
#> 3  15.94223 48.72111      10
#> 4  11.09974 47.18917       1
#> 5  12.95834 47.05444       1
#> 6  14.20389 47.12917       1
#> 7  11.86389 47.30667       1
#> 8  16.52667 47.84000       1
#> 9  16.19306 47.30417       1
#> 10 17.07139 48.10944       1
mharinga
0

I don't know exactly what you want, but maybe this gives you some hints. If you want the minimum value for each column:

  dd <- as.data.frame(dd)
  sapply(dd, min)
  paste(rownames(dd), ":", apply(dd,2,which.min)) #or
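
For the row-wise direction asked about in the question (the nearest set2 station for each set1 station), the same idea can be applied along rows. This is a minimal sketch on a made-up toy matrix. It assumes, as with the dd built in the question, that rows correspond to set1 and columns to set2.

```r
# Toy distance matrix (made-up values): 3 stations in set1 (rows), 4 in set2 (columns)
dd <- matrix(c(5, 2,   9,   7,
               1, 8,   3,   6,
               4, 4.5, 0.5, 10),
             nrow = 3, byrow = TRUE)

# Column index of the minimum in each row = nearest set2 station per set1 station
nearest_idx <- apply(dd, 1, which.min)

# Pick out the matching distances with matrix indexing
nearest_dist <- dd[cbind(seq_len(nrow(dd)), nearest_idx)]

result <- data.frame(set1_row = seq_len(nrow(dd)),
                     nearest_in_set2 = nearest_idx,
                     dist = nearest_dist)
result
```

Because the matrix is rectangular, this works when the two data sets have different lengths. set2[nearest_idx, ] would then return the coordinates of the matched stations.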