Match two datasets by minimum geospatial distance (R)

Question

I have the two following datasets:

houses <- data.table(house_number = c(1:3),
                     lat_decimal = seq(1.1, 1.3, by = 0.1),
                     lon_decimal = seq(1.4, 1.6, by = 0.1))
stations <- data.table(station_numer = c(1:11),
                       lat_decimal = seq(1, 2, by = 0.1),
                       lon_decimal = seq(2, 3, by = 0.1))

I want to merge houses and stations together such that the resulting station_number is the station that's closest to the corresponding house_number.

This question is very similar, but I'm not sure if they're working with latitude and longitude and also, I don't know how to calculate distances when dealing with longitude and latitude (which is why I prefer to simply use distm from the geosphere package).

I have never worked with the outer function. In case the answer from the aforementioned question would work, how can I adapt the answer to use the distmfunction instead of the sqrtfunction?

score 3 · Answer 1 · answered Feb 09 '19 at 14:47

Use match_nrst_haversine from hutilscpp:

library(hutilscpp)
houses[, c("station_number", "dist") := match_nrst_haversine(lat_decimal,
                                                             lon_decimal,
                                                             addresses_lat = stations$lat_decimal,
                                                             addresses_lon = stations$lon_decimal,
                                                             Index = stations$station_numer,
                                                             close_enough = 0,
                                                             cartesian_R = 5)]

houses
#>    house_number lat_decimal lon_decimal station_number     dist
#> 1:            1         1.1         1.4              1 67.62617
#> 2:            2         1.2         1.5              1 59.87076
#> 3:            3         1.3         1.6              1 55.59026

You may want to adjust close_enough and cartesian_R if your data are numerous (i.e. over a million points to match) for performance.

`cartesian_R`
The maximum radius of any address from the points to be geocoded. Used to accelerate the detection of minimum distances. Note, as the argument name suggests, the distance is in cartesian coordinates, so a small number is likely.
`close_enough`    
The distance, in metres, below which a match will be considered to have occurred. (The distance that is considered "close enough" to be a match.)

For example, close_enough = 10 means the first location within ten metres will be matched, even if a closer match occurs later.

May be provided as a string to emphasize the units, e.g. close_enough = "0.25km". Only km and m are permitted.

score 1 · Accepted Answer · answered Feb 09 '19 at 14:43

Your question is a bit more complicated than a simple merge, and outer is somewhat ill-suited for the purpose. To be as thorough as possible, we want to calculate the distance between all combinations of houses and stations, then keep only the closest station per house. We'll need two packages:

library(tidyverse)
library(geosphere)

First, a bit of prep. distm expects coordinates to be ordered as longitude first, latitude second (you have the opposite), so let's fix that, name the columns better, and correct a typo while we're at it:

houses <- data.frame(house_number = c(1:3),
                     lon_house = seq(1.4, 1.6, by = 0.1),
                     lat_house = seq(1.1, 1.3, by = 0.1)
                     )
stations <- data.frame(station_number = c(1:11),
                       lon_station = seq(2, 3, by = 0.1),
                       lat_station = seq(1, 2, by = 0.1)
                       )

We'll create "nested" data frames so that it's easier to keep coordinates together:

house_nest <- nest(houses, -house_number, .key = 'house_coords')
station_nest <- nest(stations, -station_number, .key = 'station_coords')

  house_number house_coords        
         <int> <list>              
1            1 <data.frame [1 × 2]>
2            2 <data.frame [1 × 2]>
3            3 <data.frame [1 × 2]>

   station_number station_coords      
            <int> <list>              
 1              1 <data.frame [1 × 2]>
 2              2 <data.frame [1 × 2]>
 3              3 <data.frame [1 × 2]>
 4              4 <data.frame [1 × 2]>
 5              5 <data.frame [1 × 2]>
 6              6 <data.frame [1 × 2]>
 7              7 <data.frame [1 × 2]>
 8              8 <data.frame [1 × 2]>
 9              9 <data.frame [1 × 2]>
10             10 <data.frame [1 × 2]>
11             11 <data.frame [1 × 2]>

Use dplyr::crossing to combine every row from both data frames:

data.master <- crossing(house_nest, station_nest)

   house_number house_coords         station_number station_coords      
          <int> <list>                        <int> <list>              
 1            1 <data.frame [1 × 2]>              1 <data.frame [1 × 2]>
 2            1 <data.frame [1 × 2]>              2 <data.frame [1 × 2]>
 3            1 <data.frame [1 × 2]>              3 <data.frame [1 × 2]>
 4            1 <data.frame [1 × 2]>              4 <data.frame [1 × 2]>
 5            1 <data.frame [1 × 2]>              5 <data.frame [1 × 2]>
 6            1 <data.frame [1 × 2]>              6 <data.frame [1 × 2]>
 7            1 <data.frame [1 × 2]>              7 <data.frame [1 × 2]>
 8            1 <data.frame [1 × 2]>              8 <data.frame [1 × 2]>
 9            1 <data.frame [1 × 2]>              9 <data.frame [1 × 2]>
10            1 <data.frame [1 × 2]>             10 <data.frame [1 × 2]>
# ... with 23 more rows

With all this now in place, we can use distm on each row to calculate a distance, and keep the shortest distance per house:

data.dist <- data.master %>% 
  mutate(dist = map2_dbl(house_coords, station_coords, distm)) %>% 
  group_by(house_number) %>% 
  filter(dist == min(dist))

  house_number house_coords         station_number station_coords         dist
         <int> <list>                        <int> <list>                <dbl>
1            1 <data.frame [1 × 2]>              1 <data.frame [1 × 2]> 67690.
2            2 <data.frame [1 × 2]>              1 <data.frame [1 × 2]> 59883.
3            3 <data.frame [1 × 2]>              1 <data.frame [1 × 2]> 55519.

Match two datasets by minimum geospatial distance (R)

2 Answers2

Linked