4

I am trying to find the ID of the closest LAT_LON in a data.frame with reference to my original data.frame. I have already figured this out by merging both data.frames on a unique identifier and the calculating the distance based on the distHaverSine function from geosphere. Now, I want to take step further and join the data.frames without the unique identifier and find ID the nearest LAT-LON. I have used the following code after merging:

v3 <-v2 %>% mutate(CTD = distHaversine(cbind(LON.x, LAT.x), cbind(LON.y, LAT.y)))

DATA:

loc <- data.frame(station = c('Baker Street','Bank'),
     lat = c(51.522236,51.5134047),
     lng = c(-0.157080, -0.08905843),
               postcode = c('NW1','EC3V'))
stop <- data.frame(station = c('Angel','Barbican','Barons Court','Bayswater'),
                lat = c(51.53253,51.520865,51.490281,51.51224),
                lng = c(-0.10579,-0.097758,-0.214340,-0.187569),
                postcode = c('EC1V','EC1A', 'W14', 'W2'))

As a final result I would like something like this:

df <- data.frame(loc = c('Baker Street','Bank','Baker Street','Bank','Baker Street','Bank','Baker 
        Street','Bank'), 
              stop = c('Angel','Barbican','Barons Court','Bayswater','Angel','Barbican','Barons Court','Bayswater'), 
              dist = c('x','x','x','x','x','x','x','x'), 
              lat = c(51.53253,51.520865,51.490281,51.51224,51.53253,51.520865,51.490281,51.51224), 
              lng = c(-0.10579,-0.097758,-0.214340,-0.187569,-0.10579,-0.097758,-0.214340,-0.187569),
              postcode = c('EC1V','EC1A', 'W14', 'W2','EC1V','EC1A', 'W14', 'W2')
              )

Any help is appreciated. Thanks.

s_baldur
  • 29,441
  • 4
  • 36
  • 69
marine8115
  • 588
  • 3
  • 22
  • This might be helpful https://stackoverflow.com/questions/21977720/r-finding-closest-neighboring-point-and-number-of-neighbors-within-a-given-rad – Ronak Shah Jan 16 '20 at 09:09
  • @RonakShah, it soes not solve the question as my dataset is too large. keeps computing for a long time – marine8115 Jan 16 '20 at 09:46
  • 1
    Here is another potential option. https://stackoverflow.com/questions/58831578/minimum-distance-between-lat-long-across-multiple-data-frames/58841322#58841322. This is a M*N problem, as either dataframe grows the it just takes longer. To improve performance, reduce the size of the problem, either using a divide and conquer algorithm or reduce the precision of the starting locations from 5 decimal places down to three places. If you round the starting locations, you may find a large number of duplicates and thus save the time of recalculating. – Dave2e Jan 16 '20 at 13:55
  • Thanks for that @Dave2e. I cannot reduce the precision as I am dealing with objects very close to each other.I can reduce the size of the problem, does `distmatrix` calculate `Haversine distance` by default? Thanks – marine8115 Jan 16 '20 at 14:11
  • I believe it uses the `distGeo` method which assumes an ellipsoidal and not a sphere. – Dave2e Jan 16 '20 at 14:37
  • As the objects are very close together, you should just use Euclidean distance, not Haversine. This will save processing time, but won't reduce the dimensionality of the matrix. – dww Jan 19 '20 at 17:42
  • You could have a look at the `FNN` (https://cran.r-project.org/web/packages/FNN/index.html) package. – Jan van der Laan Jan 19 '20 at 18:53
  • Is there any particular reason you are using a geographic coordinate system (lat/long)? Unless your locations have a very wide geographic spread then the whole thing becomes a lot easier if you can use an appropriate projected coordinate system. That avoids having to do Haversine distances (slow) or treating the latitude and longitude as planar (the Equirectangular projection) and calculating euclidean distances. That works approximately at the equator but if your data is really from London, then it it introduces substantial errors. – David_O Jan 23 '20 at 10:32

4 Answers4

6

All of the joining, distance calculations, and plotting can be done with available R packages.

library(tidyverse)
library(sf)
#> Linking to GEOS 3.6.2, GDAL 2.2.3, PROJ 4.9.3
library(nngeo)
library(mapview)

## Original data
loc <- data.frame(station = c('Baker Street','Bank'),
                  lat = c(51.522236,51.5134047),
                  lng = c(-0.157080, -0.08905843),
                  postcode = c('NW1','EC3V'))

stop <- data.frame(station = c('Angel','Barbican','Barons Court','Bayswater'),
                   lat = c(51.53253,51.520865,51.490281,51.51224),
                   lng = c(-0.10579,-0.097758,-0.214340,-0.187569),
                   postcode = c('EC1V','EC1A', 'W14', 'W2'))

df <- data.frame(loc = c('Baker Street','Bank','Baker Street','Bank','Baker Street','Bank','Baker 
        Street','Bank'), 
                 stop = c('Angel','Barbican','Barons Court','Bayswater','Angel','Barbican','Barons Court','Bayswater'), 
                 dist = c('x','x','x','x','x','x','x','x'), 
                 lat = c(51.53253,51.520865,51.490281,51.51224,51.53253,51.520865,51.490281,51.51224), 
                 lng = c(-0.10579,-0.097758,-0.214340,-0.187569,-0.10579,-0.097758,-0.214340,-0.187569),
                 postcode = c('EC1V','EC1A', 'W14', 'W2','EC1V','EC1A', 'W14', 'W2')
)



## Create sf objects from lat/lon points
loc_sf <- loc %>% st_as_sf(coords = c('lng', 'lat'), remove = T) %>%
  st_set_crs(4326) 

stop_sf <- stop %>% st_as_sf(coords = c('lng', 'lat'), remove = T) %>%
  st_set_crs(4326) 


# Use st_nearest_feature to cbind loc to stop by nearest points
joined_sf <- stop_sf %>% 
  cbind(
    loc_sf[st_nearest_feature(stop_sf, loc_sf),])


## mutate to add column showing distance between geometries
joined_sf %>%
  mutate(dist = st_distance(geometry, geometry.1, by_element = T))
#> Simple feature collection with 4 features and 5 fields
#> Active geometry column: geometry
#> geometry type:  POINT
#> dimension:      XY
#> bbox:           xmin: -0.21434 ymin: 51.49028 xmax: -0.097758 ymax: 51.53253
#> epsg (SRID):    4326
#> proj4string:    +proj=longlat +datum=WGS84 +no_defs
#>        station postcode    station.1 postcode.1                   geometry
#> 1        Angel     EC1V         Bank       EC3V  POINT (-0.10579 51.53253)
#> 2     Barbican     EC1A         Bank       EC3V POINT (-0.097758 51.52087)
#> 3 Barons Court      W14 Baker Street        NW1  POINT (-0.21434 51.49028)
#> 4    Bayswater       W2 Baker Street        NW1 POINT (-0.187569 51.51224)
#>                    geometry.1         dist
#> 1 POINT (-0.08905843 51.5134) 2424.102 [m]
#> 2 POINT (-0.08905843 51.5134) 1026.449 [m]
#> 3   POINT (-0.15708 51.52224) 5333.417 [m]
#> 4   POINT (-0.15708 51.52224) 2390.791 [m]



## Use nngeo and mapview to plot lines on a map
# NOT run for reprex, output image attached 
#connected <- st_connect(stop_sf, loc_sf)
# mapview(connected) + 
#   mapview(loc_sf, color = 'red') +
#   mapview(stop_sf, color = 'black')

Created on 2020-01-21 by the reprex package (v0.3.0)

enter image description here

mrhellmann
  • 5,069
  • 11
  • 38
5

As the distances between the object are small we can speed up the computation by using the euclidian distance between the coordinates. As we are not around the equator, the lng coordinates are squished a bit; we can make the comparison slightly better by scaling the lng a bit.

cor_stop <- stop[, c("lat", "lng")]
cor_stop$lng <- cor_stop$lng * sin(mean(cor_stop$lat, na.rm = TRUE)/180*pi)
cor_loc <- loc[, c("lat", "lng")]
cor_loc$lng <- cor_loc$lng * sin(mean(cor_loc$lat, na.rm = TRUE)/180*pi)

We can then calculate the closest stop for each location using the FNN package which uses tree based search to quickly find the closest K neighbours. This should scale to big data sets (I have used this for datasets with millions of records):

library(FNN)
matches <- knnx.index(cor_stop, cor_loc, k = 1)
matches
##      [,1]
## [1,]    4
## [2,]    2

We can then construct the end result:

res <- loc
res$stop_station  <- stop$station[matches[,1]]
res$stop_lat      <- stop$lat[matches[,1]]
res$stop_lng      <- stop$lng[matches[,1]]
res$stop_postcode <- stop$postcode[matches[,1]]

And calculate the actual distance:

library(geosphere)
res$dist <- distHaversine(res[, c("lng", "lat")], res[, c("stop_lng", "stop_lat")])
res
##          station      lat         lng postcode stop_station stop_lat  stop_lng
## 1 Baker Street 51.52224 -0.15708000      NW1    Bayswater 51.51224 -0.187569
## 2         Bank 51.51340 -0.08905843     EC3V     Barbican 51.52087 -0.097758
##   stop_postcode     dist
## 1            W2 2387.231
## 2          EC1A 1026.091

I you are unsure that the closest point in lat-long is also the closest point 'as the bird flies', you could use this method to first select the K closest points in lat-long; then calculate the distances for those points and then selecting the closest point.

Jan van der Laan
  • 8,005
  • 1
  • 20
  • 35
0

You can avoid searching for nearest neighbours completely if you are able to use a projected coordinate system. If you can, then you can cheaply construct Voronoi polygons around each location - these polygons define areas that are closest to each of the input points.

You can then just use GIS intersections to find which points lie in which polygons and then calculate the distances for known pairs of closest points. I think this should be much faster. However, you can't use Voronoi polygons with geographic coordinates.

loc <- data.frame(station = c('Baker Street','Bank'),
     lat = c(51.522236,51.5134047),
     lng = c(-0.157080, -0.08905843),
               postcode = c('NW1','EC3V'))

stop <- data.frame(station = c('Angel','Barbican','Barons Court','Bayswater'),
                lat = c(51.53253,51.520865,51.490281,51.51224),
                lng = c(-0.10579,-0.097758,-0.214340,-0.187569),
                postcode = c('EC1V','EC1A', 'W14', 'W2'))

# Convert to a suitable PCS (in this case OSGB)
stop <- st_as_sf(stop, coords=c('lng','lat'), crs=4326)
stop <- st_transform(stop, crs=27700)
loc <- st_as_sf(loc, coords=c('lng','lat'), crs=4326)
loc <- st_transform(loc, crs=27700)

# Extract Voronoi polygons around locations and convert to an sf object
loc_voronoi <- st_collection_extract(st_voronoi(do.call(c, st_geometry(loc))))
loc_voronoi <- st_sf(loc_voronoi, crs=crs(loc))

# Match Voronoi polygons to locations and select that geometry
loc$voronoi <- loc_voronoi$loc_voronoi[unlist(st_intersects(loc, loc_voronoi))]
st_geometry(loc) <- 'voronoi'

# Find which stop is closest to each location
stop$loc <- loc$station[unlist(st_intersects(stop, loc))]

# Reset locs to use the point geometry and get distances
st_geometry(loc) <- 'geometry'
stop$loc_dist <- st_distance(stop, loc[stop$loc,], by_element=TRUE)

That gives you the following output:

Simple feature collection with 4 features and 4 fields
geometry type:  POINT
dimension:      XY
bbox:           xmin: 524069.7 ymin: 178326.3 xmax: 532074.6 ymax: 183213.9
epsg (SRID):    27700
proj4string:    +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs
       station postcode                  geometry          loc     loc_dist
1        Angel     EC1V POINT (531483.8 183213.9)         Bank 2423.722 [m]
2     Barbican     EC1A POINT (532074.6 181931.2)         Bank 1026.289 [m]
3 Barons Court      W14 POINT (524069.7 178326.3) Baker Street 5332.478 [m]
4    Bayswater       W2 POINT (525867.7 180813.9) Baker Street 2390.377 [m]
David_O
  • 1,143
  • 7
  • 16
0

I am not sure whether I understand your question correctly, but you can first cross join loc and stop, and then add a column with the distance.

library(dplyr)

loc <- data.frame(station = c('Baker Street','Bank'),
                 lat = c(51.522236,51.5134047),
                 lng = c(-0.157080, -0.08905843),
                 postcode = c('NW1','EC3V'))

stop <- data.frame(station = c('Angel','Barbican','Barons Court','Bayswater'),
                  lat = c(51.53253,51.520865,51.490281,51.51224),
                  lng = c(-0.10579,-0.097758,-0.214340,-0.187569),
                  postcode = c('EC1V','EC1A', 'W14', 'W2'))

# Create data.tables
loc_dt <- data.table::setDT(loc)
stop_dt <- data.table::setDT(stop)

# Cross join two data.tables
coordinates_dt <- optiRum::CJ.dt(loc_dt, stop_dt)

# Add column with distance in meters
coordinates_dt %>%
 mutate(dist_m = spatialrisk::haversine(lat, lng, i.lat, i.lng))
#>         station      lat         lng postcode    i.station    i.lat     i.lng
#> 1: Baker Street 51.52224 -0.15708000      NW1        Angel 51.53253 -0.105790
#> 2:         Bank 51.51340 -0.08905843     EC3V        Angel 51.53253 -0.105790
#> 3: Baker Street 51.52224 -0.15708000      NW1     Barbican 51.52087 -0.097758
#> 4:         Bank 51.51340 -0.08905843     EC3V     Barbican 51.52087 -0.097758
#> 5: Baker Street 51.52224 -0.15708000      NW1 Barons Court 51.49028 -0.214340
#> 6:         Bank 51.51340 -0.08905843     EC3V Barons Court 51.49028 -0.214340
#> 7: Baker Street 51.52224 -0.15708000      NW1    Bayswater 51.51224 -0.187569
#> 8:         Bank 51.51340 -0.08905843     EC3V    Bayswater 51.51224 -0.187569
#>    i.postcode   dist_m
#> 1:       EC1V 3732.422
#> 2:       EC1V 2423.989
#> 3:       EC1A 4111.786
#> 4:       EC1A 1026.091
#> 5:        W14 5328.649
#> 6:        W14 9054.998
#> 7:         W2 2387.231
#> 8:         W2 6825.897

Created on 2021-04-07 by the reprex package (v1.0.0)

mharinga
  • 1,708
  • 10
  • 23