13

From a list of 10,000 stations with decimal coordinates, I am trying to identify stations which are within 100 feet of each other based on the distance calculated between these stations and create a subset of these stations. In the final list I want to have the names of the stations which are within 100 feet of each other, their latitude and longitudes and the distance between them.

I found similar questions for other platforms like mathworks (using rangesearch) or in SQL or JAVA but none in R.

Is there a way to do this in R? The closest answer I found was in Listing number of obervations by location which lists the number of observations within a distance, but seems the answers were incomplete and cannot determine the stations which are within a particular distance of each other.

Basically I am trying to figure out which stations are co-located.

I'd really appreciate any help with this.

Community
  • 1
  • 1
user2653586
  • 209
  • 4
  • 11
  • 1
    You should be able to achieve this with the `dist` function – Señor O Jan 07 '14 at 21:53
  • can you give a *small* reproducible example? – Ben Bolker Jan 07 '14 at 21:54
  • Please provide some reproducible example with usable data. – Romain Francois Jan 07 '14 at 21:54
  • I was trying rdist.earth from the "fields" package, but it was generating 100x100 matrix if I use 100 points. So for 10,000 points won't it be a huge matrix? – user2653586 Jan 07 '14 at 22:06
  • I am including below a small portion of the dataset. If the data frame with this data is dpts, when I use `dist1 <- rdist.earth(dpts)` it creates a 9x9 vector with the distances. I am not sure what unit it calculates the distances in. How do I get the shorter list with those stations within (say) 50 meters? 1 -74.20139 39.82806 2 -74.20194 39.82806 3 -74.20167 39.82806 4 -74.20197 39.82824 5 -74.20150 39.82814 6 -74.26472 39.66639 7 -74.17389 39.87111 8 -74.07224 39.97353 9 -74.07978 39.94554 – user2653586 Jan 07 '14 at 22:18

3 Answers3

12

Two approaches.

The first creates a distance matrix using earth.dist(...) in the fossil package, and then takes advantage of data.tables to assemble the table of results.

The second uses distHaversine(...) in the geosphere package to calculate distances and assemble the final colocation table in one step. The latter approach may or may not be faster, but will certainly be more memory efficient, as it never stores the full distance matrix. Also, this approach is amenable to using other distance measures in geosphere, e.g., distVincentySphere(...), distVincentyEllipsoid(...), or distMeeus(...).

Note that the actual distances are slightly different, probably because earth.dist(...) and distHaversine(...) use slightly different estimates for the radius of the earth. Also, note that both approaches here rely on station numbers for IDs. If the stations have names, the code will have to be modified slightly.

First Approach: Using earth.dist(...)

df = read.table(header=T,text="long lat
                1 -74.20139 39.82806
                2 -74.20194 39.82806 
                3 -74.20167 39.82806 
                4 -74.20197 39.82824 
                5 -74.20150 39.82814 
                6 -74.26472 39.66639 
                7 -74.17389 39.87111 
                8 -74.07224 39.97353 
                9 -74.07978 39.94554")              # your sample data
library(fossil)                                     # for earth.dist(...)
library(data.table)
sep.ft   <- 200                                     # critical separation (feet)
sep.km   <- sep.ft*0.0003048                        # critical separation (km)
m        <- as.matrix(earth.dist(df))               # distance matrix in km
coloc    <- data.table(which(m<sep.km, arr.ind=T))  # pairs of stations with dist<200 ft
setnames(coloc,c("row","col"),c("ST.1","ST.2"))     # rename columns to reflect station IDs
coloc    <- coloc[ST.1<ST.2,]                       # want only lower triagular part
coloc[,dist:=m[ST.1,ST.2]/0.0003048,by="ST.1,ST.2"] # append distances in feet
remove(m)                                           # don't need distance matrix anymore...
stations <- data.table(id=as.integer(rownames(df)),df)
setkey(stations,id)
setkey(coloc,ST.1)
coloc[stations,c("long.1","lat.1"):=list(long,lat),nomatch=0]
setkey(coloc,ST.2)
coloc[stations,c("long.2","lat.2"):=list(long,lat),nomatch=0]

Produces this:

coloc
#     ST.1 ST.2      dist    long.1    lat.1    long.2    lat.2
#  1:    1    2 154.13436 -74.20139 39.82806 -74.20194 39.82806
#  2:    1    3  78.46840 -74.20139 39.82806 -74.20167 39.82806
#  3:    2    3  75.66596 -74.20194 39.82806 -74.20167 39.82806
#  4:    1    4 175.31180 -74.20139 39.82806 -74.20197 39.82824
#  5:    2    4  66.22069 -74.20194 39.82806 -74.20197 39.82824
#  6:    3    4 106.69018 -74.20167 39.82806 -74.20197 39.82824
#  7:    1    5  42.45634 -74.20139 39.82806 -74.20150 39.82814
#  8:    2    5 126.71608 -74.20194 39.82806 -74.20150 39.82814
#  9:    3    5  55.87449 -74.20167 39.82806 -74.20150 39.82814
# 10:    4    5 136.67612 -74.20197 39.82824 -74.20150 39.82814

Second Approach: Using distHaversine(...)

library(data.table)
library(geosphere)
sep.ft   <- 200                       # critical separation (feet)
stations <- data.table(id=as.integer(rownames(df)),df)

d <- function(x){                     # distance between station[i] and all subsequent stations
  r.ft <- 6378137*3.28084             # radius of the earth, in feet
  if (x[1]==nrow(stations)) return()  # don't process last row
  ref <- stations[(x[1]+1):nrow(stations),]
  z <- distHaversine(ref[,2:3,with=F],x[2:3], r=r.ft)
  z <- data.table(ST.1=x[1], ST.2=ref$id, dist=z, long.1=x[2], lat.1=x[3], long.2=ref$long, lat.2=ref$lat)
  return(z[z$dist<sep.ft,])
}
coloc.2 = do.call(rbind,apply(stations,1,d))

Produces this:

coloc.2
#     ST.1 ST.2      dist    long.1    lat.1    long.2    lat.2
#  1:    1    2 154.26350 -74.20139 39.82806 -74.20194 39.82806
#  2:    1    3  78.53414 -74.20139 39.82806 -74.20167 39.82806
#  3:    1    4 175.45868 -74.20139 39.82806 -74.20197 39.82824
#  4:    1    5  42.49191 -74.20139 39.82806 -74.20150 39.82814
#  5:    2    3  75.72935 -74.20194 39.82806 -74.20167 39.82806
#  6:    2    4  66.27617 -74.20194 39.82806 -74.20197 39.82824
#  7:    2    5 126.82225 -74.20194 39.82806 -74.20150 39.82814
#  8:    3    4 106.77957 -74.20167 39.82806 -74.20197 39.82824
#  9:    3    5  55.92131 -74.20167 39.82806 -74.20150 39.82814
# 10:    4    5 136.79063 -74.20197 39.82824 -74.20150 39.82814
jlhoward
  • 58,004
  • 7
  • 97
  • 140
11

Here is just some random example data

set.seed(1234)
x= sample(1:100,50)
y= sample(1:100,50)
M=cbind(x,y)
plot(M)

enter image description here

You calculate the distances as a matrix so that the original rows can be extracted easily. That can be done using the which function with arr.ind=T, like so:

DM= as.matrix(dist(M))
neighbors=which(DM < 5, arr.ind=T)
neighbors= neighbors[neighbors[,1]!=neighbors[,2]]

So you can identify the points which are say less than 5 units of euclidean distance separate (after removing self-self relations):

points(M[neighbors,], col="red" )

enter image description here

Stephen Henderson
  • 6,340
  • 3
  • 27
  • 33
2

I came across this solution which revolves on using a k-nearest-neighbours algorithm to find all points within a distance. It's much more effective than using the dist function which can be really long to compute on large grids, but it gives you the neighbours for only one point, which can be costly if you are to do that on every point. The main advantage is that neighbouring first avoids computing the distance for the whole grid while you're actually searching for only a part of the grid.

Please note that I did not take into account conversion from longitude/latitude to coordinates X/Y which is another subject.

Result of finding neighbours within a circle

The only downside is that you have to be sure that you select a large enough number of neighbours to look for. This implies a little analysis of your grid before searching for the neighbours (i.e on average, how many neighbours do you expect within a certain radius around your interest point ?)

if (!requireNamespace('FNN', quietly = TRUE)) install.packages('FNN')
knn_circle <- function(coordinates, vars = c('x', 'y'),
                   target = numeric(2), r = numeric(0), k = 10){

  # Find the row index of the target point
  target_row_number <- which(coordinates[[vars[1]]] == target[1] & 
    coordinates[[vars[2]]] == target[2])

  # Get k-nearest neighbours matrixes for all points in `coordinates`
  neighbours <- FNN::get.knn(data = coordinates[ , vars], k = k) 

  # Find col indexes of neighbours of target point that have a distance smaller
  # than `r`in nn.dist object
  neighbours_col_indexes <- which(neighbours$nn.dist[target_row_number, ] <= r)

  # Get the row indexes in `coordinates` of the neighbours from nn.index object
  neighbours_row_indexes <- neighbours$nn.index[target_row_number, 
    neighbours_col_indexes]

  # Uncomment to get also the target_point itself
  # neighbours_row_indexes <- c(target_row_number, neighbours_row_indexes)

  # Return the input data with only rows from the neighbours
  coordinates[neighbours_row_indexes, ]
}

This will return the input grid with all columns and only points that are located within a certain distance around your input target point. Here's an exemple

test_grid <- expand.grid(
  x = runif(n = 100, max = 10),
  y = runif(n = 50, max = 10)
)
test_grid$z <- paste('station', row.names(test_grid))

# Input target point as vector
target_point <- unlist(test_grid[5, c('x','y')])

within_stations <- knn_circle(
  coordinates = test_grid,
  target = target_point,
  r = 2, k = 1000
)

Finally this code allow you to visualise what is happening, using a circle function from that answer.

circleFun <- function(center = c(0,0), r = 1, npoints = 100){
  tt <- seq(0,2*pi,length.out = npoints)
  xx <- center[1] + r * cos(tt)
  yy <- center[2] + r * sin(tt)
  return(data.frame(x = xx, y = yy))
}

if (!requireNamespace('ggplot2', quietly = True)) install.packages('ggplot2')
ggplot(mapping = aes(x = x, y = y)) +
  # Draw whole grid
  geom_point(data = test_grid, color = '#666666',
             size = 0.5, alpha = 0.5) +
  # Draw circle
  geom_point(data = circleFun(center = target_point, r = 2, npoints = 1000),
             color = '#333333', size = 0.5) +
  # Draw within circle grid
  geom_point(data = within_stations, color = 'darkred', size = 0.5) +
  # Alleviate theme
  theme(plot.background = element_blank(),
    panel.background = element_blank(),
    axis.ticks = element_blank()) +
  labs(x = '', y = '')

Be careful to choose a sufficient large number of neighbours. Here is the same resulting picture but with k = 50, which is too few for a 50x100 uniform grid and a rather large radius.

The algorithm misses some points because of too few neighbours considered

Romain
  • 1,931
  • 1
  • 13
  • 24