I have a large dataset where each row is a station. I need to find the closest station within each year but where a different type of equipment was used. I then want to either combine these rows into a new dataset where I have the lat/long and other station info for each pair of stations replicated next to each other in the same row, OR have some kind of index so I know which rows are related. I have managed to do this following this answer and plotted it, but it seems some stations have been linked to stations which are obviously not the closest. I don't understand if this is due to the way I have plotted the data or the way I have joined the closest stations - I would appreciate any pointers with this! I would also be interested in a more efficient way!
Thanks in advance for any help!!
Example code:
library(ggplot2)
library(plotly)
library(sf)
#data
set.seed(123)
latitude <- runif(100, 72, 81)
longitude <- runif(100, 20, 60)
gear <- factor(sample(1:2, 100, replace = TRUE))
year <- factor(sample(c(2020, 2021), 100, replace = TRUE))
orig.data <- data.frame(latitude, longitude, gear, year)
orig.data$lat<-orig.data$latitude # duplicating lat/long columns
orig.data$lon<-orig.data$longitude
df = st_as_sf(orig.data, coords=5:6) # making last 2 columns sf coordinates
# creating distance matrix
dm = st_distance(df)
ijd = data.frame(expand.grid(i=1:nrow(dm), j=1:nrow(dm)))
ijd$distance = c(dm)
# these following lines are a clunky way of copying the important info for each station pair
ijd$year.i = df$year[ijd$i]
ijd$year.j = df$year[ijd$j]
ijd$gear.i = df$gear[ijd$i]
ijd$gear.j = df$gear[ijd$j]
ijd$latitude.j = df$latitude[ijd$j]
ijd$longitude.j = df$longitude[ijd$j]
ijd$latitude.i = df$latitude[ijd$i]
ijd$longitude.i = df$longitude[ijd$i]
# Filter out different gears and keep matching years.
# This ensures a point can't be a nearest neighbour of itself.
ijd = ijd[ijd$year.i == ijd$year.j,]
ijd = ijd[ijd$gear.i != ijd$gear.j,]
# selecting the closest stations
# Split into data frames for each i point.
ijd.split = split(ijd, ijd$i)
nearest = function(d){
d = d[order(d$distance),]
d[1:min(c(nrow(d),1)),]
}
dn = lapply(ijd.split,nearest)
nnij = do.call(rbind, dn)
# removing duplicated equipment types
nnij2<-subset(nnij, as.factor(gear.i)==1)
# plotting closest stations using 'geom_segment'
# plot clearly shows some stations are joined to ones further away than the logical 'closest' station
ggplot(data = nnij2, aes(x = longitude.i, y = latitude.i, shape = gear.i))+geom_point()+geom_point(data = nnij2, aes(x = longitude.j, y = latitude.j, shape = gear.j))+
geom_segment(data = nnij2, aes(x = longitude.i, y = latitude.i, xend = longitude.j, yend = latitude.j, colour = distance))+
facet_wrap(~year.i)
# issue persists when projecting coordinates
ggplotly(basemap(limits=c(25,40,72,79))+
geom_spatial_point(data = nnij2, aes(x = longitude.i, y = latitude.i, shape = gear.i)) +
geom_spatial_point(data = nnij2, aes(x = longitude.j, y = latitude.j, shape = gear.j))+
geom_spatial_segment(data = nnij2, aes(x = longitude.i, y = latitude.i, xend = longitude.j, yend = latitude.j, colour = distance))+
facet_wrap(~year.i))
The red arrows in the image highlight one of the questionable joins - the top point should have been joined to the one on the right, but instead has been linked to the one below.