1

I have a large dataset where each row is a station. I need to find the closest station within each year but where a different type of equipment was used. I then want to either combine these rows into a new dataset where I have the lat/long and other station info for each pair of stations replicated next to each other in the same row, OR have some kind of index so I know which rows are related. I have managed to do this following this answer and plotted it, but it seems some stations have been linked to stations which are obviously not the closest. I don't understand if this is due to the way I have plotted the data or the way I have joined the closest stations - I would appreciate any pointers with this! I would also be interested in a more efficient way!

Thanks in advance for any help!!

Example code:

library(ggplot2)
library(plotly)
library(sf)

#data
set.seed(123)
latitude <- runif(100, 72, 81)
longitude <- runif(100, 20, 60)
gear <- factor(sample(1:2, 100, replace = TRUE))
year <- factor(sample(c(2020, 2021), 100, replace = TRUE))
orig.data <- data.frame(latitude, longitude, gear, year)


orig.data$lat<-orig.data$latitude # duplicating lat/long columns 
orig.data$lon<-orig.data$longitude
df = st_as_sf(orig.data, coords=5:6) # making last 2 columns sf coordinates
# creating distance matrix
dm = st_distance(df)
ijd = data.frame(expand.grid(i=1:nrow(dm), j=1:nrow(dm)))
ijd$distance = c(dm)

# these following lines are a clunky way of copying the important info for each station pair
ijd$year.i = df$year[ijd$i] 
ijd$year.j = df$year[ijd$j]
ijd$gear.i = df$gear[ijd$i]
ijd$gear.j = df$gear[ijd$j]
ijd$latitude.j = df$latitude[ijd$j]
ijd$longitude.j = df$longitude[ijd$j]
ijd$latitude.i = df$latitude[ijd$i]
ijd$longitude.i = df$longitude[ijd$i]

# Filter out different gears and keep matching years. 
# This ensures a point can't be a nearest neighbour of itself.
ijd = ijd[ijd$year.i == ijd$year.j,]
ijd = ijd[ijd$gear.i != ijd$gear.j,]

# selecting the closest stations
# Split into data frames for each i point.
ijd.split = split(ijd, ijd$i)

nearest = function(d){
  d = d[order(d$distance),]
  d[1:min(c(nrow(d),1)),]
}

dn = lapply(ijd.split,nearest)
nnij = do.call(rbind, dn)

# removing duplicated equipment types
nnij2<-subset(nnij, as.factor(gear.i)==1)

# plotting closest stations using 'geom_segment'
# plot clearly shows some stations are joined to ones further away than the logical 'closest' station
ggplot(data = nnij2, aes(x = longitude.i, y = latitude.i, shape = gear.i))+geom_point()+geom_point(data = nnij2, aes(x = longitude.j, y = latitude.j, shape = gear.j))+
  geom_segment(data = nnij2, aes(x = longitude.i, y = latitude.i, xend = longitude.j, yend = latitude.j, colour = distance))+
  facet_wrap(~year.i)

# issue persists when projecting coordinates
ggplotly(basemap(limits=c(25,40,72,79))+
           geom_spatial_point(data = nnij2, aes(x = longitude.i, y = latitude.i, shape = gear.i)) +
           geom_spatial_point(data = nnij2, aes(x = longitude.j, y = latitude.j, shape = gear.j))+
           geom_spatial_segment(data = nnij2, aes(x = longitude.i, y = latitude.i, xend = longitude.j, yend = latitude.j, colour = distance))+
           facet_wrap(~year.i))

The red arrows in the image highlight one of the questionable joins - the top point should have been joined to the one on the right, but instead has been linked to the one below.

The red arrows here highlight one of the questionable joins - the top point should have been joined to the one on the right, but instead has been linked to the one below.

John Polo
  • 547
  • 1
  • 8
  • 25
GenieV
  • 33
  • 5
  • 1
    Please see [how to make a minimal, reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). I wouldn't call your code minimal - it's quite long and hard to follow (especially with variables called `dn`, `ijd`, `dg`, `dm`, `n3` etc.). It's also not reproducible - you need to `set.seed()` if you're using functions like `runif()`. Finally it's unclear (to me at least) how the output you're getting is not what you expect, or in fact what output you do expect. You're much likelier to get an answer if you amend the question. – SamR Mar 21 '23 at 17:04
  • Thanks - I have updated my code to make the variable names easier to follow. – GenieV Mar 21 '23 at 17:39
  • When you call `df = st_as_sf(orig.data, coords=5:6)` you are switching the coordinates around, it expects longitude first then latitude. This probably is not the cause of your issue but should be corrected. – Stuart Allen Mar 24 '23 at 03:41
  • There are a couple of issues here. Firstly, add a line to your `ggplot()` call `+ coord_equal()`. This will force the axes to use the same scale, and you'll see that in fact your code, as written, works - in the sense that for each "gear 1" station it will successfully find the nearest "gear 2" station. BUT, you're not necessarily finding for each "gear 2" station the nearest "gear 1" station. This is because of the line `nnij2<-subset(nnij, as.factor(gear.i)==1)`. This eliminates rows of data that may actually be the shortest path between two points. You need to reconsider your methodology. – Stuart Allen Mar 24 '23 at 06:10

0 Answers0