Here is an example of a problem I am trying to solve and will then implement on a much larger database:

I have a sparse grid of points across the New World, with latitude and longitude defined as below.

LAT<-rep(-5:5*10, 5)
LON<-rep(seq(-140, -60, by=20), each=11)

I know the color of some points on my grid

COLOR<-(c(NA,NA,NA,"black",NA,NA,NA,NA,NA,"red",NA,NA,"green",NA,"blue","blue",NA,"blue",NA,NA,"yellow",NA,NA,"yellow",NA,
  NA,NA,"blue",NA,NA,NA,NA,NA,NA,NA,"black",NA,"blue","blue",NA,"blue",NA,NA,"yellow",NA,NA,NA,NA,"red",NA,NA,"green",NA,"blue","blue"))
data<-as.data.frame(cbind(LAT,LON,COLOR))

What I want to do is replace the NA values in COLOR with the color of the closest point (by distance) that has a known color. In the actual implementation I am not too worried about ties, but I suppose they are possible (I could probably fix those by hand).

Thanks

  • I reckon if you split the data frame into those with colours and those without you could feed it into FNN::get.knnx(colours,blanks) and use the fast nearest neighbour code... Hmmm... – Spacedman Aug 20 '12 at 16:52

2 Answers

Yup.

First, make your data frame with data.frame; otherwise cbind coerces every column to character:

data<-data.frame(LAT=LAT,LON=LON,COLOR=COLOR)

Split the data frame up - you could probably do this in one go but this makes things a bit more obvious:

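# Rows whose colour is missing (to be filled) and rows whose colour is known (donors)
query = data[is.na(data$COLOR),]
colours = data[!is.na(data$COLOR),]
library(FNN)
# For each query point, find the single nearest donor (k=1) in LAT/LON space
neighs = get.knnx(colours[,c("LAT","LON")],query[,c("LAT","LON")],k=1)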

Now insert the replacement colours directly into the data dataframe:

data[is.na(data$COLOR),"COLOR"]=colours$COLOR[neighs$nn.index]
plot(data$LON,data$LAT,col=as.character(data$COLOR),pch=19)

Note, however, that distance is being computed with Pythagorean (flat-plane) geometry on lat-long, which isn't accurate because the earth isn't flat. You might have to transform your coordinates to something else first.
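
If the flat-plane approximation matters for your data, one option is to swap in a great-circle distance. The following is only a rough sketch in base R, not part of the FNN approach above: the haversine helper, the toy data frame pts, and the 6371 km mean Earth radius are all assumptions made for illustration, and the brute-force distance matrix will not scale the way a k-d tree does.

```r
# Great-circle (haversine) distance in km between points given in degrees.
# R = 6371 is an assumed mean Earth radius; a sphere is itself an approximation.
haversine <- function(lat1, lon1, lat2, lon2, R = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * R * asin(pmin(1, sqrt(a)))
}

# Made-up example: two points of known colour, two to fill in.
pts <- data.frame(LAT = c(0, 60, 10, 50),
                  LON = c(-100, -100, -80, -120),
                  COLOR = c("red", "blue", NA, NA),
                  stringsAsFactors = FALSE)

known <- which(!is.na(pts$COLOR))
blank <- which(is.na(pts$COLOR))

# Distance matrix: rows = blank points, columns = known points.
D <- outer(blank, known, function(i, j)
  haversine(pts$LAT[i], pts$LON[i], pts$LAT[j], pts$LON[j]))

# For each blank point, take the colour of the closest known point.
pts$COLOR[blank] <- pts$COLOR[known[apply(D, 1, which.min)]]
```

For a large dataset it would probably be better to project the coordinates first and keep using get.knnx on the projected values.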

Spacedman
  • This is great. Thank you. I will try it out. I thought of that last issue, but it's not a large issue for the actual dataset - distances are quite close (I am finding the nearest country to points just off the coast of that country) – user1612278 Aug 20 '12 at 17:09

I came up with this solution, but Spacedman's seems much better. Note that I also assume the Earth is flat here :)

# First coerce to numeric from factor:
data$LAT <- as.numeric(as.character(data$LAT))
data$LON <- as.numeric(as.character(data$LON))

n <- nrow(data)

# Compute Euclidean distances:
Dist <- outer(1:n,1:n,function(i,j)sqrt((data$LAT[i]-data$LAT[j])^2 + (data$LON[i]-data$LON[j])^2))

# Dummy second data:
data2 <- data

# Loop over data to fill:
for (i in seq_len(n))
{
  if (is.na(data$COLOR[i]))
  {
    # Order all points nearest-first, then take the first one with a known colour
    ord <- order(Dist[i,])
    data$COLOR[i] <- data2$COLOR[ord[!is.na(data2$COLOR[ord])][1]]
  }
}
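
The same idea can probably be written without the loop. This is a minimal sketch on a made-up data frame (toy is hypothetical, not the question's data), still assuming a flat Earth: set the distance to every uncoloured point to Inf so it can never be chosen, then take the row-wise minimum.

```r
# Made-up toy data with the same column layout as in the question.
toy <- data.frame(LAT = c(0, 0, 0, 0, 0),
                  LON = c(0, 10, 20, 30, 40),
                  COLOR = c("red", NA, NA, "blue", NA),
                  stringsAsFactors = FALSE)

# Full Euclidean distance matrix on (LAT, LON), as in the loop version.
D <- as.matrix(dist(toy[, c("LAT", "LON")]))

# Points without a colour can never be a donor: push their columns to Inf.
D[, is.na(toy$COLOR)] <- Inf

# Nearest donor for every row; only rows with a missing colour are overwritten.
nearest <- apply(D, 1, which.min)
blank <- is.na(toy$COLOR)
toy$COLOR[blank] <- toy$COLOR[nearest[blank]]
```

Note that which.min returns the first minimum, so a tie is silently resolved in favour of the donor that appears earlier in the data frame.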
Sacha Epskamp