I have two large dataframes called intersections
(representing intersections of a street system) and users
(representing users of a network) as follows:
intersections
has three columns: x
,y
and label_street
. They respectively represent the position of an intersection in a squared observation window (say [0,5] x [0,5]) and the street it is located on. Here is an example:
intersections <- data.frame(x=c(0.147674, 0.235356, 0.095337, 0.147674), y=c(0.132956, 0.150813, 0.087345, 0.132956), label_street = c(5,6,5,6))
head(intersections)
x y label_street
1 0.147674 0.132956 5
2 0.235356 0.150813 6
3 0.095337 0.087345 5
4 0.147674 0.132956 6
An intersection being located at the crossing of several streets, every (x,y)
combination in the intersections
table appears at least twice, but with different label_street
(e.g. rows 1 and 4 in the previous example). The label_street
may not be the row number (which is why it starts at 5 in my example).
users
has 4 columns: x
,y
, label_street
, ID
. They respectively represent the position of a user, the street it is located on and a unique ID
per user. There are no duplicates in this dataframe, as a user is located on a unique street and has a unique ID
. Here is an example (the ID
and the label_street
may not be the row number)
users <- data.frame(x = c(0.20428152, 0.17840619, 0.12964668, 0.20423856, 0.19349761, 0.10861251), y = c(0.14448448, 0.13921481, 0.11724543, 0.14447573, 0.14228827, 0.09891443), label_street = c(6,6,5,6,6,5), ID = c(2703, 3460, 4325, 12506, 19753, 21282))
head(users)
x y label_street ID
1 0.20428152 0.14448448 6 2703
2 0.17840619 0.13921481 6 3460
3 0.12964668 0.11724543 5 4325
4 0.20423856 0.14447573 6 12506
5 0.19349761 0.14228827 6 19753
6 0.10861251 0.09891443 5 21282
What I want to do is the following: for each point (x,y)
of intersections
, get the ID
and the distance to its closest neighbour sharing the same street_label
in users
I have a working solution using spatstat
function nncross
for nearest neighbour searching and plyr
function adply
for working on the data.
My working solution is as follows:
1) Write a user-defined function which gets the ID and the distance to the nearest neighbour of a row in a query table
NN <- function(row,query){
df <- row
window <- c(0,5,0,5) #Need this to convert to ppp objects and compute NN distance using nncross
NN <- nncross(as.ppp(row[,1:2],window),as.ppp(query[,1:2],window))
df$NN.ID <- query$ID[NN$which]
df$dist <- NN$dist
return(df)
}
2) Apply this user-defined function row-wise to my dataframe "intersections" with the query being the subset of users sharing the same street_label as the row :
result <- adply(intersections, 1, function(row) NN(row, users[users$label_street == row$label_street, ])
The result is as follows on the example:
head(result)
x y label_street NN.ID NN.dist
1 0.147674 0.132956 5 4325 0.02391247
2 0.235356 0.150813 6 2703 0.03171236
3 0.095337 0.087345 5 21282 0.01760940
4 0.147674 0.132956 6 3460 0.03136304
Since my real dataframes will be huge, I think computing distance matrices for looking at the nearest neighbour won't be efficient and that adply
will be slow.
Does anyone have an idea of a data.table
like solution? I only now about the basics of data.table
and have always found it very efficient compared to plyr
.