3

I have two large dataframes called intersections (representing intersections of a street system) and users (representing users of a network) as follows:

intersections has three columns: x,y and label_street. They respectively represent the position of an intersection in a squared observation window (say [0,5] x [0,5]) and the street it is located on. Here is an example:


intersections <- data.frame(x=c(0.147674, 0.235356, 0.095337, 0.147674), y=c(0.132956, 0.150813, 0.087345, 0.132956), label_street = c(5,6,5,6))

head(intersections)

            x        y label_street
1    0.147674 0.132956            5
2    0.235356 0.150813            6
3    0.095337 0.087345            5
4    0.147674 0.132956            6

An intersection being located at the crossing of several streets, every (x,y) combination in the intersections table appears at least twice, but with different label_street (e.g. rows 1 and 4 in the previous example). The label_street may not be the row number (which is why it starts at 5 in my example).

users has 4 columns: x,y, label_street, ID. They respectively represent the position of a user, the street it is located on and a unique ID per user. There are no duplicates in this dataframe, as a user is located on a unique street and has a unique ID. Here is an example (the ID and the label_street may not be the row number)


users <- data.frame(x = c(0.20428152, 0.17840619, 0.12964668, 0.20423856, 0.19349761, 0.10861251), y = c(0.14448448, 0.13921481, 0.11724543, 0.14447573, 0.14228827, 0.09891443), label_street = c(6,6,5,6,6,5), ID = c(2703, 3460, 4325, 12506, 19753, 21282))


head(users)
              x          y label_street      ID
1    0.20428152 0.14448448            6    2703
2    0.17840619 0.13921481            6    3460
3    0.12964668 0.11724543            5    4325
4    0.20423856 0.14447573            6   12506
5    0.19349761 0.14228827            6   19753
6    0.10861251 0.09891443            5   21282

What I want to do is the following: for each point (x,y) of intersections, get the ID and the distance to its closest neighbour sharing the same street_label in users

I have a working solution using spatstat function nncross for nearest neighbour searching and plyr function adply for working on the data.

My working solution is as follows:

1) Write a user-defined function which gets the ID and the distance to the nearest neighbour of a row in a query table

 NN <- function(row,query){
 df <- row
 window <- c(0,5,0,5)   #Need this to convert to ppp objects and compute NN distance using nncross
 NN <- nncross(as.ppp(row[,1:2],window),as.ppp(query[,1:2],window))
 df$NN.ID <- query$ID[NN$which]
 df$dist <- NN$dist
 return(df)
}

2) Apply this user-defined function row-wise to my dataframe "intersections" with the query being the subset of users sharing the same street_label as the row :

result <- adply(intersections, 1, function(row) NN(row, users[users$label_street == row$label_street, ])

The result is as follows on the example:

head(result)
           x           y    label_street     NN.ID         NN.dist
1   0.147674    0.132956               5      4325      0.02391247
2   0.235356    0.150813               6      2703      0.03171236
3   0.095337    0.087345               5     21282      0.01760940
4   0.147674    0.132956               6      3460      0.03136304


Since my real dataframes will be huge, I think computing distance matrices for looking at the nearest neighbour won't be efficient and that adply will be slow. Does anyone have an idea of a data.table like solution? I only now about the basics of data.table and have always found it very efficient compared to plyr.

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
QLG
  • 77
  • 5
  • I would use the package `RANN`, finding nearest neighbours in O(N * log(N)) time (i.e., fast), use a loop over the different levels of `label_street`, calling `RANN::nn2` for each level (only selecting rows with these corresponding levels in users and intersect). In `RANN::nn2` you can define the data and query (intersect and users) separately. The output is the row IDs and distances of nearest neighbours. Some book-keeping will be required to guarantee you obtain the correct indices in the end (i.e., across the different levels of `label_street`). – M. Papenberg Mar 04 '20 at 11:42
  • Many thanks on your answer. How would you loop then? `split` by `label_street` and `ddply`? – QLG Mar 04 '20 at 11:55
  • Honestly, I would just use a for loop. Seems to be the most simple thing here, and I guess that you only itereate over few labels - very few in comparison to your very large data I guess? The advantage of the for loop is that you can rather easily select the correct rows from both of your data.frames in each iteration. But maybe other convenient options are available as well that I currently do not consider. – M. Papenberg Mar 04 '20 at 12:01
  • Or, you define a function that finds nearest neighbours, returning the correct IDs according to your column `ID` and then use `lapply`? Maybe that is better ... – M. Papenberg Mar 04 '20 at 12:03
  • There are possibly quite a large number of labels actually (in average `nrow(intersections)/2`). Also, what do you mean with the lapply solution? Not sure to get it right – QLG Mar 04 '20 at 12:04
  • Sorry, I haven't thought it through entirely. I would try a for loop solution, but the efficiency may be seriously undermined if there are so many `street_labels` (but another kind of loop will probably not help much here). RANN is efficient because it can be applied to the whole data set but somewhat loses its point if you have to apply it to almost every other row. – M. Papenberg Mar 04 '20 at 12:07
  • Thanks, I'll give it a try though. Anyway, my solution is so slow that I'm in for any other solution – QLG Mar 04 '20 at 12:09
  • I just had an idea: How about you add a column with a very large value for each category (i.e., 1000 for label_street = 1, 2000 for label_street = 2 etc). Then you should be able to call `RANN::nn2` only onced on the entire data set (with the columns x, y, and the new column), while ensuring that neighbours are within the same label_street? Then no loop is needed at all! – M. Papenberg Mar 04 '20 at 12:14

1 Answers1

2

This solution uses the RANN package to find nearest neighbours. The trick is to first ensure that elements with different label_street have a higher distance between them than elements within the same label_street. We do this by adding an additional numeric column with a very large value that is constant within the same label_street but different between different values of label_street. In total, you get:

intersections <- data.frame(x=c(0.147674, 0.235356, 0.095337, 0.147674), y=c(0.132956, 0.150813, 0.087345, 0.132956), label_street = c(5,6,5,6))
users <- data.frame(x = c(0.20428152, 0.17840619, 0.12964668, 0.20423856, 0.19349761, 0.10861251), y = c(0.14448448, 0.13921481, 0.11724543, 0.14447573, 0.14228827, 0.09891443), label_street = c(6,6,5,6,6,5), number = c(2703, 3460, 4325, 12506, 19753, 21282))

# add a numeric column that is constant within each category and has a very large value
intersections$label_street_large <- intersections$label_street * 1e6
users$label_street_large <- users$label_street * 1e6

# call the nearest neighbour function (k = 1 neighbour)
nearest_neighbours <- RANN::nn2(
  intersections[, c("x", "y", "label_street_large")],
  users[, c("x", "y", "label_street_large")],
  k = 1
)

# get original IDs and distances
IDs <- users$number[c(nearest_neighbours$nn.idx)]
distances <- c(nearest_neighbours$nn.dists)

IDs
# [1]  3460 12506  2703  3460  3460  4325
distances
# [1] 0.03171236 0.03136304 0.02391247 0.03175620 0.04271763 0.01760940

I hope this helps you. It should be very fast because it only call nn2 once, which runs in O(N * log(N)) time.

M. Papenberg
  • 502
  • 2
  • 7
  • Thanks! That seems great to me! I have one question about `RANN:nn2` though: what happens when either `query` or `data` is empty? For instance, I might have a `street_label`appearing in `intersections`with no matching rows in `users`. I think one needs to add an exception for such a case, right? Another point: don't we need to reverse the order of `intersections` and `users`here? I think that we should use `data=users` and `query=intersections` here, as we want the NN of each `intersection` row – QLG Mar 04 '20 at 12:38
  • Actually, it's pretty straightforward: we just need to remove from `intersections` those rows that have no matching in `users` and then perform your solution. It's extremely fast! Thank you so much! – QLG Mar 04 '20 at 13:25
  • Glad I could help =) – M. Papenberg Mar 04 '20 at 15:31