
I have two data frames: one with 0.8 million rows of X and Y coordinates, and another with 70,000 rows of X and Y coordinates. I want the logic and R code to associate each point in data frame 1 with the closest point in data frame 2. Is there a standard package to do this?

I am currently running a nested for loop, but this is very slow because it iterates 0.8 million * 70,000 times.
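For reference, the brute-force nested loop described above looks roughly like this (a minimal sketch on tiny made-up data, with hypothetical column names `x` and `y`):

```r
# Brute-force nearest neighbour: for each row of df1, scan every row of df2.
# O(nrow(df1) * nrow(df2)) comparisons -- far too many at 8e5 x 7e4.
set.seed(1)
df1 <- data.frame(x = runif(5), y = runif(5))
df2 <- data.frame(x = runif(3), y = runif(3))

closest <- integer(nrow(df1))
for (i in seq_len(nrow(df1))) {
  best <- Inf
  for (j in seq_len(nrow(df2))) {
    # squared Euclidean distance (no sqrt needed when only ranking)
    d <- (df1$x[i] - df2$x[j])^2 + (df1$y[i] - df2$y[j])^2
    if (d < best) {
      best <- d
      closest[i] <- j
    }
  }
}
closest  # for each row of df1, the index of its nearest row in df2
```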

M--
Nitin Wader
    Please add some data (only a snippet, e.g. using `dput(head(your_data))`), code and your expected output. – Roman Oct 24 '16 at 09:26
  • For geospatial data see http://stackoverflow.com/questions/31766351/calculating-the-distance-between-points-in-different-data-frames, for euclidian distance see http://stackoverflow.com/questions/26720367/how-to-find-the-distance-between-two-data-frames and http://stackoverflow.com/questions/22231773/calculating-the-euclidean-dist-between-each-row-of-a-dataframe-with-all-other-ro. I found these by googling for `r calculate distance between two data.frames`. Also look through the other hits from that google search, there is quite a lot already available. – Paul Hiemstra Oct 24 '16 at 13:59

1 Answer


I found a faster way to get the expected result using the data.table library:

library(data.table)

time0 <- Sys.time()

Here is some random data:

df1 <- data.table(x = runif(8e5), y = runif(8e5))
df2 <- data.table(x = runif(7e4), y = runif(7e4))

Assuming (x, y) are coordinates in an orthonormal coordinate system, you can rank points by the square of the Euclidean distance (the square root is unnecessary when you only need the minimum):

dist <- function(a, b){
  # squared distance from point (a, b) to every row of df2;
  # which.min returns the row index of the closest point
  dt <- data.table((df2$x - a)^2 + (df2$y - b)^2)
  return(which.min(dt$V1))
}

And now you can apply this function to your data to get the expected result:

results <- df1[, j = list(Closest =  dist(x, y)), by = 1:nrow(df1)]

time1 <- Sys.time()
print(time1 - time0)

It took around 30 minutes to get the result on a slow computer.
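As an aside on the "standard package" part of the question: dedicated nearest-neighbour packages avoid the full pairwise scan entirely by building a spatial index. A hedged sketch, assuming the RANN package (a kd-tree wrapper around the ANN C++ library) is installed:

```r
library(RANN)  # kd-tree nearest-neighbour search

set.seed(1)
df1 <- data.frame(x = runif(1e4), y = runif(1e4))  # query points
df2 <- data.frame(x = runif(5e3), y = runif(5e3))  # reference points

# For each row of df1, find the single nearest row of df2.
nn <- nn2(data = df2, query = df1, k = 1)
head(nn$nn.idx)   # index in df2 of the closest point
head(nn$nn.dists) # corresponding Euclidean distance
```

Because each kd-tree query is roughly O(log n) rather than O(n), this kind of approach typically finishes the full 8e5 x 7e4 problem in seconds rather than minutes.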

EDIT:

As asked, I have tried several other solutions, using sapply and adply from the plyr package. I tested these solutions on smaller data frames so the benchmark runs in a reasonable time.

library(data.table)
library(plyr)
library(microbenchmark)

########################
## Test 1: data.table ##
########################

dt1 <- data.table(x = runif(1e4), y = runif(1e4))
dt2 <- data.table(x = runif(5e3), y = runif(5e3))

dist1 <- function(a, b){
  dt <- data.table((dt2$x - a)^2 + (dt2$y - b)^2)
  return(which.min(dt$V1))
}

results1 <- function() return(dt1[, j = list(Closest =  dist1(x, y)), by = 1:nrow(dt1)])

###################
## Test 2: adply ##
###################

df1 <- data.frame(x = runif(1e4), y = runif(1e4))
df2 <- data.frame(x = runif(5e3), y = runif(5e3))

dist2 <- function(df){
  dt <- data.table((df2$x - df$x)^2 + (df2$y - df$y)^2)
  return(which.min(dt$V1))
}

results2 <- function() return(adply(.data = df1, .margins = 1, .fun = dist2))

####################
## Test 3: sapply ##
####################

df1 <- data.frame(x = runif(1e4), y = runif(1e4))
df2 <- data.frame(x = runif(5e3), y = runif(5e3))

dist2 <- function(df){
  dt <- data.table((df2$x - df$x)^2 + (df2$y - df$y)^2)
  return(which.min(dt$V1))
}

results3 <- function() return(sapply(1:nrow(df1), function(x) return(dist2(df1[x,]))))

microbenchmark(results1(), results2(), results3(), times = 20)

#Unit: seconds
#       expr      min       lq     mean   median       uq      max neval
# results1() 4.046063 4.117177 4.401397 4.218234 4.538186 5.724824    20
# results2() 5.503518 5.679844 5.992497 5.886135 6.041192 7.283477    20
# results3() 4.718865 4.883286 5.131345 4.949300 5.231807 6.262914    20

The first solution seems to be significantly faster than the other two. This is even more true for larger datasets.

Frank
Hugo
  • +1! In this question people already suggest solutions: http://stackoverflow.com/questions/22231773/calculating-the-euclidean-dist-between-each-row-of-a-dataframe-with-all-other-ro. I would be very interested to see how you solution fares against those (I think it should be a fair bit faster). – Paul Hiemstra Oct 24 '16 at 14:00
  • @PaulHiemstra Isn't it a perfect dupe? – Frank Oct 24 '16 at 15:27
  • @Frank I am not sure as the solution suggested in the other post may not be adapted to data frames with that size... – Hugo Oct 24 '16 at 15:32
  • Thanks Frank, I used your logic and built up my code like - below -- – Nitin Wader Oct 25 '16 at 06:50
  • `MyTestData1 <- MyTestData1[c(1:2)]; dist <- function(a, b){ dt <- data.table(abs(MyDataFrame$V1 - a) + abs(MyDataFrame$V2 - b)); return(which.min(dt$V3)) }; results <- MyTestData1[, j = list(closest = dist(x, y)), by = 1:nrow(MyTestData1)]` – Nitin Wader Oct 25 '16 at 06:51
  • Error in `[.data.frame`(MyTestData1, , j = list(closest = dist(x, y)), : unused argument (by = 1:nrow(MyTestData1)) – Nitin Wader Oct 25 '16 at 06:52
  • 1
    @frank the solution here uses data.table, and thus is probably quite fast. – Paul Hiemstra Oct 25 '16 at 08:54
  • @NitinWader please provide a reproducible example in your question above that reproduces the error you show. – Paul Hiemstra Oct 25 '16 at 08:55
  • @NitinWader In order to work, `MyTestData` has to be a `data.table` object. You can easily do that like this : `library(data.table) ; MyTestData <- data.table(MyTestData)` – Hugo Oct 28 '16 at 11:39
  • I did conversion from data frame to data table. Program is running but it is damn slow. – Nitin Wader Nov 08 '16 at 04:51
  • @NitinWader it took 29 minutes to run on my computer. – Hugo Nov 08 '16 at 08:48