
I am trying to find the distance between points in two different data frames given that they have the same value in one of their columns.

I figure the first step is to join or relate the data in the two data frames. For example, data frames A and B both contain lat/long information and share the column Name. Note that for a given Name the lat/long information differs between the two data frames, which is why I want to calculate the distance between them.

I envision the final function being something like: if A$Name == B$Name, then use the corresponding lat/long data to calculate the distance between them.

Any thoughts?

Example data:

A <- data.frame(Lat=1:4,Long=1:4,Name=c("a","b","c","d"))
B <- data.frame(Lat=5:8,Long=5:8,Name=c("a","b","c","d"))

Now I want to relate A and B so that, wherever A$Name == B$Name, I can calculate the distance between the two points from their corresponding lat/long data.

I should also note that a straightforward Euclidean distance will not be enough, because the points are in water and the path between them has to stay in the water (or within some bounding area). Any help with that would be appreciated as well.

Jaap
wraymond
  • You should come up with an [MRE](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) demonstrating your input and desired output. It looks like you want to merge on one column and then calculate the distance. The merge will be the key. – MichaelChirico Aug 01 '15 at 21:33

2 Answers


Without a reproducible example, all I can do is offer you a general solution.

I like data.table and the syntax here will look very simple. Check out the Getting Started vignettes for more on the package.

I'm going to create two data.tables that match your general description first:

library(data.table)
set.seed(1734)
A<-data.table(Name=1:10,x=rnorm(10),key="Name")
B<-data.table(Name=1:10,y=rnorm(10),key="Name")

Now we want to merge A and B by Name (a keyed join needs a key, which I've conveniently set already) and then use the x and y columns to calculate a distance. Here sqrt(x^2+y^2) stands in for whatever distance formula fits your coordinates. To do so is simple:

A[B,distance:=sqrt(x^2+y^2)]

The distance you seek is now stored in the data.table A under the column distance. If you don't want to store the distance, and just want the output, you could do: A[B,sqrt(x^2+y^2)].
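If each table carries both coordinates, as in the question's example data, the same join can compute the distance between the two matched points. The following is only a sketch under that assumption: it reuses the question's Lat, Long and Name columns and treats the coordinates as planar, so the result is in coordinate units rather than metres. Inside the join, unprefixed column names refer to A and the i. prefix refers to the table inside the brackets (B).

```r
library(data.table)

# Example data from the question; B is deliberately in a different row order.
A <- data.table(Lat = 1:4, Long = 1:4, Name = c("a", "b", "c", "d"), key = "Name")
B <- data.table(Lat = 8:5, Long = 8:5, Name = c("d", "c", "b", "a"), key = "Name")

# Keyed join on Name: Lat/Long come from A, i.Lat/i.Long from the matched row of B.
A[B, distance := sqrt((Lat - i.Lat)^2 + (Long - i.Long)^2)]

print(A)
```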

To start from scratch if A and B are already stored as data.frames, it's not much more complicated:

setDT(A,key="Name")[setDT(B,key="Name"),distance:=sqrt(x^2+y^2)]

We've used the convenient setDT function to convert A and B (in-line) to a data.table by reference, simultaneously declaring the key to be Name for both*.

*It may not be strictly necessary to set the key of B, but I think it is good practice to do so. Also, at the time of writing the key argument of setDT was only available in the development version of data.table (1.9.5+); with a version that lacks it, use setkey(setDT(A),Name), etc.
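Spelled out, the setkey(setDT(...)) variant mentioned in the footnote could look like the sketch below. The data frames are stand-ins built the same way as the data.tables earlier in this answer, and sqrt(x^2+y^2) is the same placeholder formula.

```r
library(data.table)
set.seed(1734)

# Plain data.frames standing in for data that was not created as data.tables.
A <- data.frame(Name = 1:10, x = rnorm(10))
B <- data.frame(Name = 1:10, y = rnorm(10))

# Convert by reference, then set the key in a separate step.
setkey(setDT(A), Name)
setkey(setDT(B), Name)

# Same keyed join as before; the new column is added to A by reference.
A[B, distance := sqrt(x^2 + y^2)]

print(key(A))
```

Releases from 1.9.6 onwards also accept an on= argument (for example A[B, on = "Name", ...]), which joins without setting a key at all.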

MichaelChirico

For calculating the distance between lat/long points, you can use the geosphere package. It offers several formulas for calculating the distance: distCosine, distHaversine, distVincentySphere and distVincentyEllipsoid. Each of these can be called directly on paired points, or passed to the distm function (through its fun argument) to get a full distance matrix. The last one is considered the most accurate (according to the package author).
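As a rough sketch of how distm relates to the individual formula functions, the example below builds the full matrix of distances between every point in A and every point in B, then checks that the same-name pairs sit on its diagonal. It assumes both data frames list the names in the same order, as in the question's example data.

```r
library(geosphere)

A <- data.frame(Lat = 1:4, Long = 1:4, Name = c("a", "b", "c", "d"))
B <- data.frame(Lat = 5:8, Long = 5:8, Name = c("a", "b", "c", "d"))

# 4 x 4 matrix: entry [i, j] is the distance from row i of A to row j of B.
m <- distm(A[, c("Long", "Lat")], B[, c("Long", "Lat")],
           fun = distVincentyEllipsoid)
dimnames(m) <- list(A$Name, B$Name)

# Calling the formula function directly returns only the row-wise pairs.
paired <- distVincentyEllipsoid(A[, c("Long", "Lat")], B[, c("Long", "Lat")])

print(m)
print(paired)
```

For a one-to-one comparison the direct call is the cheaper choice, since distm computes every pair.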

library(geosphere)

A <- data.frame(Lat=1:4, Long=1:4, Name=c("a","b","c","d"))
B <- data.frame(Lat=5:8, Long=5:8, Name=c("a","b","c","d"))

A$distance <- distVincentyEllipsoid(A[,c('Long','Lat')], B[,c('Long','Lat')])

This gives:

> A
  Lat Long Name distance
1   1    1    a 627129.5
2   2    2    b 626801.7
3   3    3    c 626380.6
4   4    4    d 625866.6

Note that you have to include the lat/long columns in the order of first longitude and then latitude.
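To illustrate why the column order matters, here is a small sketch with two made-up points one degree of longitude apart at latitude 52 (the coordinates are hypothetical and chosen only for illustration). Swapping the order makes geosphere read the latitude as a longitude, which silently yields a different distance.

```r
library(geosphere)

# Hypothetical points, given as c(longitude, latitude).
p1 <- c(4, 52)
p2 <- c(5, 52)

correct <- distHaversine(p1, p2)            # longitude first, as geosphere expects
swapped <- distHaversine(rev(p1), rev(p2))  # latitude first: the wrong order

print(c(correct = correct, swapped = swapped))
```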


Although this works on this simple example, it pairs the rows of A and B purely by position. In larger datasets where the names are not in the same order, that will lead to wrong results. In that case you can use data.table and set the keys so you can match the points and calculate the distance (as @MichaelChirico did in his answer):

library(data.table)
A <- data.table(Lat=1:4, Long=1:4, Name=c("a","b","c","d"), key="Name")
B <- data.table(Lat=8:5, Long=8:5, Name=c("d","c","b","a"), key="Name")

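# Inside the join, unprefixed columns come from A and i.-prefixed ones from B,
# so each distance uses the two rows that were matched on Name.
A[B, distance := distVincentyEllipsoid(cbind(Long, Lat), cbind(i.Long, i.Lat))]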

As you can see, this gives the correct (i.e., the same) result as the previous method:

> A
   Lat Long Name distance
1:   1    1    a 627129.5
2:   2    2    b 626801.7
3:   3    3    c 626380.6
4:   4    4    d 625866.6
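If you would rather stay in base R, the same matching can be sketched with merge(), which pairs rows on Name regardless of their order. The .A and .B suffixes are names chosen here to tell the two sets of coordinates apart; they are not required by either package.

```r
library(geosphere)

A <- data.frame(Lat = 1:4, Long = 1:4, Name = c("a", "b", "c", "d"))
B <- data.frame(Lat = 8:5, Long = 8:5, Name = c("d", "c", "b", "a"))

# Match rows on Name; suffixes disambiguate the shared Lat/Long column names.
AB <- merge(A, B, by = "Name", suffixes = c(".A", ".B"))

# Longitude first, then latitude, for both sets of points.
AB$distance <- distVincentyEllipsoid(AB[, c("Long.A", "Lat.A")],
                                     AB[, c("Long.B", "Lat.B")])

print(AB)
```

The distances should agree with the data.table result above, since the same points end up paired.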

To see what key="Name" does, compare the following two data.tables. B1 is stored sorted by Name, while B2 keeps the row order in which it was created:

B1 <- data.table(Lat=8:5, Long=8:5, Name=c("d","c","b","a"), key="Name")
B2 <- data.table(Lat=8:5, Long=8:5, Name=c("d","c","b","a"))
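One way to inspect the difference is sketched below: key() reports the key columns, and printing the Name column shows the stored row order.

```r
library(data.table)

B1 <- data.table(Lat = 8:5, Long = 8:5, Name = c("d", "c", "b", "a"), key = "Name")
B2 <- data.table(Lat = 8:5, Long = 8:5, Name = c("d", "c", "b", "a"))

print(key(B1))   # the key column of B1
print(key(B2))   # NULL: B2 has no key

print(B1$Name)   # rows were reordered by Name when the key was set
print(B2$Name)   # rows remain in creation order
```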

See also this answer for a more elaborate example.

Jaap