calculating geographic distance between two rows in data.table

Question

My question is basically the same as this: calculating distance between two row in a data.table but I am looking for the answer using data.table syntax, not a for loop.

I have a data.table like this:

Lat      Lon      Time                   Bus
52.21808 20.96675 2018-04-20 21:27:26    3
52.25882 20.89850 2018-04-20 21:27:23    8
52.24347 21.08460 2018-04-20 21:27:27    1
52.21935 20.97186 2018-04-20 21:28:31    3
52.25808 20.89790 2018-04-20 21:28:32    8
52.24541 21.08522 2018-04-20 21:28:36    1

I want to calculate the distance between two consecutive points, grouping by Bus, using e.g. distGeo from the geosphere package. So something like:

d[,distance:=distGeo(c(Lon, Lat), ???????),by=Bus]

EDIT I get somewhat useful results using

d[,distance:=distGeo(cbind(Lon, Lat)),by=Bus]

but not completely correct: there is a warning that one item for each group needs to be recycled. Is there a way to get NA in either the first or the last row for each bus?

EDIT 2 Looks like I have it.

d[,distance:=c(distGeo(cbind(Lon, Lat)),NA) ,by=Bus]

If there are exactly two points per Bus, `distGeo(c(Lon[1], Lat[1]), c(Lon[2], Lat[2]))`, I guess. If there might be more than two points, maybe look at `?shift`. I am not familiar with the syntax for distGeo and the example above is not easy to copy-paste into R to reproduce. — Frank, Apr 24 '18 at 17:47
I do believe, `distGeo` does take a `matrix` as an argument. Instead of `c(Lon,Lat) maybe you should look into `cbind(Lon,Lat)` only...In that case I do not think you will need the second argument?? — Onyambu, Apr 24 '18 at 17:53
@Onyambu this seems to be working! The only thing is I get a warning because there is one row where answer is undefined: "Supplied 34 items to be assigned to group 1 of size 35 in column 'distance' (recycled leaving remainder of 1 items)." and so on — MonikaP, Apr 24 '18 at 18:07
d[,distance:=distGeo(cbind(Lon, Lat)),by=Bus] works. For a given row, I get the distance between this row and next, for the last row the first distance is recycled which is misleading, I'd prefer to get NA there, or ideally NA in first row — MonikaP, Apr 24 '18 at 18:14

SymbolixAU · Answer 1 · 2018-05-01T21:51:57.827

Create two new columns by shifting the Lat/Lon rows up one place:

setorder(dt, Bus)

dt[, `:=`(Lat_to = shift(Lat, type = "lead"),
          Lon_to = shift(Lon, type = "lead")),
     by = Bus]

Use this function I wrote for this answer (it's a more efficient, data.table-style haversine calculation)

dtHaversine <- function(lat_from, lon_from, lat_to, lon_to, r = 6378137){
  radians <- pi/180
  lat_to <- lat_to * radians
  lat_from <- lat_from * radians
  lon_to <- lon_to * radians
  lon_from <- lon_from * radians
  dLat <- (lat_to - lat_from)
  dLon <- (lon_to - lon_from)
  a <- (sin(dLat/2)^2) + (cos(lat_from) * cos(lat_to)) * (sin(dLon/2)^2)
  return(2 * atan2(sqrt(a), sqrt(1 - a)) * r)
}

Apply it

dt[, dist := dtHaversine(Lat, Lon, Lat_to, Lon_to)]

dt
#         Lat      Lon       Date     Time Bus   Lat_to   Lon_to      dist
# 1: 52.24347 21.08460 2018-04-20 21:27:27   1 52.24541 21.08522 220.05566
# 2: 52.24541 21.08522 2018-04-20 21:28:36   1       NA       NA        NA
# 3: 52.21808 20.96675 2018-04-20 21:27:26   3 52.21935 20.97186 376.08498
# 4: 52.21935 20.97186 2018-04-20 21:28:31   3       NA       NA        NA
# 5: 52.25882 20.89850 2018-04-20 21:27:23   8 52.25808 20.89790  91.96366
# 6: 52.25808 20.89790 2018-04-20 21:28:32   8       NA       NA        NA

Data

library(data.table)

dt <- fread(
'Lat      Lon      Date         Time          Bus
52.21808 20.96675 2018-04-20 21:27:26    3
52.25882 20.89850 2018-04-20 21:27:23    8
52.24347 21.08460 2018-04-20 21:27:27    1
52.21935 20.97186 2018-04-20 21:28:31    3
52.25808 20.89790 2018-04-20 21:28:32    8
52.24541 21.08522 2018-04-20 21:28:36    1')

An exmaple on 1million rows

set.seed(123)
dt <- data.table(Lat = sample(-90:90, 1e6, replace = T),
                                 Lon = sample(-90:90, 1e6, replace = T),
                                 Bus = rep(1:5e5,2))


setorder(dt, Bus)
system.time({
    dt[, `:=`(Lat_to = shift(Lat, type = "lead"),
              Lon_to = shift(Lon, type = "lead")),
         by = Bus]
    dt[, dist := dtHaversine(Lat, Lon, Lat_to, Lon_to)] 
})
#  user  system elapsed 
# 7.985   0.033   8.020

won't creating two new columns be inefficient if I have a large dataset? My actual dataset has 1,5 MB. — MonikaP, May 01 '18 at 18:13
@MonikaP it should be quick; 1.5m is not really a lot of rows. And there's only one way for you to find out - try it. — SymbolixAU, May 01 '18 at 19:09
@MonikaP I've added an example on 1million rows to show you the speed. — SymbolixAU, May 01 '18 at 21:52

score 0 · Answer 2 · answered Apr 24 '18 at 18:44

Here is a solution using package gmt:

require(data.table)
require(gmt)

set.seed(123)
some_latlon <- data.table(id = sample(x = 1:2, size = 10, replace = TRUE),
                          xfrom = runif(n = 10, min = 3, max = 6),
                          yfrom = runif(n = 10, min = 52, max = 54))

setkey(some_latlon, id)
some_latlon[, xto := c(xfrom[-1], NA), by = id]
some_latlon[, yto := c(yfrom[-1], NA), by = id]

some_latlon[, dist := geodist(Nfrom = yfrom, Efrom = xfrom,
                              Nto = yto, Eto = xto, units = "km"), by = id]

You can easily remove cols xto and yto of course. HTH

I think this is similar to the previous answer in that it creates new columns. — MonikaP, May 01 '18 at 18:15

score 0 · Answer 3 · answered Dec 04 '19 at 13:24

geodist::geodist would work too and it is faster than geosphere::distHaversine.

require(data.table)
require(microbenchmark)

d = 
fread( 
'
Lat,Lon,Time,Bus
52.21808,20.96675,2018-04-20 21:27:26,3
52.25882,20.89850,2018-04-20 21:27:23,8
52.24347,21.08460,2018-04-20 21:27:27,1
52.21935,20.97186,2018-04-20 21:28:31,3
52.25808,20.89790,2018-04-20 21:28:32,8
52.24541,21.08522,2018-04-20 21:28:36,1
')

setorder(d, Bus, Time)

microbenchmark(

 d[, dist_geodist := geodist::geodist(cbind(Lat, Lon),
      measure='haversine', sequential = TRUE) , by = Bus]
,
 d[,dist_geosphere := geosphere::distHaversine(cbind(Lon, Lat) ) , by=Bus]   
 )

Unit: microseconds
                                                                                                                expr      min
 d[, `:=`(dist_geodist, geodist::geodist(cbind(Lat, Lon), measure = "haversine",      sequential = TRUE)), by = Bus]  861.937
                                 d[, `:=`(dist_geosphere, geosphere::distHaversine(cbind(Lon,      Lat))), by = Bus] 1005.890
        lq      mean    median       uq      max neval cld
  868.7585  910.8999  875.4555  920.138 1463.567   100  a 
 1016.2335 1065.2952 1028.3775 1070.428 1738.151   100   b

calculating geographic distance between two rows in data.table

3 Answers3

Data