I have data in the format below (about 1 million rows):

head(dt)
   pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude
1:        -74.00394        40.74289         -73.99337         40.73425
2:        -73.97386        40.75219         -73.95870         40.77253
3:        -73.95441        40.76442         -73.97078         40.75835
4:        -73.96234        40.76722         -73.97551         40.75687
5:        -74.00466        40.70743         -73.99937         40.72152
6:        -73.99557        40.71602         -73.99997         40.74332

library(data.table)  # needed for data.table() and the := operator
library(geosphere)
dt = data.table(pickup_longitude = c(-74.00394, -73.97386, -73.95441, -73.96234, -74.00466, -73.99557),
            pickup_latitude = c(40.74289, 40.75219, 40.76442, 40.76722, 40.70743, 40.71602), 
            dropoff_longitude = c(-73.99337, -73.95870, -73.97078, -73.97551, -73.99937, -73.99997),
            dropoff_latitude = c(40.73425, 40.77253, 40.75835, 40.75687, 40.72152, 40.74332))
# One distm call per row: pickup point (lon, lat) vs. dropoff point (lon, lat)
dt[, distance := apply(dt, 1, function(t) distm(x = c(t[1], t[2]), y = c(t[3], t[4])))]

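I used apply in the code above because distm in the geosphere package does not return one distance per row: it builds a matrix of distances between every point in x and every point in y. However, calling it once per row through apply takes a long time on about 1 million rows.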

I have also tried:

dt[, distance := distm(x = c(pickup_longitude, pickup_latitude), y = c(dropoff_longitude, dropoff_latitude)), by = 1:nrow(dt)]

Is there a better and faster way to calculate these distances?

Kartheek Palepu

1 Answer

I tried distHaversine, which takes two-column (longitude, latitude) matrices and works row by row:

dt[, distance := distHaversine(matrix(c(pickup_longitude, pickup_latitude), ncol = 2),
                        matrix(c(dropoff_longitude, dropoff_latitude), ncol = 2))]

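This worked perfectly fine. distHaversine is vectorized, so a single call computes the distance for every row, and with the default radius the result is in metres.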
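For anyone curious why the vectorized call is so much faster than the row-by-row apply, here is a minimal base R sketch of the haversine formula operating on whole vectors at once. The helper name haversine_m is hypothetical (it is not part of geosphere), and the radius 6378137 m is assumed to match the default used by distHaversine; treat it as an illustration, not a replacement for the package function.

```r
# Vectorized haversine distance in metres, base R only.
# haversine_m is a hypothetical helper name, not a geosphere function.
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin guards against 'a' creeping slightly above 1 through rounding
  2 * r * asin(sqrt(pmin(1, a)))
}

# First three rows of the sample data from the question
pickup_longitude  <- c(-74.00394, -73.97386, -73.95441)
pickup_latitude   <- c(40.74289, 40.75219, 40.76442)
dropoff_longitude <- c(-73.99337, -73.95870, -73.97078)
dropoff_latitude  <- c(40.73425, 40.77253, 40.75835)

# All arithmetic is element-wise, so one call handles every row
distance <- haversine_m(pickup_longitude, pickup_latitude,
                        dropoff_longitude, dropoff_latitude)
```

Because every operation inside the function is element-wise, the same call should also work directly on data.table columns, for example inside dt[, distance := haversine_m(...)], without any by = or apply.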

Kartheek Palepu