
I'm trying to add a "distance" column to a huge dataframe (nearly 6 million rows) that stores coordinates in the columns start_lng, start_lat, end_lng and end_lat.

I have tried the following:

trips$distance <- distm(c(trips$start_lng, trips$start_lat), c(trips$end_lng, trips$end_lat), fun = distHaversine)

to which I get:

"Error in .pointsToMatrix(x) : Wrong length for a vector, should be 2"

I checked the answers here, and the solution should be:

trips %>%
  rowwise() %>%
  mutate(distance = distHaversine(c(trips$start_lng, trips$start_lat), c(trips$end_lng, trips$end_lat)))

but I still get the same error: "base::stop("Wrong length for a vector, should be 2")"

I have also tried using cbind() instead of c(), but then I get "cannot allocate vector of size 123096.7 Gb"

Phil
bsiq

1 Answer


Using c() joins the two vectors together, so c(trips$end_lng, trips$end_lat) isn't of length 2: its length is twice the number of rows in your data set. This is why the approach isn't working.
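A quick way to see this is with a made-up three-row data frame (the name `trips_demo` and its values are purely illustrative, not taken from the question's data):

```r
# Hypothetical three-row data frame, for illustration only
trips_demo <- data.frame(
  start_lng = c(56.2, 57.3, 56.2),
  start_lat = c(76.2, 73.3, 76.2)
)

# c() on two whole columns concatenates them into one long vector
p <- c(trips_demo$start_lng, trips_demo$start_lat)
length(p)   # 6, i.e. twice the number of rows

# Taking a single row gives the length-2 point (lng, lat) that is expected
p1 <- c(trips_demo$start_lng[1], trips_demo$start_lat[1])
length(p1)  # 2
```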

Your second approach is almost correct (although you don't need to use trips$), see this small example:

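library(dplyr)  # provides %>%, rowwise() and mutate()

trips <- tibble::tibble(
  start_lng = c(56.2, 57.3, 56.2, 58.3),
  start_lat = c(76.2, 73.3, 76.2, 78.3),
  end_lng = c(56.3, 57.1, 56.5, 58.2),
  end_lat = c(75.2, 74.3, 75.3, 77.3)
)
# After rowwise(), bare column names inside mutate() refer to the current row,
# so each c() call builds a length-2 point (lng, lat)
trips %>% 
  rowwise() %>% 
  mutate(distance = geosphere::distHaversine(c(start_lng, start_lat),
                                             c(end_lng, end_lat)))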

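The "cannot allocate vector of size 123096.7 Gb" message is an error caused by insufficient RAM. If it came from the distm() call, that is expected: distm() computes the distance between every pair of points, so with millions of rows the resulting matrix is far too large to fit in memory.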
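Because rowwise() evaluates the function once per row, it can be slow on millions of rows. As a minimal sketch of an alternative, the haversine formula can be written so that it works on whole columns at once in base R. The helper name `haversine_m` is hypothetical, and the default radius of 6378137 m is an assumption chosen to match the default documented for geosphere::distHaversine; treat this as an illustration rather than a drop-in replacement for the package function.

```r
# Vectorised haversine distance in metres, base R only.
# `haversine_m` is a hypothetical helper; r defaults to 6378137 m (assumed Earth radius).
haversine_m <- function(lng1, lat1, lng2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * r * asin(pmin(1, sqrt(a)))
}

# Same small example data as above, as a plain data frame
trips <- data.frame(
  start_lng = c(56.2, 57.3, 56.2, 58.3),
  start_lat = c(76.2, 73.3, 76.2, 78.3),
  end_lng   = c(56.3, 57.1, 56.5, 58.2),
  end_lat   = c(75.2, 74.3, 75.3, 77.3)
)

# One call over whole columns: one distance per row, no pairwise matrix
trips$distance <- haversine_m(trips$start_lng, trips$start_lat,
                              trips$end_lng, trips$end_lat)
```

Here using `trips$` is fine because whole columns are exactly what the function expects, and the result is assigned back so the new column is kept. geosphere::distHaversine() is also documented to accept two-column matrices of points, so passing cbind(start_lng, start_lat) and cbind(end_lng, end_lat) to it directly (rather than to distm()) should give one distance per row as well.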

nrennie
  • What's the difference between using trips$ and not using it? I'm assuming using it would pass the whole column as a huge vector, is that it? Is it because of rowwise(), or should I avoid it generally after trips %>% ? (It's been processing for a while with no errors, but oh it's taking its time... Can't say it works yet heh) – bsiq Feb 12 '23 at 01:43
  • It worked and returned the df with an additional column, but the column was not saved in trips :( – bsiq Feb 12 '23 at 01:46
  • If you want to save it, you'd need to assign it to something e.g. `trips = trips %>% rowwise() %>% ...` – nrennie Feb 12 '23 at 01:52
  • I would generally avoid using `trips$` - it's not needed in the {tidyverse} for this type of piped workflow. It also means that the data you're passing into `mutate()` isn't the same data you're calling your vectors from in the distance function - it's fine in this case, but could easily cause bigger problems if there were more steps before the mutate function. – nrennie Feb 12 '23 at 01:56