5

I want to calculate the average geographical distance between a number of houses per province.

Suppose I have the following data.

df1 <- data.frame(province = c(1, 1, 1, 2, 2, 2),
                  house = c(1, 2, 3, 4, 5, 6),
                  lat = c(-76.6, -76.5, -76.4, -75.4, -80.9, -85.7), 
                  lon = c(39.2, 39.1, 39.3, 60.8, 53.3, 40.2))

Using the geosphere library I can find the distance between two houses. For instance:

library(geosphere)
distm(c(df1$lon[1], df1$lat[1]), c(df1$lon[2], df1$lat[2]), fun = distHaversine)

#11429.1

How do I calculate the distance between all the houses within each province and then get the mean distance per province?

The original data set has millions of observations per province, so performance is an issue here, too.

M--
wake_wake

7 Answers

7

My initial idea was to look at the source code of distHaversine and replicate it in a function that I would use with proxy. That would work like this (note that lon is expected to be the first column):

library(geosphere)
library(dplyr)
library(proxy)

df1 <- data.frame(province = as.integer(c(1, 1, 1, 2, 2, 2)),
                  house = as.integer(c(1, 2, 3, 4, 5, 6)),
                  lat = c(-76.6, -76.5, -76.4, -75.4, -80.9, -85.7), 
                  lon = c(39.2, 39.1, 39.3, 60.8, 53.3, 40.2))

custom_haversine <- function(x, y) {
  toRad <- pi / 180

  diff <- (y - x) * toRad
  dLon <- diff[1L]
  dLat <- diff[2L]

  a <- sin(dLat / 2) ^ 2 + cos(x[2L] * toRad) * cos(y[2L] * toRad) * sin(dLon / 2) ^ 2
  a <- min(a, 1)
  # return
  2 * atan2(sqrt(a), sqrt(1 - a)) * 6378137
}

pr_DB$set_entry(FUN=custom_haversine, names="haversine", loop=TRUE, distance=TRUE)

average_dist <- df1 %>%
  select(-house) %>%
  group_by(province) %>%
  group_map(~ data.frame(avg=mean(proxy::dist(.x[ , c("lon", "lat")], method="haversine"))))

However, if you're expecting millions of rows per province, proxy probably won't be able to allocate the intermediate (lower triangular of the) matrices. So I ported the code to C++ and added multi-threading as a bonus:
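As a rough back-of-the-envelope sketch of why that intermediate matrix is the problem (the 1e6 row count is purely illustrative):

n <- 1e6
n * (n - 1) / 2                 # ~5e11 pairwise distances in the lower triangular
n * (n - 1) / 2 * 8 / 1024^4    # ~3.6 TiB just to hold them as doubles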

EDIT: it turns out the s2d helper was far from optimal; this version now uses the formulas given here.

EDIT2: I just found out about RcppThread, and it can be used to detect user interrupts.

// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(RcppParallel,RcppThread)]]

#include <cstddef> // size_t
#include <math.h> // sin, cos, sqrt, atan2, pow
#include <vector>

#include <RcppThread.h>
#include <Rcpp.h>
#include <RcppParallel.h>

using namespace std;
using namespace Rcpp;
using namespace RcppParallel;

// single to double indices for lower triangular of matrices without diagonal
void s2d(const size_t id, const size_t nrow, size_t& i, size_t& j) {
  j = nrow - 2 - static_cast<size_t>(sqrt(-8 * id + 4 * nrow * (nrow - 1) - 7) / 2 - 0.5);
  i = id + j + 1 - nrow * (nrow - 1) / 2 + (nrow - j) * ((nrow - j) - 1) / 2;
}

class HaversineCalculator : public Worker
{
public:
  HaversineCalculator(const NumericVector& lon,
                      const NumericVector& lat,
                      double& avg,
                      const int n)
    : lon_(lon)
    , lat_(lat)
    , avg_(avg)
    , n_(n)
    , cos_lat_(lon.length())
  {
    // terms for distance calculation
    for (size_t i = 0; i < cos_lat_.size(); i++) {
      cos_lat_[i] = cos(lat_[i] * 3.1415926535897 / 180);
    }
  }

  void operator()(size_t begin, size_t end) {
    // for Kahan summation
    double sum = 0;
    double c = 0;

    double to_rad = 3.1415926535897 / 180;

    size_t i, j;
    for (size_t ind = begin; ind < end; ind++) {
      if (RcppThread::isInterrupted(ind % static_cast<int>(1e5) == 0)) return;

      s2d(ind, lon_.length(), i, j);

      // haversine distance
      double d_lon = (lon_[j] - lon_[i]) * to_rad;
      double d_lat = (lat_[j] - lat_[i]) * to_rad;
      double d_hav = pow(sin(d_lat / 2), 2) + cos_lat_[i] * cos_lat_[j] * pow(sin(d_lon / 2), 2);
      if (d_hav > 1) d_hav = 1;
      d_hav = 2 * atan2(sqrt(d_hav), sqrt(1 - d_hav)) * 6378137;

      // the average part
      d_hav /= n_;

      // Kahan sum step
      double y = d_hav - c;
      double t = sum + y;
      c = (t - sum) - y;
      sum = t;
    }

    mutex_.lock();
    avg_ += sum;
    mutex_.unlock();
  }

private:
  const RVector<double> lon_;
  const RVector<double> lat_;
  double& avg_;
  const int n_;
  tthread::mutex mutex_;
  vector<double> cos_lat_;
};

// [[Rcpp::export]]
double avg_haversine(const DataFrame& input, const int nthreads) {
  NumericVector lon = input["lon"];
  NumericVector lat = input["lat"];

  double avg = 0;
  int size = lon.length() * (lon.length() - 1) / 2;
  HaversineCalculator hc(lon, lat, avg, size);

  int grain = size / nthreads / 10;
  RcppParallel::parallelFor(0, size, hc, grain);
  RcppThread::checkUserInterrupt();

  return avg;
}

This code won't allocate any intermediate matrix; it simply calculates the distance for each pair of what would be the lower triangular and accumulates the values into an average at the end. See here for the Kahan summation part.
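For reference, the compensated (Kahan) summation used in operator() looks like this in plain R (an illustrative sketch only; the C++ version above is what actually runs):

kahan_sum <- function(x) {
  s <- 0
  comp <- 0
  for (v in x) {
    y <- v - comp        # subtract the compensation carried so far
    t <- s + y
    comp <- (t - s) - y  # recover the low-order bits lost when adding y to s
    s <- t
  }
  s
}

kahan_sum(runif(1e5))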

If you save that code in, say, haversine.cpp, then you can do the following:

library(dplyr)
library(Rcpp)
library(RcppParallel)
library(RcppThread)

sourceCpp("haversine.cpp")

df1 %>%
  group_by(province) %>%
  group_map(~ data.frame(avg=avg_haversine(.x, parallel::detectCores())))
# A tibble: 2 x 2
# Groups:   province [2]
  province     avg
     <int>   <dbl>
1        1  15379.
2        2 793612.

Here's a sanity check too:

pr_DB$set_entry(FUN=geosphere::distHaversine, names="distHaversine", loop=TRUE, distance=TRUE)

df1 %>%
  select(-house) %>%
  group_by(province) %>%
  group_map(~ data.frame(avg=mean(proxy::dist(.x[ , c("lon", "lat")], method="distHaversine"))))

A word of caution though:

df <- data.frame(lon=runif(1e3, -90, 90), lat=runif(1e3, -90, 90))

system.time(proxy::dist(df, method="distHaversine"))
   user  system elapsed 
 34.353   0.005  34.394

system.time(proxy::dist(df, method="haversine"))
   user  system elapsed 
  0.789   0.020   0.809

system.time(avg_haversine(df, 4L))
   user  system elapsed 
  0.054   0.000   0.014

df <- data.frame(lon=runif(1e5, -90, 90), lat=runif(1e5, -90, 90))

system.time(avg_haversine(df, 4L))
   user  system elapsed 
 73.861   0.238  19.670

You'll probably have to wait quite a while if you have millions of rows...
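To put "quite a while" into numbers, a naive quadratic extrapolation from the 1e5-row timing above (a back-of-the-envelope sketch, not a measurement):

pairs <- function(n) n * (n - 1) / 2
19.7 * pairs(1e6) / pairs(1e5) / 60   # roughly 33 minutes per 1e6-row province with 4 threads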

I should also mention that it's not possible to detect user interrupts inside the threads created through RcppParallel, so if you start the calculation you should either wait until it finishes or restart R/RStudio entirely. See EDIT2 above.


Regarding complexity

Depending on your actual data and how many cores your computer has, you may very well end up waiting days for the calculation to finish. This problem has quadratic complexity (per province, so to speak). This line:

int size = lon.length() * (lon.length() - 1) / 2;

is the number of (haversine) distance calculations that must be performed. With N rows in a province there are N(N-1)/2 pairs, so if the number of rows increases by a factor of n, the number of calculations increases by a factor of roughly n^2.
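For a concrete sense of that growth (simple arithmetic, nothing machine-specific):

n_pairs <- function(n) n * (n - 1) / 2
n_pairs(c(1e3, 1e4, 1e5, 1e6))
# ~5e5, ~5e7, ~5e9, ~5e11 distance calculations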

There is no way around this if you want the exact average; you can't calculate the average of N(N-1)/2 numbers without actually computing each one first, and you'll have a hard time finding something faster than multi-threaded C++ code, so you'll either have to wait it out or throw more cores at the problem, either with a single machine or with many machines working together.

Alexis
  • it is much faster than my solution when the number of rows increases in magnitude (1e3, 1e4, ...) – Etienne Kintzler Apr 15 '19 at 20:16
  • @Alexis running this on a server I get an error `could not find function "group_map"`, while I have `dplyr` and all the other packages attached. Any ideas why this is popping up? – wake_wake Apr 19 '19 at 17:01
  • @Alexis Ah, `dplyr` is outdated. Can't update to the new version because it requires an updated version of `BH` that is not available for `Microsoft R Open`, it seems. Any ideas on how to tackle this hurdle? – wake_wake Apr 19 '19 at 17:34
  • @wake_wake I'm away from my computer right now, but `do` was the way to go before, so I'm guessing in the last step you could use something like `do(data.frame(province=.$province[1L], avg=avg_haversine(., parallel::detectCores())))` instead. – Alexis Apr 19 '19 at 18:38
  • @Alexis Thank you! I just fixed it by updating 10 packages manually. But your comment is helpful for others who are stuck with the old `dplyr` for now. Thanks! – wake_wake Apr 19 '19 at 18:40
  • @wake_wake I found a way to detect user interrupt, in case you find it useful. See the edit. – Alexis Apr 27 '19 at 21:26
7

Given that your data has millions of rows, this sounds like an "XY" problem. I.e. the answer you really need is not the answer to the question you asked.

Let me give an analogy: if you want to know the average height of trees in a forest you do not measure every tree. You just measure a large enough sample to ensure that your estimate has a high enough probability of being as close to the true average as you need.

Performing a brute-force calculation using the distance from every house to every other house will not only take excessive resources (even with optimised code), but it will also provide far more precision than you could possibly need or than is justified by the data accuracy (GPS coordinates are typically only accurate to within a few meters at best).

So, I would recommend doing the calculation on a sample size that is only as large as required for the level of accuracy your problem demands. For example, the following will provide an estimate on two million rows that is good to 4 significant figures within only a few seconds. You can increase the accuracy by increasing the sample size, but given the uncertainty in the GPS coordinates themselves, I doubt this is warranted.

sample.size=1e6    
lapply(split(df1[3:4], df1$province), 
  function(x) {
    s1 = x[sample(nrow(x), sample.size, T), ]
    s2 = x[sample(nrow(x), sample.size, T), ]
    mean(distHaversine(s1, s2))
  })

Some big data to test on:

N=1e6
df1 <- data.frame(
  province = c(rep(1,N),rep(2,N)),
  house = 1:(2*N),
  lat = c(rnorm(N,-76), rnorm(N,-85)), 
  lon = c(rnorm(N,39), rnorm(N,-55,2)))

To get a sense of the accuracy of this method, we can use bootstrapping. For the following demo, I use just 100,000 rows of data so that we can perform 1000 bootstrap iterations in a short time:

N=1e5
df1 <- data.frame(lat = rnorm(N,-76,0.1), lon = rnorm(N,39,0.1))

dist.f = function(i) {
    s1 = df1[sample(N, replace = T), ]
    s2 = df1[sample(N, replace = T), ]
    mean(distHaversine(s1, s2))
    }

boot.dist = sapply(1:1000, dist.f)
mean(boot.dist)
# [1] 17580.63
sd(boot.dist)
# [1] 29.39302

hist(boot.dist, 20) 

I.e. for these test data, the mean distance is 17,580 +/- 29 m. That is a coefficient of variation of about 0.2%, which is likely accurate enough for most purposes. As I said, you can get more accuracy by increasing the sample size if you really need to.
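If you prefer an interval to a standard deviation, an approximate 95% interval can be read directly off the bootstrap replicates (using the boot.dist vector from the block above):

quantile(boot.dist, c(0.025, 0.975))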

(Figure: histogram of the 1,000 bootstrapped mean distances.)

dww
  • It's a good point. What do you think of clustering houses based on their coordinates instead of "unsupervised" sampling? – M-- Apr 15 '19 at 02:37
  • I'm not sure what you mean by "clustering by coordinates". But if the intent is to estimate the mean distance for the whole population, then you should randomly sample from the whole population. If you are interested in disaggregating the results by some categorisation, then sample within the categories of interest. – dww Apr 15 '19 at 12:02
  • Let's say calculating a mean of lat/long for neighboring houses! – M-- Apr 15 '19 at 21:07
  • I don't think that will work, as clusters will have different numbers of houses, which will throw the calculation off. I still think sampling the whole population is best. I added some stats to the answer to give a better sense of how well this method works. – dww Apr 15 '19 at 22:56
  • Can you elaborate on that different number of houses? I just can't wrap my head around it as I have gone down that path for O/D studies before and it proved to work. – M-- Apr 15 '19 at 23:00
  • Imagine a simple example - there are two houses right next to each other (say one meter apart), and one house a hundred km away. If you cluster first, then take the average distance between clusters, you would get 100 km as the average. But the true answer should be (100+100+1)/3 or 67 km. So, the problem is not just that the clusters have different numbers of houses, but also that the clustering ignores the distance between houses within clusters when taking the average. – dww Apr 15 '19 at 23:10
  • Your simple example is flawed. Let's use it for your random sampling solution: you take only the two houses close to each other and calculate a distance of 1. You simply cannot downscale the size of the dataset to represent or reject the credibility of solutions. – M-- Apr 16 '19 at 13:35
  • This is not the place for a lengthy discussion about clustering (which is not part of the original question). If you want to follow up further about that, maybe starting a new question on cross-validated would be best. I would note here however, that my example is not flawed and does work with the sampling method - you just need to have a large enough sample. If you sample the 3 houses 1 million times with replacement (similar to my example), you will for sure get an answer extremely close to the true average. – dww Apr 16 '19 at 13:41
4

Solution:

lapply(split(df1, df1$province), function(df){
  df <- Expand.Grid(df[, c("lat", "lon")], df[, c("lat", "lon")])
  mean(distHaversine(df[, 1:2], df[, 3:4]))
})

where Expand.Grid() is taken from https://stackoverflow.com/a/30085602/3502164.

Explanation:

1. Performance

I would avoid using distm() as it transforms a vectorised function distHaversine() into an unvectorised distm(). If you look at the source code you see:

function (x, y, fun = distHaversine) 
{
   [...]
   for (i in 1:n) {
        dm[i, ] = fun(x[i, ], y)
    }
    return(dm)
}

While distHaversine() sends the "whole object" to C, distm() sends the data "row-wise" to distHaversine() and therefore forces distHaversine() to do the same when executing the code in C. Therefore, distm() should not be used. In terms of performance, I see more harm in using the wrapper function distm() than benefit.
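To see the vectorisation in action, distHaversine() takes whole matrices of points in a single call (a small sketch with arbitrary points):

library(geosphere)
p1 <- cbind(lon = c(39.2, 39.1, 39.3), lat = c(-76.6, -76.5, -76.4))
p2 <- cbind(lon = c(39.1, 39.3, 39.2), lat = c(-76.5, -76.4, -76.6))
distHaversine(p1, p2)   # three distances from one call, no R-level loop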

2. Explaining the code in "solution":

a) Splitting in groups:

You want to analyse the data per group: province. Splitting into groups can be done by: split(df1, df1$province).

b) Grouping "clumps of columns"

You want to find all unique combinations of lat/lon. A first guess might be expand.grid(), but that does not work for multiple columns. Luckily, Mr. Flick took care of this with an expand.grid function for data.frames in R.

Then you have a data.frame() of all possible combinations and just have to use mean(distHaversine(...)).

Tonio Liebrand
1

In reference to this thread, a vectorized solution for your problem would be as below:

toCheck <- sapply(split(df1, df1$province), function(x){
                                            combn(rownames(x), 2, simplify = FALSE)})

names(toCheck) <- sapply(toCheck, paste, collapse = " - ")


sapply(toCheck, function(x){
               distm(df1[x[1],c("lon","lat")], df1[x[2],c("lon","lat")], 
                     fun = distHaversine)
                           })


  #    1 - 2      1 - 3      2 - 3      4 - 5      4 - 6      5 - 6 
  # 11429.10   22415.04   12293.48  634549.20 1188925.65  557361.28 

This works if the number of records for each province is the same. If that's not the case, then the second part, assigning the appropriate names to toCheck and how we use it at the end, should be changed, as the structure of the toCheck list changes. It does not depend on the order of the dataset, though.


For your actual dataset, toCheck will become a nested list, so you need to tweak the function as below; I have not cleaned up the toCheck names for this solution. (df2 can be found at the end of the answer.)

df2 <- df2[order(df2$province),] #sorting may even improve performance

toCheck <- sapply(split(df2, df2$province), function(x){
                                            combn(rownames(x), 2, simplify = FALSE)})

names(toCheck) <- paste("province", unique(df2$province))

sapply(toCheck, function(x){ sapply(x, function(y){
  distm(df2[y[1],c("lon","lat")], df2[y[2],c("lon","lat")], fun = distHaversine)
})})

# $`province 1`
# [1]   11429.10   22415.04 1001964.84   12293.48 1013117.36 1024209.46
# 
# $`province 2`
# [1]  634549.2 1188925.7  557361.3
# 
# $`province 3`
# [1] 590083.2
# 
# $`province 4`
# [1] 557361.28 547589.19  11163.92

You can further get the mean() for each province. Also, if you need to, it should not be hard to rename the elements of the nested lists so you can tell which houses each distance corresponds to.
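For instance, if you store the nested sapply result above in a variable, say dists (a name used here only for illustration), the per-province means follow directly:

dists <- sapply(toCheck, function(x){ sapply(x, function(y){
  distm(df2[y[1],c("lon","lat")], df2[y[2],c("lon","lat")], fun = distHaversine)
})})

sapply(dists, mean)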

df2 <- data.frame(province = c(1, 1, 1, 2, 2, 2, 1, 3, 3, 4,4,4),
                  house = c(1, 2, 3, 4, 5, 6, 7, 10, 9, 8, 11, 12),
                  lat = c(-76.6, -76.5, -76.4, -75.4, -80.9, -85.7, -85.6, -76.4, -75.4, -80.9, -85.7, -85.6), 
                  lon = c(39.2, 39.1, 39.3, 60.8, 53.3, 40.2, 40.1, 39.3, 60.8, 53.3, 40.2, 40.1))
M--
  • Interesting! The number of records per province changes a lot, so it needs to be tweaked a bit, indeed. – wake_wake Apr 12 '19 at 16:11
  • @wake_wake I updated my answer. I am not sure that this would be an example of branch prediction, but I sorted the data frame anyway. Unless someone comes up with an **optimized** `data.table` solution, this should be the best performance-wise. Good luck. – M-- Apr 12 '19 at 18:05
  • This works great for the mock data and (part of the) original data. Yet, at some point the `toCheck` gets so large, it exhausts the memory of my 128GB local machine. – wake_wake Apr 12 '19 at 21:11
  • @wake_wake Handling big data has its own challenges and techniques that do not fall within the scope of this question. Your best bet, without getting your hands dirty with higher-level solutions, would be to break your data into smaller chunks, apply the solution above, clear the memory, and move on to the next chunk. – M-- Apr 12 '19 at 21:14
0

My 10 cents. You can:

# subset the province
df1 <- df1[which(df1$province==1),]

# get all combinations
all <- combn(df1$house, 2, FUN = NULL, simplify = TRUE)

# run your function and get distances for all combinations
distances <- c()
for(col in 1:ncol(all)) {
  a <- all[1, col]
  b <- all[2, col]
  dist <- distm(c(df1$lon[a], df1$lat[a]), c(df1$lon[b], df1$lat[b]), fun = distHaversine)
  distances <- c(distances, dist)
  }

# calculate mean:
mean(distances)
# [1] 15379.21

This gives you the mean value for the province, which you can compare with the results of other methods. For example, sapply, which was mentioned in the comments:

df1 <- df1[which(df1$province==1),]
mean(sapply(split(df1, df1$province), dist))
# [1] 1.349036

As you can see, it gives different results, because the dist function calculates distances of a different type (like Euclidean) and cannot do haversine or other "geodesic" distances. The package geodist seems to have options which could bring you closer to sapply:

library(geodist)
library(magrittr)

# defining the data
df1 <- data.frame(province = c(1, 1, 1, 2, 2, 2),
                  house = c(1, 2, 3, 4, 5, 6),
                  lat = c(-76.6, -76.5, -76.4, -75.4, -80.9, -85.7), 
                  lon = c(39.2, 39.1, 39.3, 60.8, 53.3, 40.2))

# defining the function 
give_distance <- function(resultofsplit){
  distances <- c()
  for (i in 1:length(resultofsplit)){
    sdf <- resultofsplit
    sdf <- sdf[[i]]
    sdf <- sdf[c("lon", "lat", "province", "house")]

    sdf2 <- as.matrix(sdf)
    sdf3 <- geodist(x=sdf2, measure="haversine")
    sdf4 <- unique(as.vector(sdf3))
    sdf4 <- sdf4[sdf4 != 0]        # this is to remove the 0-distances 
    mean_dist <- mean(sdf4)
    distances <- c(distances, mean_dist)
    }  
    return(distances)
}

split(df1, df1$province) %>% give_distance()
#[1]  15379.21 793612.04

E.g. the function will give you the mean distance values for each province. Now, I didn't manage to get give_distance to work with sapply, but this should already be more efficient.
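For completeness, a more compact variant of the same idea with sapply (my own sketch, not part of the answer above): one geodist() call per province, then the mean of the lower triangle of the resulting distance matrix, which also avoids having to filter out the zero self-distances:

sapply(split(df1, df1$province), function(sdf) {
  m <- geodist(as.matrix(sdf[c("lon", "lat")]), measure = "haversine")
  mean(m[lower.tri(m)])
})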

Oka
0

You can use a vectorized version of haversine distance, such as :

dist_haversine_for_dfs <- function (df_x, df_y, r = 6378137) 
{
  if(!all(c("lat", "lon") %in% names(df_x))) {
    stop("parameter df_x does not have columns 'lat' and 'lon'")
  }
  if(!all(c("lat", "lon") %in% names(df_y))) {
    stop("parameter df_y does not have columns 'lat' and 'lon'")
  }
  toRad <- pi/180
  df_x <- df_x * toRad
  df_y <- df_y * toRad
  dLat <- df_y[["lat"]] - df_x[["lat"]]
  dLon <- df_y[["lon"]] - df_x[["lon"]]
  a <- sin(dLat/2) * sin(dLat/2) + cos(df_x[["lat"]]) * cos(df_y[["lat"]]) * 
    sin(dLon/2) * sin(dLon/2)
  a <- pmin(a, 1)
  dist <- 2 * atan2(sqrt(a), sqrt(1 - a)) * r
  return(dist)
}

Then using data.table and the package arrangements (for faster combinations generation) you can do the following :

library(data.table)
dt <- data.table(df1)
ids <- dt[, {
  comb_mat <- arrangements::combinations(x = house, k = 2)
  list(house_x = comb_mat[, 1],
       house_y = comb_mat[, 2])}, by = province]

jdt <- cbind(ids, 
             dt[ids$house_x, .(lon_x=lon, lat_x=lat)], 
             dt[ids$house_y, .(lon_y=lon, lat_y=lat)])

jdt[, dist := dist_haversine_for_dfs(df_x = jdt[, .(lon = lon_x, lat = lat_x)],
                                     df_y = jdt[, .(lon = lon_y, lat = lat_y)])]

jdt[, .(mean_dist = mean(dist)), by = province]

which outputs

   province mean_dist
1:        1  15379.21
2:        2 793612.04
Etienne Kintzler
0

I add below a solution using the spatialrisk package. The key functions in this package are written in C++ (Rcpp), and are therefore very fast.

library(data.table)
library(tidyverse)
library(spatialrisk)
library(optiRum)

# Expand grid
grid <- function(x){
  df <- x[, .(lat, lon)]
  optiRum::CJ.dt(df, df)
}

Since each element of the output is a data frame, purrr::map_dfr is used to row-bind them together:

data.table(df1) %>%
  split(.$province) %>%
  map_dfr(grid, .id = "province") %>%
  mutate(distm = spatialrisk::haversine(lat, lon, i.lat, i.lon)) %>%
  filter(distm > 0) %>%
  group_by(province) %>%
  summarize(distm_mean = mean(distm))

Output:

  province distm_mean
  <chr>         <dbl>
1 1            15379.
2 2           793612.
mharinga