
I'm trying to create dist objects from large distance matrices, but I'm running out of memory using stats::as.dist. For example, I have around 128 GB available on my current machine, yet as.dist runs out of memory when processing a 73,000 x 73,000 matrix (approx. 42 GB). Given that the final dist object should be less than half the size of the matrix (it is just the lower triangle, stored as a vector), it seems to me that the conversion should be possible in a more memory-efficient way, provided we are careful not to create large intermediate objects and instead copy the relevant elements of the input directly to the output.

Looking at the code for getS3method('as.dist', 'default'), I see that it does the extraction with ans <- m[row(m) > col(m)], which requires creating row() and col() index matrices with the same dimensions as the input.
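
To see where the memory goes, here is a toy illustration (the 4 x 4 matrix is purely for demonstration) of what that indexing materialises before any values are copied:

m <- matrix(1:16, 4, 4)
row(m)              # full 4 x 4 matrix of row indices
col(m)              # full 4 x 4 matrix of column indices
row(m) > col(m)     # 4 x 4 logical mask selecting the lower triangle
m[row(m) > col(m)]  # lower-triangle values, extracted column by column

For a 73,000 x 73,000 input, each of those index matrices is roughly 21 GB on its own (integer and logical elements are 4 bytes each).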

I thought I might be able to improve on this by using the algorithm from here to generate the indices of the lower triangle directly. Here is my attempt using this method.

as.dist.new <- function(dm, diag = FALSE, upper = FALSE) {
  stopifnot(is.matrix(dm))
  n <- dim(dm)[1]
  stopifnot(dim(dm)[2] == n)

  # map each linear index k of the lower triangle (in column-major order,
  # the storage order used by dist) to its (row, column) position
  k <- 1:((n^2 - n) / 2)
  j <- floor(((2 * n + 1) - sqrt((2 * n - 1)^2 - 8 * (k - 1))) / 2)
  i <- j + k - (2 * n - j) * (j - 1) / 2
  idx <- cbind(i, j)
  remove(i, j, k)
  gc()

  # copy the lower-triangle values straight into the output vector
  d <- dm[idx]

  class(d) <- "dist"
  attr(d, "Size") <- n
  attr(d, "call") <- match.call()
  attr(d, "Diag") <- diag
  attr(d, "Upper") <- upper
  d
}
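
For a small matrix the index formula can be checked by hand; for n = 4 it visits the lower triangle column by column, the same storage order that dist uses:

n <- 4
k <- 1:((n^2 - n)/2)
j <- floor(((2 * n + 1) - sqrt((2 * n - 1)^2 - 8 * (k - 1))) / 2)
i <- j + k - (2 * n - j) * (j - 1) / 2
cbind(k, i, j)
#      k i j
# [1,] 1 2 1
# [2,] 2 3 1
# [3,] 3 4 1
# [4,] 4 3 2
# [5,] 5 4 2
# [6,] 6 4 3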

This works fine on smaller matrices. Here's a simple example:

N = 10
dm = matrix(runif(N*N), N, N) 
diag(dm) = 0

x = as.dist(dm)
y = as.dist.new(dm)
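
A quick sanity check that the two agree (comparing the bare numeric vectors, since the "call" attributes necessarily differ):

all.equal(c(x), c(y))  # should be TRUE; c() drops the attributes, leaving just the distances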

However, if we create a larger distance matrix it runs into the same memory issues as as.dist.

E.g. this version crashes on my system:

N = 73000
dm = matrix(runif(N*N), N, N)
gc() 
diag(dm) = 0
gc()

as.dist.new(dm)
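
A rough accounting of the intermediates (the full-length vectors here are doubles, 8 bytes per element) shows why this still blows up:

N <- 73000
n_pairs <- (N^2 - N) / 2   # ~2.66e9 lower-triangle elements
N^2 * 8 / 1e9              # ~42.6 GB: the input matrix dm
n_pairs * 8 / 1e9          # ~21.3 GB: each of i, j (and each temporary made while computing them)
2 * n_pairs * 8 / 1e9      # ~42.6 GB: the cbind(i, j) index matrix

With dm still alive alongside idx and the index vectors, peak usage comfortably exceeds 128 GB even though the final dist object is only about 21 GB.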

Does anyone have a suggestion for how to perform this operation more efficiently? R or Rcpp solutions welcome. NB: looking at this answer to a related problem (generating the full distance matrix from 2-column location data), it seems it may be possible to do this using RcppArmadillo, but I have no experience with that package.

dww
  • Please try @mharinga's solution as suggested here: https://stackoverflow.com/questions/37407065/r-faster-way-of-computing-large-distance-matrix – tushaR Jan 25 '20 at 08:04
  • Thanks for the link @tushaR - the package in that answer could be useful for generating distance matrices more quickly. But what I am doing is different: converting a distance matrix into a dist object. – dww Jan 25 '20 at 14:17
  • There is no surprise here. You have a 42 GB matrix + a 21 GB vector, which is already 63 GB. You can try to keep your matrix on disk using e.g. FBMs from package {bigstatsr} (disclaimer: I'm the author). – F. Privé Jan 26 '20 at 07:28
  • Thanks for the suggestion @F.Privé - but my problem is not lack of disk space (I could of course just increase the swap size), but rather that I should already have enough memory available for the matrix plus vector. So I'm looking for more efficient solutions that limit the amount of disk swapping required. – dww Jan 27 '20 at 01:33
  • You do NOT have enough RAM for the 2 objects + other stuff. This is why I recommend you to have the matrix on disk instead. – F. Privé Jan 27 '20 at 09:38
  • I appreciate that your package could help others, so the comments are welcome. However, for me "recommend you to have the matrix on disk instead" is exactly what the swap space does on Linux. Effectively I have 128 GB of working memory available. The operation does not fail when it crosses the 64 GB barrier; it fails when it goes over 128 GB. In any case, I have a more efficient solution now (as per my answer below), so there is no need to put the matrix on disk at all. – dww Jan 27 '20 at 14:03

2 Answers


Well, the logic to traverse the lower triangle is relatively simple, and if you do it in C++ it can also be fast:

library(Rcpp)

sourceCpp(code='
// [[Rcpp::plugins(cpp11)]]

#include <cstddef> // size_t

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
NumericVector as_dist(const NumericMatrix& mat) {
  std::size_t nrow = mat.nrow();
  std::size_t ncol = mat.ncol();
  // length of the strict lower triangle (diagonal excluded)
  std::size_t size = nrow * (nrow - 1) / 2;
  NumericVector ans(size);

  if (nrow > 1) {
    std::size_t k = 0;
    // walk the lower triangle column by column, the same order stats::dist uses
    for (std::size_t j = 0; j < ncol; j++) {
      for (std::size_t i = j + 1; i < nrow; i++) {
        ans[k++] = mat(i,j);
      }
    }
  }

  ans.attr("class") = "dist";
  ans.attr("Size") = nrow;
  ans.attr("Diag") = false;
  ans.attr("Upper") = false;
  return ans;
}
')

as_dist(matrix(1:100, 10, 10))
    1  2  3  4  5  6  7  8  9
2   2                        
3   3 13                     
4   4 14 24                  
5   5 15 25 35               
6   6 16 26 36 46            
7   7 17 27 37 47 57         
8   8 18 28 38 48 58 68      
9   9 19 29 39 49 59 69 79   
10 10 20 30 40 50 60 70 80 90
Alexis
  • Thanks. Works well on smaller matrices, but runs into the same memory issues on larger ones. Although I confess I don't know why - it looks as though it *should* work. I'm guessing it has something to do with R creating copies of objects that are passed or returned. But I thought that only pointers are passed between R and C++, so I'm confused why this should be the case. – dww Jan 25 '20 at 17:06
  • @dww weird, what's the exact error message you're getting? And which OS are you using? – Alexis Jan 25 '20 at 17:55
  • I see the memory (both RAM and swap) fill up gradually in the system monitor until the R session crashes once both are at 100%. Using Linux (Pop!_OS) with 64 GB RAM + 64 GB swap. – dww Jan 25 '20 at 19:37
  • @dww not sure what could be wrong, maybe try adding some good old `Rcout << "foo"` debugging statements to see if it actually runs the code or breaks as soon as `ans` is allocated? – Alexis Jan 25 '20 at 21:31
  • I've changed approach a bit and am now using a function (based loosely on your answer) that calculates the dist object directly from lat and lon vectors, without generating the intermediate distance matrix at all. This seems the best way forwards. I'll try your debugging suggestion later, as it is also interesting to understand why this version failed. One possibility - shouldn't `j` iterate from `0` to `(ncol-1)`? running the loop up to `ncol` could go out of bounds. Although I'd expect this to throw an out of bounds error rather than just run out of memory. – dww Jan 25 '20 at 22:14

My current solution is to calculate the dist object directly from lat and lon vectors, without generating the intermediate distance matrix at all. On large matrices, this is several hundred times faster than the "conventional" pipeline of geosphere::mdist() followed by stats::as.dist() and uses only as much memory as required to store the final dist object.

The following Rcpp source is based on the functions from here to calculate haversine distance in C++, together with an adaptation of @Alexis's method to iterate through the lower-triangle elements in C++.

#include <Rcpp.h>
#include <cmath>   // sin, cos, sqrt, atan2
using namespace Rcpp;

// haversine great-circle distance (in metres) between two points given in radians
double distanceHaversine(double latf, double lonf, double latt, double lont, double tolerance){
  double d;
  double dlat = latt - latf;
  double dlon =  lont - lonf;
  d = (sin(dlat * 0.5) * sin(dlat * 0.5)) + (cos(latf) * cos(latt)) * (sin(dlon * 0.5) * sin(dlon * 0.5));
  // guard against floating-point error pushing d slightly above 1
  if(d > 1 && d <= tolerance){
    d = 1;
  }
  return 2 * atan2(sqrt(d), sqrt(1 - d)) * 6378137.0;   // 6378137 m: Earth's equatorial radius
}

double toRadians(double deg){
  return deg * 0.01745329251;  // PI / 180;
}

//-----------------------------------------------------------
// [[Rcpp::export]]
NumericVector calc_dist(Rcpp::NumericVector lat, 
                        Rcpp::NumericVector lon, 
                        double tolerance = 10000000000.0) {
  std::size_t nlat = lat.size();
  std::size_t nlon = lon.size();
  if (nlat != nlon) throw std::range_error("lat and lon different lengths");
  if (nlat < 2) throw std::range_error("Need at least 2 points");
  std::size_t size = nlat * (nlat - 1) / 2;
  NumericVector ans(size);
  std::size_t k = 0;
  double latf;
  double latt;
  double lonf;
  double lont;
  
  // walk the lower triangle column by column, matching the storage order of dist
  for (std::size_t j = 0; j < (nlat-1); j++) {
    for (std::size_t i = j + 1; i < nlat; i++) {
      latf = toRadians(lat[i]);
      lonf = toRadians(lon[i]);
      latt = toRadians(lat[j]);
      lont = toRadians(lon[j]);
      ans[k++] = distanceHaversine(latf, lonf, latt, lont, tolerance);
    }
  }
  
  return ans;
}

/*** R
as_dist = function(lat, lon, tolerance  = 10000000000.0) {
  dd = calc_dist(lat, lon, tolerance)
  attr(dd, "class") = "dist"
  attr(dd, "Size") = length(lat)
  attr(dd, "call") = match.call()
  attr(dd, "Diag") = FALSE
  attr(dd, "Upper") = FALSE
  return(dd)
}
*/
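
If the listing is saved to a file (the name below is purely illustrative), Rcpp::sourceCpp() compiles calc_dist() and also evaluates the embedded /*** R block, so as_dist() is immediately available:

Rcpp::sourceCpp("as_dist_haversine.cpp")   # illustrative file name

lat <- runif(5, -90, 90)
lon <- runif(5, -180, 180)
as_dist(lat, lon)   # "dist" object holding the 10 pairwise haversine distances, in metres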
dww