
The problem I am having is that I have to calculate a Euclidean distance matrix between shapes that can range from 20,000 up to 60,000 points, which produces 10-20 GB of data per matrix. I have to run this calculation thousands of times (roughly 20 GB x 7,000, where each calculation is a different point cloud). The shapes can be either 2D or 3D.

EDITED (Updated questions)

  1. Is there a more efficient way to calculate the forward and backward distances without using two separate nested loops?

    I know I could store the full distance matrix and take the minimum in each direction, but that creates a huge memory problem with large point clouds.

  2. Is there a way to speed up this calculation and/or clean up the code to trim off time?

The irony is that I only need the matrix to compute one very simple summary metric, the average Hausdorff distance, yet finding that metric naively requires the entire matrix.
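For reference, the quantity computed throughout this post is the symmetric average Hausdorff distance, i.e. the mean of the two directional average minimum distances (this matches the code further down):

avg_hausdorff(X, Y) = ( mean_i min_j ||x_i - y_j|| + mean_j min_i ||x_i - y_j|| ) / 2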

Data example where each column represents a dimension of the shape and each row is a point in the shape:

first_configuration <- matrix(1:6,2,3)
second_configuration <- matrix(6:11,2,3)
colnames(first_configuration) <- c("x","y","z")
colnames(second_configuration) <- c("x","y","z")

This code calculates the Euclidean distance matrix between the two configurations:

m <- nrow(first_configuration)
n <- nrow(second_configuration)

# uses |a - b|^2 = |a|^2 + |b|^2 - 2 a.b; pmax() guards against small
# negative values that floating-point error can produce before sqrt()
D <- sqrt(pmax(matrix(rep(apply(first_configuration * first_configuration, 1, sum), n), m, n, byrow = F) +
               matrix(rep(apply(second_configuration * second_configuration, 1, sum), m), m, n, byrow = T) -
               2 * first_configuration %*% t(second_configuration), 0))
D

Output:

     [,1]      [,2]
[1,] 8.660254 10.392305
[2,] 6.928203  8.660254
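
A quick way to sanity-check the vectorized formula on a small example is an explicit double loop over all pairs (far too slow for real point clouds, but useful for verification):

# direct pairwise computation for comparison
D_check <- outer(seq_len(m), seq_len(n),
                 Vectorize(function(i, j)
                   sqrt(sum((first_configuration[i, ] - second_configuration[j, ])^2))))
all.equal(D, D_check)  # should be TRUE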

EDIT: included the average Hausdorff code

d1 <- mean(apply(D, 1, min))
d2 <- mean(apply(D, 2, min))
# note: mean(d1, d2) silently returns d1, because the second argument of mean() is trim
average_hausdorff <- mean(c(d1, d2))
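
Since only the row minima and column minima of D are ever used, the full matrix never has to exist at once even in plain R: it can be built in row blocks and reduced on the fly. A sketch of this idea (the function name and the block size of 1000 are arbitrary choices, not tuned):

avg_hausdorff_blockwise <- function(a, b, block = 1000L) {
  sq_b <- rowSums(b^2)
  row_min <- numeric(nrow(a))
  col_min <- rep(Inf, nrow(b))
  for (start in seq(1L, nrow(a), by = block)) {
    idx <- start:min(start + block - 1L, nrow(a))
    ablk <- a[idx, , drop = FALSE]
    # same |a|^2 + |b|^2 - 2 a.b identity as above, block rows at a time
    Dblk <- sqrt(pmax(outer(rowSums(ablk^2), sq_b, "+") -
                      2 * tcrossprod(ablk, b), 0))
    row_min[idx] <- apply(Dblk, 1, min)
    col_min <- pmin(col_min, apply(Dblk, 2, min))
  }
  (mean(row_min) + mean(col_min)) / 2
}

On the toy example, avg_hausdorff_blockwise(first_configuration, second_configuration) reproduces average_hausdorff above.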

EDIT (Rcpp solution): Here is my attempt to implement it in Rcpp so that the full matrix is never held in memory. It works now, but it is very slow.

library(Rcpp)
sourceCpp(code='
#include <Rcpp.h>
#include <limits>
using namespace Rcpp;

// [[Rcpp::export]]
double edist_rcpp(NumericVector x, NumericVector y){
    double d = sqrt( sum( pow(x - y, 2) ) );
    return d;
}


// [[Rcpp::export]]
double avg_hausdorff_rcpp(NumericMatrix x, NumericMatrix y){
    int nrowx = x.nrow();
    int nrowy = y.nrow();
    // use <double>::max(), not <int>::max(): the minima are doubles and
    // could in principle exceed INT_MAX
    double new_low_x = std::numeric_limits<double>::max();
    double new_low_y = std::numeric_limits<double>::max();

    double mean_forward = 0;
    double mean_backward = 0;
    double mean_hd; 
    double td; 

    //forward
    for(int i = 0; i < nrowx; i++) {
        for(int j = 0; j < nrowy; j++) {
            // note: these assignments copy each row into fresh vectors on
            // every pair, which likely accounts for much of the slowness
            NumericVector v1 = x.row(i);
            NumericVector v2 = y.row(j);
            td = edist_rcpp(v1, v2);
            if(td < new_low_x) {
                new_low_x = td;
            }
        }
        mean_forward = mean_forward + new_low_x;
        new_low_x = std::numeric_limits<double>::max();
    }

    //backward
    for(int i = 0; i < nrowy; i++) {
        for(int j = 0; j < nrowx; j++) {
            NumericVector v1 = y.row(i);
            NumericVector v2 = x.row(j);
            td = edist_rcpp(v1, v2);
            if(td < new_low_y) {
                new_low_y = td;
            }
        }
        mean_backward = mean_backward + new_low_y;
        new_low_y = std::numeric_limits<double>::max();
    }

    //hausdorff mean
    mean_hd = (mean_forward / nrowx + mean_backward / nrowy) / 2;

    return mean_hd;
}
')
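
Calling it on the toy example should reproduce the base R value (I coerce the matrices to double explicitly, since matrix(1:6, 2, 3) creates integer storage; Rcpp may handle the coercion itself, but being explicit is safe):

storage.mode(first_configuration) <- "double"
storage.mode(second_configuration) <- "double"
avg_hausdorff_rcpp(first_configuration, second_configuration)  # same value as average_hausdorff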

EDIT (RcppParallel solution): Definitely faster than the serial Rcpp solution, and certainly faster than the R solution. If anyone has tips on how to improve my RcppParallel code to trim off some extra time, it would be much appreciated!

sourceCpp(code='
#include <Rcpp.h>
#include <RcppParallel.h>
#include <cmath>
#include <limits>

// [[Rcpp::depends(RcppParallel)]]
struct minimum_euclidean_distances : public RcppParallel::Worker {
    //Input
    const RcppParallel::RMatrix<double> a;
    const RcppParallel::RMatrix<double> b;

    //Output
    RcppParallel::RVector<double> medm;

    minimum_euclidean_distances(const Rcpp::NumericMatrix a, const Rcpp::NumericMatrix b, Rcpp::NumericVector medm) : a(a), b(b), medm(medm) {}

    void operator() (std::size_t begin, std::size_t end) {
        for(std::size_t i = begin; i < end; i++) {
            double new_low = std::numeric_limits<double>::max();
            for(std::size_t j = 0; j < b.nrow(); j++) {
                double dsum = 0;
                for(std::size_t z = 0; z < b.ncol(); z++) {
                    double diff = a(i,z) - b(j,z);
                    dsum += diff * diff;    // cheaper than pow(x, 2)
                }
                dsum = std::sqrt(dsum);     // cheaper than pow(dsum, 0.5)
                if(dsum < new_low) {
                    new_low = dsum;
                }
            }
            medm[i] = new_low;
        }
    }
};


// [[Rcpp::export]]
double mean_directional_hausdorff_rcpp(Rcpp::NumericMatrix a, Rcpp::NumericMatrix b){
    Rcpp::NumericVector medm(a.nrow());
    minimum_euclidean_distances worker(a, b, medm);  // avoid shadowing the type name
    RcppParallel::parallelFor(0, a.nrow(), worker);
    double results = Rcpp::sum(medm);
    results = results / a.nrow();
    return results;
}


// [[Rcpp::export]]
double max_directional_hausdorff_rcpp(Rcpp::NumericMatrix a, Rcpp::NumericMatrix b){
    Rcpp::NumericVector medm(a.nrow());
    minimum_euclidean_distances worker(a, b, medm);
    RcppParallel::parallelFor(0, a.nrow(), worker);
    double results = Rcpp::max(medm);
    return results;
}
')
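
The two directional calls can then be combined in R; ll and rr below stand for the two point-cloud matrices used in the benchmarks. Note that mean(x, y) does not average two numbers in R (the second argument is trim), hence the explicit sum:

## symmetric average Hausdorff from the two directional means
avg_hd <- (mean_directional_hausdorff_rcpp(ll, rr) +
           mean_directional_hausdorff_rcpp(rr, ll)) / 2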

Benchmarks using large point clouds of 37,775 and 36,659 points:

# Rcpp serial solution
system.time(avg_hausdorff_rcpp(ll, rr))
   user  system elapsed 
409.143   0.000 409.105 

# RcppParallel solution (both directions, then averaged)
system.time((mean_directional_hausdorff_rcpp(ll, rr) + mean_directional_hausdorff_rcpp(rr, ll)) / 2)
   user  system elapsed 
260.712   0.000  33.265 
  • With sets with that many points, perhaps you should invest time in finding ways to compute the Hausdorff distance that don't involve the naïve computation of all pairwise distances. This might give you some ideas: https://www.researchgate.net/publication/276681327_An_Efficient_Algorithm_for_Calculating_the_Exact_Hausdorff_Distance – John Coleman Nov 10 '17 at 01:18
  • @JohnColeman Thanks, I will check that out! – JJL Nov 10 '17 at 01:23
  • @JohnColeman I wonder if this would be extendable to variations of Hausdorff though? In particular the average Hausdorff. – JJL Nov 10 '17 at 01:30

1 Answer


I tried using JuliaCall to do the calculation of the average Hausdorff distance. JuliaCall embeds Julia in R.

I have only tried a serial solution in JuliaCall. It seems to be faster than both the RcppParallel solution and the serial Rcpp solution in the question, though I don't have the benchmark data. Since parallel computation is built into Julia, a parallel version should not be difficult to write; I will update my answer once I have tried it.

Below is the Julia file I wrote:

# Calculate the min distance from the k-th point in as to the points in bs
function min_dist(k, as, bs)
    n = size(bs, 1)
    p = size(bs, 2)
    dist = Inf
    for i in 1:n
        r = 0.0
        for j in 1:p
            r += (as[k, j] - bs[i, j]) ^ 2
            ## if r is already greater than the upper bound, 
            ## then there is no need to continue doing the calculation
            if r > dist
                continue
            end
        end
        if r < dist
            dist = r
        end
    end
    sqrt(dist)
end

function avg_min_dist_from(as, bs)
    distsum = 0.0
    n1 = size(as, 1)
    for k in 1:n1
        distsum += min_dist(k, as, bs)
    end
    distsum / n1
end

function hausdorff_avg_dist(as, bs)
    (avg_min_dist_from(as, bs) + avg_min_dist_from(bs, as)) / 2
end

And this is the R code to use the Julia function:

first_configuration <- matrix(1:6,2,3)
second_configuration <- matrix(6:11,2,3)
colnames(first_configuration) <- c("x","y","z")
colnames(second_configuration) <- c("x","y","z")

m <- nrow(first_configuration)
n <- nrow(second_configuration)

D <- sqrt(matrix(rep(apply(first_configuration * first_configuration, 1, sum), n), m, n, byrow = F) +
          matrix(rep(apply(second_configuration * second_configuration, 1, sum), m), m, n, byrow = T) -
          2 * first_configuration %*% t(second_configuration))
D

d1 <- mean(apply(D, 1, min))
d2 <- mean(apply(D, 2, min))
average_hausdorff <- mean(c(d1, d2))  # not mean(d1, d2): the 2nd argument of mean() is trim

library(JuliaCall)
## the first time of julia_setup could be quite time consuming
julia_setup()
## source the julia file which has our hausdorff_avg_dist function
julia_source("hausdorff.jl")

## check if the julia function is correct with the example
average_hausdorff_julia <- julia_call("hausdorff_avg_dist",
                                      first_configuration,
                                      second_configuration)
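## sanity check: the Julia result should match the base R value computed above
all.equal(average_hausdorff, average_hausdorff_julia)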
## generate some large random point clouds
n1 <- 37775
n2 <- 36659
as <- matrix(rnorm(n1 * 3), n1, 3)
bs <- matrix(rnorm(n2 * 3), n2, 3)

system.time(julia_call("hausdorff_avg_dist", as, bs))

The time on my laptop was less than 20 seconds; note that this is the performance of the serial version of JuliaCall! I used the same data to test the serial Rcpp solution in the question, which took more than 10 minutes to run. I don't have RcppParallel on my laptop at the moment, so I can't try that. And as I said, Julia has built-in support for parallel computation.

  • Thanks! I've never heard of the Julia language before. I will give this a try. I see you used the inner loop break. Does that still allow an accurate calculation of the average hausdorff? I was under the impression it only works with the maximum. – JJL Nov 14 '17 at 07:41
  • @JJL Yes, I think so. Because the target of the function is to calculate the min distance of a point to a point cloud, the thing in the outer loop is `if r < dist dist = r`, so if I break the inner loop with condition `r > dist`, it shouldn't make any difference. In my test, without the inner loop breaking, this julia implementation is still faster than the RCpp serial implementation but seems to be slower than the RCpp parallel. I'm wondering if any further optimization is possible. And I usually find optimization in Julia is quite easy if you know how to do it. – Consistency Nov 14 '17 at 14:19
  • How can you be sure another point further down the second matrix won't have a smaller distance? What if the order of the points in the matrices is nonsensical? It seems to me the inner breaking would be highly dependent on the order of the points, but in my experience real-life scanning is not that orderly. – JJL Nov 14 '17 at 21:03
  • @JJL Oh, my mistake. The `break` should be `continue`. Thank you for pointing that out! – Consistency Nov 14 '17 at 21:41
  • @JJL After correcting the mistake. The timing is still less than or around 20 seconds on my laptop. – Consistency Nov 14 '17 at 21:49