
I would like to construct a distance matrix in parallel in C++11 using OpenMP. I have read various documentation, introductions, examples, etc., yet I still have a few questions. To facilitate answering this post, I state my questions as assumptions numbered 1 through 7. This way, you can quickly browse through them and point out which ones are correct and which ones are not.

Let us begin with a simple serially executed function computing a dense Armadillo matrix:

// [[Rcpp::export]]
arma::mat compute_dist_mat(arma::mat &coordinates, unsigned int n_points) {
  arma::mat dist_mat(n_points, n_points, arma::fill::zeros);
  double dist {};
  for(unsigned int i {0}; i < n_points; i++) {
    for(unsigned int j = i + 1; j < n_points; j++) {
      dist = compute_dist(coordinates(i, 1), coordinates(j, 1), coordinates(i, 0), coordinates(j, 0));
      dist_mat.at(i, j) = dist;
      dist_mat.at(j, i) = dist;
    }
  }
  return dist_mat;
}

As a side note: this function is meant to be called from R through the Rcpp interface, as indicated by the // [[Rcpp::export]] attribute. Accordingly, the top of the file includes

#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::plugins(cpp11)]]
#include <omp.h>
// [[Rcpp::plugins(openmp)]]

using namespace Rcpp;
using namespace arma;

However, the function should also work fine without the R interface.
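As another side note, compute_dist is a small helper that takes the two coordinate pairs and returns their distance; its exact form does not matter for the questions below. A plain Euclidean placeholder with the same argument order would be, for example:

double compute_dist(double y1, double y2, double x1, double x2) {
  // Placeholder only (requires <cmath>): Euclidean distance between (x1, y1) and (x2, y2),
  // matching the call compute_dist(coordinates(i, 1), coordinates(j, 1), coordinates(i, 0), coordinates(j, 0)).
  return std::sqrt((x1 - x2) * (x1 - x2) + (y1 - y2) * (y1 - y2));
}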

In an attempt to parallelize the code, I replace the loops with

unsigned int i {};
unsigned int j {};
# pragma omp parallel for private(dist, i, j) num_threads(n_threads) if(n_threads > 1)
for(i = 0; i < n_points; i++) {
  for(j = i + 1; j < n_points; j++) {
    dist = compute_dist(coordinates(i, 1), coordinates(j, 1), coordinates(i, 0), coordinates(j, 0));
    dist_mat.at(i, j) = dist;
    dist_mat.at(j, i) = dist;
  }
}

and add n_threads as an argument to the compute_dist_mat function.

  1. This distributes the iterations of the outer loop across threads, with the inner-loop iterations of each outer iteration executed by the thread that handles that outer iteration.
  2. The two loop levels cannot be combined because the inner loop depends on the outer one.
  3. dist, i, and j are all to be initialized above the # pragma line and then declared private, rather than being declared and initialized inside the loops (a sketch of that alternative follows after this list).
  4. The # pragma line has no effect when n_threads = 1, resulting in serial execution.
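For comparison (related to assumption 3), the same loop could also be written with the variables declared inside the parallel loop, so that no private clause is needed, and with dynamic scheduling to balance the shrinking inner loop. This is only a sketch of the alternative, not code I have benchmarked:

# pragma omp parallel for schedule(dynamic) num_threads(n_threads) if(n_threads > 1)
for(unsigned int i = 0; i < n_points; i++) {
  for(unsigned int j = i + 1; j < n_points; j++) {
    // dist, i, and j are declared inside the loop and are therefore
    // automatically private to the executing thread
    double dist = compute_dist(coordinates(i, 1), coordinates(j, 1), coordinates(i, 0), coordinates(j, 0));
    dist_mat.at(i, j) = dist;
    dist_mat.at(j, i) = dist;
  }
}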

Extending the dense matrix application, the following code block illustrates the serial sparse matrix case with batch insertion. To motivate the use of sparse matrices here, I set distances below a certain threshold to zero.

// [[Rcpp::export]]
arma::sp_mat compute_dist_spmat(arma::mat &coordinates, unsigned int n_points, double dist_threshold) {
  std::vector<double> dists;
  std::vector<unsigned int> dist_i;
  std::vector<unsigned int> dist_j;
  double dist {};
  for(unsigned long int i {0}; i < n_points; i++) {
    for(unsigned long int j = i + 1; j < n_points; j++) {
      dist = compute_dist(coordinates(i, 1), coordinates(j, 1), coordinates(i, 0), coordinates(j, 0));
      if(dist >= dist_threshold) {
        dists.push_back(dist);
        dist_i.push_back(i);
        dist_j.push_back(j);
      }
    }
  }
  unsigned int mat_size = dist_i.size();
  arma::umat index_mat(2, mat_size * 2);
  arma::vec dists_vec(mat_size * 2);
  unsigned int j {};
  for(unsigned int i {0}; i < mat_size; i++) {
    j = i * 2;
    index_mat.at(0, j) = dist_i[i];
    index_mat.at(1, j) = dist_j[i];
    index_mat.at(0, j + 1) = dist_j[i];
    index_mat.at(1, j + 1) = dist_i[i];
    dists_vec.at(j) = dists[i];
    dists_vec.at(j + 1) = dists[i];
  }
  arma::sp_mat dist_mat(index_mat, dists_vec, n_points, n_points);
  return dist_mat;
}

Because the function does not know ex ante how many distances are above the threshold, it first stores the non-zero values in standard vectors and then constructs the Armadillo objects from them.

I parallelize the function as follows:

// [[Rcpp::export]]
arma::sp_mat compute_dist_spmat(arma::mat &coordinates, unsigned int n_points, double dist_threshold, unsigned short int n_threads) {
  std::vector<std::vector<double>> dists(n_points);
  std::vector<std::vector<unsigned int>> dist_j(n_points);
  double dist {};
  unsigned int i {};
  unsigned int j {};
  # pragma omp parallel for private(dist, i, j) num_threads(n_threads) if(n_threads > 1)
  for(i = 0; i < n_points; i++) {
    for(j = i + 1; j < n_points; j++) {
      dist = compute_dist(coordinates(i, 1), coordinates(j, 1), coordinates(i, 0), coordinates(j, 0));
      if(dist >= dist_threshold) {
        dists[i].push_back(dist);
        dist_j[i].push_back(j);
      }
    }
  }
  std::vector<unsigned int> vec_intervals(n_points + 1);
  vec_intervals[0] = 0;
  for (i = 0; i < n_points; i++) {
    vec_intervals[i + 1] = vec_intervals[i] + dist_j[i].size();
  }
  unsigned int mat_size {vec_intervals[n_points]};
  arma::umat index_mat(2, mat_size * 2);
  arma::vec dists_vec(mat_size * 2);
  unsigned int vec_begins_i {};
  unsigned int vec_length_i {};
  unsigned int k {};
  # pragma omp parallel for private(i, j, k, vec_begins_i, vec_length_i) num_threads(n_threads) if(n_threads > 1)
  for(i = 0; i < n_points; i++) {
    vec_begins_i = vec_intervals[i];
    vec_length_i = vec_intervals[i + 1] - vec_begins_i;
    for(j = 0; j < vec_length_i; j++) {
      k = (vec_begins_i + j) * 2;
      index_mat.at(0, k) = i;
      index_mat.at(1, k) = dist_j[i][j];
      index_mat.at(0, k + 1) = dist_j[i][j];
      index_mat.at(1, k + 1) = i;
      dists_vec.at(k) = dists[i][j];
      dists_vec.at(k + 1) = dists[i][j];
    }
  }
  arma::sp_mat dist_mat(index_mat, dists_vec, n_points, n_points);
  return dist_mat;
}
  5. Using dynamic vectors in the loop is thread-safe (for contrast, a sketch with shared vectors and explicit synchronization follows after this list).
  6. dist, i, j, k, vec_begins_i, and vec_length_i are all to be initialized above the # pragma line and then declared private rather than initializing them in the loops.
  7. Nothing has to be marked as a section.
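For contrast with assumption 5: if all threads pushed into the single shared vectors of the serial version, the insertions would need explicit synchronization, e.g. a critical section. The vector-of-vectors layout above is meant to avoid exactly that, since each thread only writes dists[i] and dist_j[i] for its own values of i. A sketch of the shared-vector case:

# pragma omp parallel for num_threads(n_threads) if(n_threads > 1)
for(unsigned int i = 0; i < n_points; i++) {
  for(unsigned int j = i + 1; j < n_points; j++) {
    double dist = compute_dist(coordinates(i, 1), coordinates(j, 1), coordinates(i, 0), coordinates(j, 0));
    if(dist >= dist_threshold) {
      // push_back on shared vectors is not thread-safe, so it must be serialized
      # pragma omp critical
      {
        dists.push_back(dist);
        dist_i.push_back(i);
        dist_j.push_back(j);
      }
    }
  }
}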

Are any of the seven statements incorrect?

Chr
  • 3 and 6 are wrong: it works either way, but declaring the variables/objects inside the loop/parallel region is better style and less verbose as it doesn't need the `private` clause. – paleonix Oct 20 '21 at 16:21
  • 5 is wrong: `std::vector::push_back` and other methods are generally not thread-safe. You just use it in a way that works, because there can be no conflicts between threads (as far as I can see). – paleonix Oct 20 '21 at 16:25
  • I don't think the `if(n_threads > 1)` clause should make any difference. Also, to avoid load imbalance one should use e.g. `dynamic` scheduling when there is an inner loop with a dependence on the outer (parallelized) loop. – paleonix Oct 20 '21 at 16:31
  • I have the vague feeling that we already have parallelised distance matrix helpers at CRAN but I can't think of a specific name now (and don't have time to look). It is an _embarrassingly parallel_ problem so I think this has been done before... – Dirk Eddelbuettel Oct 20 '21 at 20:16
  • Thanks for the comments and suggestions. @DirkEddelbuettel The major packages in the domain, at least, appear to compute distance matrices serially. And I asked this question mainly to clarify a few issues concerning OpenMP. – Chr Oct 21 '21 at 11:12
  • If you are OK with ditching OpenMP (it's STILL painful on Mac) I recommend `tbb::concurrent_vector`, which you can access from `RcppParallel`. I believe this to be cleaner and more easily maintained than the vector-of-vectors approach (it works, but is clunky). See here: https://stackoverflow.com/a/49285605/2723734 – thc Oct 24 '21 at 00:04
  • @thc I did not know that OpenMP had problems on Mac. Thanks for the hint. – Chr Oct 24 '21 at 11:35
  • @Chr have a look at the `ParallelDist` package. It uses OpenMP parallelization in C++. You can even specify an XPtr C++ function for custom distance functions. That said, you can beat the performance of ParallelDist with some serious chops, but at least it provides a framework for further experimentation. To really improve your function: 1) move to the Eigen C++ library (better vectorization), 2) use fewer temporaries and do everything inline, 3) move to sparse matrix iterators; they will use fewer temporaries than your code shown here. Look at Boost::ForwardTraversalIterator for inspiration. – zdebruine Nov 08 '21 at 12:06
  • btw OpenMP automatically determines private members without any help, _so long as_ you are not doing dynamic memory allocation inside the `parallel` loop structure (which you shouldn't be doing anyway in this case). – zdebruine Nov 08 '21 at 12:08

1 Answer


The following does not directly answer your question (it's just some dev code I copied from a personal GitHub repo), but it makes several points clear that may be of use in your application:

  • OpenMP automatically determines private members so long as you are not doing any dynamic memory allocation within the parallel loop
  • For sparse matrix distance calculations, it becomes important to move beyond a simple calculation of distance at each non-zero index and instead consider the structure of sparsity that is expected, and optimize for that. In the example below, I assume both matrices are very sparse and their intersection is less than their union. Thus, I "precondition" each distance calculation with squared column sums (for calculating Euclidean distance), and then adjust the calculation for the intersection only. This avoids complicated iterator structures and is very fast.
  • Using as few temporaries as possible is much to your benefit, and sparse matrix iterators do as good a job of this as any alternative code anyone may ever write.
  • Eigen provides better vectorization than Armadillo (across the board, I might add), which means you want Eigen instead of Armadillo if those last 20% of performance gains are important to you.

This function calculates the Euclidean distance between all unique pairs of columns in an Eigen::SparseMatrix<double> object:

#include <Eigen/Dense>
#include <Eigen/Sparse>
#include <cmath>

// sparse column-wise Euclidean distance between all columns
Eigen::MatrixXd distance(Eigen::SparseMatrix<double>& A) {
    Eigen::MatrixXd dists(A.cols(), A.cols());
    // zero-initialize the squared column sums before accumulating into them
    Eigen::VectorXd sq_colsums = Eigen::VectorXd::Zero(A.cols());
    for (int col = 0; col < A.cols(); ++col)
        for (Eigen::SparseMatrix<double>::InnerIterator it(A, col); it; ++it)
            sq_colsums(col) += it.value() * it.value();

    #pragma omp parallel for
    for (unsigned int i = 0; i < (A.cols() - 1); ++i) {
        for (unsigned int j = (i + 1); j < A.cols(); ++j) {
            double dist = sq_colsums(i) + sq_colsums(j);
            Eigen::SparseMatrix<double>::InnerIterator it1(A, i), it2(A, j);
            while (it1 && it2) {
                if (it1.row() < it2.row()) ++it1;
                else if (it1.row() > it2.row()) ++it2;
                else {
                    dist -= it1.value() * it1.value();
                    dist -= it2.value() * it2.value();
                    dist += std::pow(it1.value() - it2.value(), 2);
                    ++it1; ++it2;
                }
            }
            dists(i, j) = std::sqrt(dist);
            dists(j, i) = dists(i, j);
        }
    }
    dists.diagonal().array() = 1;
    return dists;
}
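If you want to call this from R the same way you call your Armadillo functions, a minimal wrapper via RcppEigen could look like the sketch below (dist_from_r is just a placeholder name; RcppEigen converts a dgCMatrix from the Matrix package to Eigen::SparseMatrix<double> and an Eigen::MatrixXd back to a numeric matrix):

// [[Rcpp::depends(RcppEigen)]]
// [[Rcpp::plugins(openmp)]]
#include <RcppEigen.h>   // pulls in the Eigen headers used above

// [[Rcpp::export]]
Eigen::MatrixXd dist_from_r(Eigen::SparseMatrix<double> A) {
  return distance(A);
}

Compiling with Rcpp::sourceCpp() and passing a dgCMatrix from R should then be enough.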

As Dirk and others have said, there are packages out there (e.g. ParallelDist) that seem to do everything you're after (for dense matrices). Look at wordspace for fast cosine distance calculations. See here for some comparisons. Cosine distance is easy to calculate efficiently in R without Rcpp using crossprod operations (see the qlcMatrix::cosSparse source code for algorithmic inspiration).

zdebruine