6

What is the fastest method for matrix multiplication of an Eigen::Matrix over a random set of column indices?

Eigen::MatrixXd mat = Eigen::MatrixXd::Random(100, 1000);
// vector of random indices (linspaced here for brevity)
Eigen::VectorXi idx = VectorXi::LinSpaced(8,1000,9);

I'm using RcppEigen and R, which is still on a 3.x version of Eigen (no support for () with index arrays), and regardless, my understanding is that the () operator still performs a deep copy.

Right now I'm doing a deep copy and generating a new matrix with data only for columns in idx:

template <typename T>
inline Eigen::Matrix<T, -1, -1> subset_cols(const Eigen::Matrix<T, -1, -1>& x, const std::vector<size_t>& cols) {
    Eigen::Matrix<T, -1, -1> y(x.rows(), cols.size());
    for (size_t i = 0; i < cols.size(); ++i)
        y.col(i) = x.col(cols[i]);
    return y;
}

and then doing matrix multiplication:

Eigen::MatrixXd sub_mat = subset_cols(mat, idx);
Eigen::MatrixXd a = sub_mat * sub_mat.transpose();

a is what I want.

There must be some way to avoid a deep copy and instead use Eigen::Map?

Edit 5/9/22: In reply to @Markus, who proposed an approach using raw data access and Eigen::Map. The proposed solution is a bit slower than matrix multiplication of a deep copy. Benchmarking here is done with Rcpp code and R:

//[[Rcpp::depends(RcppClock)]]
#include <RcppClock.h>

//[[Rcpp::export]]
void bench(Eigen::MatrixXd mat, Eigen::VectorXi idx){
  Rcpp::Clock clock;
  size_t reps = 100;
  while(reps-- > 0){
    clock.tick("copy");
    Eigen::MatrixXd sub_mat = subset_cols(mat, idx);
    Eigen::MatrixXd a = sub_mat * sub_mat.transpose();
    clock.tock("copy");
    
    clock.tick("map");
    double *b_raw = new double[mat.rows() * mat.rows()];
    Eigen::Map<Eigen::MatrixXd> b(b_raw, mat.rows(), mat.rows());
    subset_AAt(b_raw, mat, idx);
    clock.tock("map");
  }
  clock.stop("clock");
}

Here are three runs of a 100,000-column matrix with 100 rows. We are doing matrix multiplication on (1) a subset of 10 columns, (2) a subset of 1000 columns, and (3) a subset of 10000 columns.

R:

bench(
  matrix(runif(100000 * 100), 100, 100000), 
  sample(100000, 10) - 1)

# Unit: microseconds 
# ticker   mean     sd   min    max neval
#    copy  31.65  4.376 30.15  69.46   100
#     map 113.46 21.355 68.54 166.29   100

bench(
  matrix(runif(100000 * 100), 100, 100000), 
  sample(100000, 1000) - 1)

#  Unit: milliseconds 
#  ticker  mean     sd   min   max neval
#    copy 2.361 0.5789 1.972  4.86   100
#     map 9.495 2.4201 7.962 19.90   100

bench(
  matrix(runif(100000 * 100), 100, 100000), 
  sample(100000, 10000) - 1)

#  Unit: milliseconds 
#  ticker   mean     sd    min   max neval
#    copy  23.04  2.774  20.95  42.4   100
#     map 378.14 19.424 351.56 492.0   100

I benchmarked on a few machines with similar results. Above results are from a good HPC node.

Edit: 5/10/2022 Here is a code snippet that performs matrix multiplication for a subset of columns as quickly as any code not directly using the Eigen BLAS:

template <typename T>
Eigen::Matrix<T, -1, -1> subset_AAt(const Eigen::Matrix<T, -1, -1>& A, const Eigen::VectorXi& cols) {
  const size_t n = A.rows();
  Eigen::Matrix<T, -1, -1> AAt(n, n);
  for (size_t k = 0; k < cols.size(); ++k) {
    const T* A_data = A.data() + cols(k) * n;
    for (size_t i = 0; i < n; ++i) {
      T tmp_i = A_data[i];
      for (size_t j = 0; j <= i; ++j) {
        AAt(i * n + j) += tmp_i * A_data[j];
      }
    }
  }
  return AAt;
}
zdebruine
  • 3,687
  • 6
  • 31
  • 50
  • 1
    I played around with it a bit. `Eigen::Map` will not work because the strides are non-equidistant. Using [slicling](https://eigen.tuxfamily.org/dox/group__TutorialSlicingIndexing.html) gives me ~10% better performance than your `subset_cols()` way on Linux with clang and gcc, but worse on MSVC. As you noted, it is not available on the 3.3 branch. There is a [custom](https://eigen.tuxfamily.org/dox/TopicCustomizing_NullaryExpr.html#title1) way to mimic it, but it performed always worse in my tests. The best improvement (~1.5x faster) I get by enabling AVX (maybe you could even enable AVX512?). – Sedenion May 10 '22 at 19:59
  • @Sedenion thanks for your effort in benchmarking alternative approaches. Your ideas make sense, but it seems like any gains may just be very marginal. Yes, in my personal use I'm working with enabled AVX and also Intel MKL but performance for the average user is my first concern. – zdebruine May 10 '22 at 20:19

2 Answers2

3

Exploiting symmetry

You can exploit that the resulting matrix will be symmetric like so:

Mat sub_mat = subset_cols(mat, idx); // From your original post
Mat a = Mat::Zero(numRows, numRows);
a.selfadjointView<Eigen::Lower>().rankUpdate(sub_mat); // (1)
a.triangularView<Eigen::Upper>() = a.transpose(); // (2)

Line (1) will compute a += sub_mat * sub_mat.transpose() for the lower part only. (2) will then write the lower part to the upper part. Also see the documentation (here and here). Of course, if you can live with only the lower part, step (2) can be omitted.

For a 100x100000 matrix mat, I get a speed up of a factor of roughly

  • ~1.1x when taking 10 columns,
  • ~1.5x when taking 100 columns,
  • ~1.7x when taking 1000 columns

both on Windows using MSVC and on Linux using clang with full optimizations and AVX.

Enabling parallelization

Another way to speed up the computation is to enable parallelization by compiling with OpenMP. Eigen takes care of the rest. The code above that exploits the symmetry does not benefit from it, however. But the original code

Eigen::MatrixXd sub_mat = subset_cols(mat, idx);
Eigen::MatrixXd a = sub_mat * sub_mat.transpose();

does.

For a 100x100000 matrix mat, using clang on Linux, running with 4 threads (on 4 real cores) and comparing to a single thread, I get a speed up of a factor of roughly

  • ~1.0x when taking 10 columns, i.e. no speed up at all
  • ~1.8x when taking 100 columns
  • ~2.0x when taking 1000 columns

In other words, 4 cores or more outperform the symmetric method shown above except for a very small number of columns. Using only 2 cores was always slower. Note that using SMT hurt the performance in my tests, sometimes notably.

Other notes

I already wrote this in the comment, but for the sake of completeness: Eigen::Map will not work because the strides are non-equidistant. Using slicing gives me ~10% better performance than your copying method on Linux with clang and gcc, but somewhat worse on MSVC. Also, as you noted, it is not available on the 3.3 branch of Eigen. There is a custom way to mimic it, but it performed always worse in my tests. Also, in my tests, it did not save any memory compared to the copying method.

I think it is hard to beat the copying method itself regarding performance because the Eigen matrices are column major by default, meaning that copying a few columns is rather cheap. Moreover, without really knowing details, I suspect that Eigen can then throw the full might of its optimization on the full matrix to compute the product and transpose without having to deal with views or anything like this. This might give Eigen more chances for vectorization or cache locality.

Apart from this, not only optimizations should be turned on but also the highest possible instruction set should be used. Turning on AVX in my tests improved the performance by ~1.5x. Unfortunately, I cannot test AVX512.

Sedenion
  • 5,421
  • 2
  • 14
  • 42
  • Very nice. The point on symmetry is really effective, definitely helps. Thanks! – zdebruine May 11 '22 at 01:25
  • @zdebruine I edited my post with another way to speed up the computation by enabling parallelization via OpenMP. – Sedenion May 11 '22 at 19:01
  • To be honest parallelization is the way forward for matrix mul. If you can use OpenCL, you will find plenty of optimized implementations that use shared memory of the compute cores of your GPU hardware, and with OpenCL you can also fall back to the CPU if necessary. There are other options but massive parallel is the right answer imho, especially when you have lots of matrices that are not interdependent. – StarShine May 12 '22 at 07:54
  • 1
    @zdebruine If my answer is fine for you, could you please accept it? – Sedenion May 13 '22 at 06:10
  • @Sedenion of course, this is much appreciated. Will be hitting production in a well-used package soon :) – zdebruine May 16 '22 at 13:32
1

In case anyone finds this helpful down the road, I was able to beat the performance of the Eigen code in the accepted question using OpenMP and triangular indexing. In this case I'm using Rcpp::NumericMatrix, but you could plug Eigen::MatrixXd right in:

    Rcpp::NumericMatrix Rcpp_AAt(const Rcpp::NumericMatrix& mat) {
    const size_t n = mat.cols();
    const size_t n_vals = n / 2 * (1 + n) - n;
    Rcpp::NumericMatrix res(n, n);
    #pragma omp parallel for
    for (size_t k = 0; k < (n_vals + n); ++k) {
        // k is linear index
        if (k >= n_vals) {
            size_t i = k - n_vals;
            double tmp = 0;
            for (size_t row = 0; row < mat.rows(); ++row)
                tmp += mat(row, i) * mat(row, i);
            res(i, i) = tmp;
        } else {
            size_t i = n - 2 - std::floor(std::sqrt(-8 * k + 4 * n * (n - 1) - 7) / 2.0 - 0.5);
            size_t j = k + i + 1 - n * (n - 1) / 2 + (n - i) * ((n - i) - 1) / 2;
            double tmp = 0;
            for (size_t row = 0; row < mat.rows(); ++row)
                tmp += mat(row, i) * mat(row, j);
            res(i, j) = tmp;
            res(j, i) = tmp;
        }
    }
    return res;
}

By using triangular indexing, we are allowing for OpenMP to spawn off threads for all combinations of columns, which is more efficient than just parallelizing across one column at a time (for obvious reasons). Eigen uses multithreading, so I figure this is fair game.

zdebruine
  • 3,687
  • 6
  • 31
  • 50