
I'm trying to find an efficient way to compute X^T * W * X, where X is a dense matrix of size e.g. 10,000 x 10 and W is a diagonal matrix (I store only the diagonal in a vector).

For now, I use this function:

arma::mat& getXtW(const arma::mat& covar,
                  const arma::vec& w,
                  arma::mat& tcovar,
                  size_t n, size_t K) {
  size_t i, k;

  for (i = 0; i < n; i++) {
    for (k = 0; k < K; k++) {
      tcovar(k, i) = covar(i, k) * w(i);
    }
  }

  return tcovar;
}

and compute

tcovar = getXtW(covar, w, tcovar, n, K);
cprod = tcovar * covar;

Yet, this does not seem optimal.

PS: You can see the whole code there.

Edit1: Seems I can use covar.t() * (covar.each_col() % w), but this doesn't seem to be much faster.

Edit2: If I implement it myself with loops in Rcpp:

arma::mat testProdW2(const arma::mat& x, const arma::vec& w) {

  int n = x.n_rows;
  int K = x.n_cols;
  arma::mat res(K, K);
  double tmp;
  // Compute the lower triangle (j >= k) of the symmetric result.
  for (int k = 0; k < K; k++) {
    for (int j = k; j < K; j++) {
      tmp = 0;
      for (int i = 0; i < n; i++) {
        tmp += x(i, j) * w[i] * x(i, k);
      }
      res(j, k) = tmp;
    }
  }

  // Mirror the lower triangle into the upper triangle.
  for (int k = 0; k < K; k++) {
    for (int j = 0; j < k; j++) {
      res(j, k) = res(k, j);
    }
  }

  return res;
}

This is slower than the first implementation.

F. Privé
  • Back in the day, that is the type of function that folks like my thesis advisor had special-cased. That probably still pays -- by keeping track of the indices you can skip the transpose, and just fold in `W` as you move along. – Dirk Eddelbuettel Jan 23 '18 at 21:55
  • @DirkEddelbuettel Not sure I understand. Can you rephrase please? – F. Privé Jan 23 '18 at 21:56
  • Imagine writing a matrix mult routine by hand. Now imagine doing it for a matrix and its transpose. You can optimise. Now imagine adding a vector of diagonal weights. You can still optimise. – Dirk Eddelbuettel Jan 23 '18 at 21:58
  • @DirkEddelbuettel Would you advise implementing this specific product myself (with something like 3 for loops)? – F. Privé Jan 23 '18 at 22:03
  • Please explain what else question was about if not this. – Dirk Eddelbuettel Jan 23 '18 at 22:11
  • @DirkEddelbuettel I was hoping that someone knew some clever trick in Armadillo to do this (something like my first edit). – F. Privé Jan 23 '18 at 22:13
  • @DirkEddelbuettel Also see my second edit. – F. Privé Jan 23 '18 at 22:16
  • Did you even try to google this? https://stackoverflow.com/questions/47013581/blas-matrix-by-matrix-transpose-multiply – Dirk Eddelbuettel Jan 23 '18 at 22:19
  • As I said, it is a common problem, and people have been optimising this (for speed and memory) for a long time. I probably ... would still stick with a normal implementation. Premature optimization and all that. – Dirk Eddelbuettel Jan 23 '18 at 22:21
  • Yet, I was searching for a specific Armadillo implementation. I don't know how to use such subroutines in my Rcpp code. – F. Privé Jan 23 '18 at 22:23
  • Is this question still relevant? With the solution from your "Edit1", a matrix with 10,000 rows and 10 columns takes about 1 ms, which is already quite fast. – Ralf Stubner Jul 15 '18 at 18:25
  • @RalfStubner The problem is that I need to do this operation millions of times. – F. Privé Jul 16 '18 at 07:20

1 Answer


According to BLAS matrix by matrix transpose multiply, there is no BLAS routine that does this directly. Instead, the suggestion there is to loop over the rows of X and use dsyr. I found this an interesting question, since I know how to link BLAS in Rcpp but had not done so using RcppArmadillo. Stack Overflow has an answer for that as well: Rcpparmadillo: can't call Fortran routine "dgebal"?. Note: I have not checked, but I expect that dsyr is not part of the BLAS subset that comes with R, so this will only work if your R is linked against a full BLAS implementation.

Combining this we get:

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
#include <Rcpp/Benchmark/Timer.h>

#ifdef ARMA_USE_LAPACK
#if !defined(ARMA_BLAS_CAPITALS)
 #define arma_dsyr dsyr
#else
 #define arma_dsyr DSYR
#endif

extern "C"
void arma_fortran(arma_dsyr)(const char* uplo, const int* n, const double* alpha, const double* X, const int* incX, double* A, const int* ldA);

#endif


// [[Rcpp::export]]
Rcpp::NumericVector getXtWX(const arma::mat& X, const arma::vec& w) {
  Rcpp::Timer timer;
  timer.step("start"); 

  arma::mat result1 = X.t() * (X.each_col() % w);
  timer.step("Armadillo result");

  const int n = X.n_rows;
  const int k = X.n_cols;
  arma::mat result(k, k, arma::fill::zeros);  
  for (int i = 0; i < n; ++i) {
    F77_CALL(dsyr)("U", &k, &w(i), &X(i,0), &n, result.memptr(), &k);
  }
  result = arma::symmatu(result);
  timer.step("BLAS result");
  Rcpp::NumericVector res(timer);
  return res;
}

/*** R
n <- 10000
k <- 10
X <- matrix(runif(n*k), n, k)
w <- runif(n)
Reduce(rbind, lapply(1:6, function(x) diff(getXtWX(X, w))/1e6))
*/

However, for me the BLAS solution is quite a bit slower:

> Reduce(rbind, lapply(1:6, function(x) diff(getXtWX(X, w))/1e6))
     Armadillo result BLAS result
init         1.291243    6.666026
             1.176143    6.623282
             1.102111    6.644165
             1.094917    6.612596
             1.098619    6.588431
             1.069286    6.615529

I tried to improve this by first transposing the matrix, in the hope that memory access would be faster when looping over columns (Armadillo stores matrices in column-major order), but this did not make a difference on my (low-powered) system.

Ralf Stubner