4

I am trying to program the naive parallel version of Conjugate gradient, so I started with the simple Wikipedia algorithm, and I want to change the dot-products and MatrixVector products by their appropriate parallel version, The Rcppparallel documentation has the code for the dot-product using parallelReduce; I think I'm gonna use that version for my code, but I'm trying to make the MatrixVector multiplication, but I haven't achieved good results compared to R base (no parallel)

Some versions of parallel matrix multiplication: using OpenMP, Rcppparallel, serial version, a serial version with Armadillo, and the benchmark

// [[Rcpp::depends(RcppParallel)]]
#include <Rcpp.h>
#include <RcppParallel.h>
#include <numeric>
// #include <cstddef>
// #include <cstdio>
#include <iostream>
using namespace RcppParallel;
using namespace Rcpp;

struct InnerProduct : public Worker
{   
   // source vectors
   const RVector<double> x;
   const RVector<double> y;

   // product that I have accumulated
   double product;

   // constructors
   InnerProduct(const NumericVector x, const NumericVector y) 
      : x(x), y(y), product(0) {}
   InnerProduct(const InnerProduct& innerProduct, Split) 
      : x(innerProduct.x), y(innerProduct.y), product(0) {}

   // process just the elements of the range I've been asked to
   void operator()(std::size_t begin, std::size_t end) {
      product += std::inner_product(x.begin() + begin, 
                                    x.begin() + end, 
                                    y.begin() + begin, 
                                    0.0);
   }

   // join my value with that of another InnerProduct
   void join(const InnerProduct& rhs) { 
     product += rhs.product; 
   }
};

struct MatrixMultiplication : public Worker
{
   // source matrix
   const RMatrix<double> A;

    //source vector
   const RVector<double> x;

   // destination matrix
   RMatrix<double> out;

   // initialize with source and destination
   MatrixMultiplication(const NumericMatrix A, const NumericVector x, NumericMatrix out) 
     : A(A), x(x), out(out) {}

   // take the square root of the range of elements requested
   void operator()(std::size_t begin, std::size_t end) {
      for (std::size_t i = begin; i < end; i++) {
            // rows we will operate on
            //RMatrix<double>::Row rowi = A.row(i);
            RMatrix<double>::Row rowi = A.row(i);

            //double res = std::inner_product(rowi.begin(), rowi.end(), x.begin(), 0.0);
            //Rcout << "res" << res << std::endl;
            out(i,1) = std::inner_product(rowi.begin(), rowi.end(), x.begin(), 0.0);
            //Rcout << "res" << out(i,1) << std::endl;
      }
    }  
};

// [[Rcpp::export]]
double parallelInnerProduct(NumericVector x, NumericVector y) {

   // declare the InnerProduct instance that takes a pointer to the vector data
   InnerProduct innerProduct(x, y);

   // call paralleReduce to start the work
   parallelReduce(0, x.length(), innerProduct);

   // return the computed product
   return innerProduct.product;
}
//librar(Rbenchmark)

// [[Rcpp::export]]
NumericVector matrixXvectorRcppParallel(NumericMatrix A, NumericVector x) {

   // // declare the InnerProduct instance that takes a pointer to the vector data
   // InnerProduct innerProduct(x, y);
   int nrows = A.nrow();
   NumericVector out(nrows);
   for(int i = 0; i< nrows;i++ )
   {
      out(i) = parallelInnerProduct(A(i,_),x);
   }
   // return the computed product
   return out;
}

// [[Rcpp::export]]
arma::rowvec matrixXvectorParallel(arma::mat A, arma::colvec x){
    arma::rowvec y = A.row(0)*0;
    int filas = A.n_rows;
    int columnas = A.n_cols;
    #pragma omp parallel for
    for(int j=0;j<columnas;j++)
    {
        //y(j) = A.row(j)*x(j))
        y(j) = dotproduct(A.row(j),x);
    }
    return y;
} 

arma::mat matrixXvector2(arma::mat A, arma::mat x){
  //arma::rowvec y = A.row(0)*0;
  //y=A*x;
  return A*x;
}

arma::rowvec matrixXvectorParallel2(arma::mat A, arma::colvec x){
    arma::rowvec y = A.row(0)*0;
    int filas = A.n_rows;
    int columnas = A.n_cols;

 #pragma omp parallel for
    for(int j = 0; j < columnas ; j++){
        double result = 0;
        for(int i = 0; i < filas; i++){
                result += x(i)*A(j,i);   
        }
        y(j) = result;
    }
    return y;
}

Benchmark

                             test replications elapsed relative user.self sys.self user.child sys.child
1                         M %*% a           20   0.026    1.000     0.140    0.060          0         0
2 matrixXvector2(M, as.matrix(a))           20   0.040    1.538     0.101    0.217          0         0
4    matrixXvectorParallel2(M, a)           20   0.063    2.423     0.481    0.000          0         0
3     matrixXvectorParallel(M, a)           20   0.146    5.615     0.745    0.398          0         0
5 matrixXvectorRcppParallel(M, a)           20   0.335   12.885     2.305    0.079          0         0

My last trial at the moment was using parallefor with Rcppparallel, but I'm getting memory errors and I dont have idea where the problem is

// [[Rcpp::export]]
NumericVector matrixXvectorRcppParallel2(NumericMatrix A, NumericVector x) {

   // // declare the InnerProduct instance that takes a pointer to the vector data
   int nrows = A.nrow();
   NumericMatrix out(nrows,1); //allocar mempria de vector de salida
   //crear worker
   MatrixMultiplication matrixMultiplication(A, x, out);


   parallelFor(0,A.nrow(),matrixMultiplication);

   // return the computed product
   return out;
}    

What I notice is that when I check in my terminal using htop how the processors are working, I see in htop when I apply the conventional Matrix vector multiplication using R-base, that is using all the processors, so Does the matrix multiplication perform parallel by default? because in theory, only one processor should be working if is the serial version.

Processors usage

If someone knows which is the better path, OpenMP or Rcppparallel, or another way, that gives me better performance than the apparently serial version of R-base.

The serial code for conjugte gradient at the moment

// [[Rcpp::export]]
arma::colvec ConjugateGradient(arma::mat A, arma::colvec xini, arma::colvec b, int num_iteraciones){
    //arma::colvec xnew = xini*0 //inicializar en 0's
    arma::colvec x= xini; //inicializar en 0's
    arma::colvec rkold = b - A*xini;
    arma::colvec rknew = b*0;
    arma::colvec pk = rkold;
    int k=0;
    double alpha_k=0;
    double betak=0;
    double normak = 0.0;

    for(k=0; k<num_iteraciones;k++){
         Rcout << "iteracion numero " << k << std::endl;
        alpha_k =  sum(rkold.t() * rkold) / sum(pk.t()*A*pk); //sum de un elemento para realizar casting
        (pk.t()*A*pk);
        x = x+ alpha_k * pk;
        rknew = rkold - alpha_k*A*pk;
        normak =  sum(rknew.t()*rknew);
        if( normak < 0.000001){
            break;
        }
        betak = sum(rknew.t()*rknew) / sum( rkold.t() * rkold );

        //actualizar valores para siguiente iteracion
        pk = rknew + betak*pk;
        rkold = rknew;

    }

    return x;

}

I wasn't aware of the use of BLAS in R, thanks Hong Ooi and tim18, so the new benchmark using option(matprod="internal") and option(matprod="blas")

options(matprod = "internal")
res<-benchmark(M%*%a,matrixXvector2(M,as.matrix(a)),matrixXvectorParallel(M,a),matrixXvectorParallel2(M,a),matrixXvectorRcppParallel(M,a),order="relative",replications = 20)
res

                             test replications elapsed relative user.self sys.self user.child sys.child
2 matrixXvector2(M, as.matrix(a))           20   0.043    1.000     0.107    0.228          0         0
4    matrixXvectorParallel2(M, a)           20   0.069    1.605     0.530    0.000          0         0
1                         M %*% a           20   0.072    1.674     0.071    0.000          0         0
3     matrixXvectorParallel(M, a)           20   0.140    3.256     0.746    0.346          0         0
5 matrixXvectorRcppParallel(M, a)           20   0.343    7.977     2.272    0.175          0         0

options(matprod="blas")

options(matprod = "blas")

res<-benchmark(M%*%a,matrixXvector2(M,as.matrix(a)),matrixXvectorParallel(M,a),matrixXvectorParallel2(M,a),matrixXvectorRcppParallel(M,a),order="relative",replications = 20)
res
                             test replications elapsed relative user.self sys.self user.child sys.child
1                         M %*% a           20   0.021    1.000     0.093    0.054          0         0
2 matrixXvector2(M, as.matrix(a))           20   0.092    4.381     0.177    0.464          0         0
5 matrixXvectorRcppParallel(M, a)           20   0.328   15.619     2.143    0.109          0         0
4    matrixXvectorParallel2(M, a)           20   0.438   20.857     3.036    0.000          0         0
3     matrixXvectorParallel(M, a)           20   0.546   26.000     3.667    0.127          0         0
  • 2
    1. How big are those matrices you're testing with? 2. Are you using a multithreaded BLAS (eg the one that ships with Microsoft R)? – Hong Ooi Apr 07 '18 at 08:54
  • 2
    As previous poster hinted, enough multithreaded blas implementations are available that we needn't help you learn how openmp works with various compilers in these situations. Some are free, such as mkl and acml, and may support other threading models besides openmp. – tim18 Apr 07 '18 at 11:38
  • I'm using R 3.4.4 (No Microsoft R) (I'm using R inside a docker containter (https://hub.docker.com/r/rocker/verse/) because I'm having trouble to install some libraries), the test was a matrix 1200x1200 multiplied with a vector of 1200 elements, as far as I know, I'm not using BLAS, but one of my doubts is if R by default use multithread to compute matrix mutliplication because htop appear to notice usage of all processors when I test that. – Adrián Rodriguez Villarreal Apr 07 '18 at 15:54
  • Thanks Hong Ooi and Tim18, I have found in the documentation that in fact is using blas, I wasn't aware of this, so I run a new benchmark using an option that avoid use blass, I will edit my post with the new values – Adrián Rodriguez Villarreal Apr 07 '18 at 16:40
  • for Rcpp did you edit the file cat `.R/Makevars` for OpenMP. My file has one line `CXXFLAGS= -Wall -std=c++14 -O3 -march=native -ffp-contract=fast -fopenmp` – Z boson Apr 09 '18 at 12:36

1 Answers1

3

As you already found out, the base R matrix multiplication can be multi-threaded, if a multi-threaded BLAS implementation is used. This is the case for the rocker/* docker images, which typically use OpenBLAS.

In addition, (Rcpp)Armadillo already uses the BLAS library used by R (in this case multi-threaded OpenBLAS) as well as OpenMP. So your "serial" version is actually multi-threaded. You can verify this in htop with a large enough matrix as input.

BTW, what you are trying to do looks like premature optimization to me.

Ralf Stubner
  • 26,263
  • 3
  • 40
  • 75
  • many thanks Ralf, I did what you say previously, the image that I post is what is showing, but is good to know the why, I think that your answer will close my doubt, and Yes you are right regarding to what I'm doing is premature, but I haven't understand a block version of the alogrithm, so for now I was trying to only do parallel the product and dot-products, but is good to know that given rocker is using OpenBLAS multithreaded by default, what i was trying to do is useless, so I will need to try something different – Adrián Rodriguez Villarreal Apr 09 '18 at 19:00