
I've been using the ConjugateGradient solver in Eigen 3.2 and decided to try upgrading to Eigen 3.3.3 with the hope of benefiting from the new multi-threading features.

Sadly, the solver seems slower (~10%) when I enable -fopenmp with GCC 4.8.4. Looking at xosview, I see that all 8 CPUs are being used, yet performance is slower.

After some testing, I discovered that if I disable compiler optimization (use -O0 instead of -O3), then -fopenmp does speed up the solver by ~50%.

Of course, it's not really worth disabling optimization just to benefit from multi-threading, since that would be even slower overall.

Following advice from https://stackoverflow.com/a/42135567/7974125, I am storing the full sparse matrix and passing Lower|Upper as the UpLo parameter.

I've also tried each of the three preconditioners, as well as RowMajor matrices, to no avail (see the sketch below).
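
Roughly, the solver-type variations looked like this (just a sketch; I'm assuming the three preconditioners in question are the default DiagonalPreconditioner, IdentityPreconditioner and IncompleteCholesky):

#include <Eigen/Sparse>  // ConjugateGradient, IncompleteCholesky and the basic preconditioners

using namespace Eigen;

typedef SparseMatrix<double, RowMajor> SpMat;  // also tried the default ColMajor

int main()
{
  // Template parameters are <MatrixType, UpLo, Preconditioner>;
  // the full matrix is stored, so Lower|Upper is passed as UpLo.
  ConjugateGradient<SpMat, Lower|Upper> cg_diag;                              // default DiagonalPreconditioner
  ConjugateGradient<SpMat, Lower|Upper, IdentityPreconditioner> cg_none;      // no preconditioning
  ConjugateGradient<SpMat, Lower|Upper, IncompleteCholesky<double>> cg_ichol; // incomplete Cholesky
  return 0;
}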

Is there anything else to try to get the full benefits of both multi-threading and compiler optimization?

I cannot post my actual code, but this is a quick test using the Laplacian example from Eigen's documentation, except for some changes to use ConjugateGradient instead of SimplicialCholesky. (Both of these solvers work with SPD matrices.)

#include <Eigen/Sparse>
#include <bench/BenchTimer.h>
#include <cmath>     // M_PI
#include <iostream>
#include <vector>

using namespace Eigen;
using namespace std;

// Use RowMajor to make use of multi-threading
typedef SparseMatrix<double, RowMajor> SpMat;
typedef Triplet<double> T;

// Assemble sparse matrix from
// https://eigen.tuxfamily.org/dox/TutorialSparse_example_details.html
void insertCoefficient(int id, int i, int j, double w, vector<T>& coeffs,
                       VectorXd& b, const VectorXd& boundary)
{
  int n = int(boundary.size());
  int id1 = i + j*n;
  if      (i == -1 || i == n)  b(id) -= w * boundary(j);        // constrained coefficient
  else if (j == -1 || j == n)  b(id) -= w * boundary(i);        // constrained coefficient
  else                         coeffs.push_back(T(id, id1, w)); // unknown coefficient
}

void buildProblem(vector<T>& coefficients, VectorXd& b, int n)
{
  b.setZero();
  ArrayXd boundary = ArrayXd::LinSpaced(n, 0,M_PI).sin().pow(2);
  for(int j=0; j<n; ++j)
  {
    for(int i=0; i<n; ++i)
    {
      int id = i+j*n;
      insertCoefficient(id, i-1,j, -1, coefficients, b, boundary);
      insertCoefficient(id, i+1,j, -1, coefficients, b, boundary);
      insertCoefficient(id, i,j-1, -1, coefficients, b, boundary);
      insertCoefficient(id, i,j+1, -1, coefficients, b, boundary);
      insertCoefficient(id, i,j,    4, coefficients, b, boundary);
    }
  }
}

int main()
{
  int n = 300;  // size of the image
  int m = n*n;  // number of unknowns (=number of pixels)
  // Assembly:
  vector<T> coefficients;          // list of non-zero coefficients
  VectorXd b(m);                   // the right-hand-side vector resulting from the constraints
  buildProblem(coefficients, b, n);
  SpMat A(m,m);
  A.setFromTriplets(coefficients.begin(), coefficients.end());
  // Solving:
  // Use ConjugateGradient with Lower|Upper as the UpLo template parameter to make use of multi-threading
  BenchTimer t;
  t.reset(); t.start();
  ConjugateGradient<SpMat, Lower|Upper> solver(A);
  VectorXd x = solver.solve(b);         // solve for the given right-hand side
  t.stop();
  cout << "Real time: " << t.value(1) << endl; // 0=CPU_TIMER, 1=REAL_TIMER
  return 0;
}

Resulting output:

// No optimization, without OpenMP
g++ cg.cpp -O0 -I./eigen -o cg
./cg
Real time: 23.9473

// No optimization, with OpenMP
g++ cg.cpp -O0 -I./eigen -fopenmp -o cg
./cg
Real time: 17.6621

// -O3 optimization, without OpenMP
g++ cg.cpp -O3 -I./eigen -o cg
./cg
Real time: 0.924272

// -O3 optimization, with OpenMP
g++ cg.cpp -O3 -I./eigen -fopenmp -o cg
./cg
Real time: 1.04809
Leon
  • Try OpenMP with different thread counts, e.g. using omp_set_num_threads(4). Maybe memory is your bottleneck: starting 8 threads, they would fight for memory access and reduce performance. – Mahmoud Fayez May 06 '17 at 20:27

1 Answer


Your problem is too small to expect any benefit from multi-threading. Sparse matrices are expected to be at least one order of magnitude larger. Eigen's code should be adjusted to reduce the number of threads in such cases.

Moreover, I guess that you only have 4 physical cores, so running with OMP_NUM_THREADS=4 ./cg might help.
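
If you prefer to cap the thread count from code rather than the environment, a minimal sketch (Eigen::setNbThreads and Eigen::nbThreads are Eigen's API for limiting and querying its internal OpenMP parallelism):

#include <Eigen/Core>
#include <iostream>

int main()
{
  // Limit the number of threads Eigen's OpenMP-parallel kernels may use;
  // similar in effect to launching the program with OMP_NUM_THREADS=4.
  Eigen::setNbThreads(4);

  // Check how many threads Eigen will actually use.
  std::cout << "Eigen threads: " << Eigen::nbThreads() << std::endl;

  // ... assemble A and b, then run the ConjugateGradient solve as in the question ...
  return 0;
}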

ggael
  • If you mean 4 cores with hyperthreading, also use OMP_PLACES=cores (but this isn't implemented with g++ on Windows). – tim18 May 07 '17 at 12:04
  • To clarify, the example code in my question uses a 90000*90000 matrix. The line `int n = 300; // size of the image` may have been misleading; the actual number of unknowns is n*n (one per pixel). Is 90000*90000 still too small? – Leon May 07 '17 at 18:33
  • It may not be surprising that a task which runs for only 1 second, and has limited parallelism due to the sequential nature of CG, doesn't gain from parallelism. – tim18 May 08 '17 at 08:19
  • Right, `90000*90000` is large enough. Here, using your code, I get 0.75s without OpenMP and 0.45s with OpenMP. I did not disable turbo boost, so the sequential version runs at a higher clock rate. – ggael May 09 '17 at 09:25