
I've been using the ConjugateGradient solver in Eigen 3.2 and decided to try upgrading to Eigen 3.3.3 with the hope of benefiting from the new multi-threading features.

Sadly, the solver seems slower (~10%) when I enable -fopenmp with GCC 4.8.4. Looking at xosview, I see that all 8 CPUs are being used, yet performance is slower.

After some testing, I discovered that if I disable compiler optimization (use -O0 instead of -O3), then -fopenmp does speed up the solver by ~50%.

Of course, it's not really worth disabling optimization just to benefit from multi-threading, since that would be even slower overall.

Following advice from https://stackoverflow.com/a/42135567/7974125, I am storing the full sparse matrix and passing Lower|Upper as the UpLo parameter.

I've also tried each of the three preconditioners, as well as RowMajor matrices, to no avail (see the sketch below).
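
Roughly, the solver-type variations looked like this (just a sketch; I'm assuming the three preconditioners in question are the default DiagonalPreconditioner, IdentityPreconditioner and IncompleteCholesky):

#include <Eigen/Sparse>  // ConjugateGradient, IncompleteCholesky and the basic preconditioners

using namespace Eigen;

typedef SparseMatrix<double, RowMajor> SpMat;  // also tried the default ColMajor

int main()
{
  // Template parameters are <MatrixType, UpLo, Preconditioner>;
  // the full matrix is stored, so Lower|Upper is passed as UpLo.
  ConjugateGradient<SpMat, Lower|Upper> cg_diag;                              // default DiagonalPreconditioner
  ConjugateGradient<SpMat, Lower|Upper, IdentityPreconditioner> cg_none;      // no preconditioning
  ConjugateGradient<SpMat, Lower|Upper, IncompleteCholesky<double>> cg_ichol; // incomplete Cholesky
  return 0;
}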

Is there anything else to try to get the full benefits of both multi-threading and compiler optimization?

I cannot post my actual code, but this is a quick test using the Laplacian example from Eigen's documentation, except for some changes to use ConjugateGradient instead of SimplicialCholesky. (Both of these solvers work with SPD matrices.)

#include <Eigen/Sparse>
#include <bench/BenchTimer.h>
#include <cmath>     // M_PI
#include <iostream>
#include <vector>

using namespace Eigen;
using namespace std;

// Use RowMajor to make use of multi-threading
typedef SparseMatrix<double, RowMajor> SpMat;
typedef Triplet<double> T;

// Assemble sparse matrix from
// https://eigen.tuxfamily.org/dox/TutorialSparse_example_details.html
void insertCoefficient(int id, int i, int j, double w, vector<T>& coeffs,
                       VectorXd& b, const VectorXd& boundary)
{
  int n = int(boundary.size());
  int id1 = i + j*n;
  if      (i == -1 || i == n)  b(id) -= w * boundary(j);        // constrained coefficient
  else if (j == -1 || j == n)  b(id) -= w * boundary(i);        // constrained coefficient
  else                         coeffs.push_back(T(id, id1, w)); // unknown coefficient
}

void buildProblem(vector<T>& coefficients, VectorXd& b, int n)
{
  b.setZero();
  ArrayXd boundary = ArrayXd::LinSpaced(n, 0,M_PI).sin().pow(2);
  for(int j=0; j<n; ++j)
  {
    for(int i=0; i<n; ++i)
    {
      int id = i+j*n;
      insertCoefficient(id, i-1,j, -1, coefficients, b, boundary);
      insertCoefficient(id, i+1,j, -1, coefficients, b, boundary);
      insertCoefficient(id, i,j-1, -1, coefficients, b, boundary);
      insertCoefficient(id, i,j+1, -1, coefficients, b, boundary);
      insertCoefficient(id, i,j,    4, coefficients, b, boundary);
    }
  }
}

int main()
{
  int n = 300;  // size of the image
  int m = n*n;  // number of unknowns (=number of pixels)
  // Assembly:
  vector<T> coefficients;          // list of non-zero coefficients
  VectorXd b(m);                   // the right-hand-side vector resulting from the constraints
  buildProblem(coefficients, b, n);
  SpMat A(m,m);
  A.setFromTriplets(coefficients.begin(), coefficients.end());
  // Solving:
  // Use ConjugateGradient with Lower|Upper as the UpLo template parameter to make use of multi-threading
  BenchTimer t;
  t.reset(); t.start();
  ConjugateGradient<SpMat, Lower|Upper> solver(A);
  VectorXd x = solver.solve(b);         // solve for the given right-hand side
  t.stop();
  cout << "Real time: " << t.value(1) << endl; // 0=CPU_TIMER, 1=REAL_TIMER
  return 0;
}

Resulting output:

// No optimization, without OpenMP
g++ cg.cpp -O0 -I./eigen -o cg
./cg
Real time: 23.9473

// No optimization, with OpenMP
g++ cg.cpp -O0 -I./eigen -fopenmp -o cg
./cg
Real time: 17.6621

// -O3 optimization, without OpenMP
g++ cg.cpp -O3 -I./eigen -o cg
./cg
Real time: 0.924272

// -O3 optimization, with OpenMP
g++ cg.cpp -O3 -I./eigen -fopenmp -o cg
./cg
Real time: 1.04809
Leon
  • Try OpenMP with different thread counts, e.g. using omp_set_num_threads(4). Maybe memory is your bottleneck: starting 8 threads, they would fight for memory access and reduce performance. – Mahmoud Fayez May 06 '17 at 20:27

1 Answer


Your problem is too small to expect any benefit from multi-threading. Sparse matrices are expected to be at least one order of magnitude larger. Eigen's code should be adjusted to reduce the number of threads in such cases.

Moreover, I guess that you only have 4 physical cores, so running with OMP_NUM_THREADS=4 ./cg might help.
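
If you prefer to cap the thread count from code rather than the environment, a minimal sketch (Eigen::setNbThreads and Eigen::nbThreads are Eigen's API for limiting and querying its internal OpenMP parallelism):

#include <Eigen/Core>
#include <iostream>

int main()
{
  // Limit the number of threads Eigen's OpenMP-parallel kernels may use;
  // similar in effect to launching the program with OMP_NUM_THREADS=4.
  Eigen::setNbThreads(4);

  // Check how many threads Eigen will actually use.
  std::cout << "Eigen threads: " << Eigen::nbThreads() << std::endl;

  // ... assemble A and b, then run the ConjugateGradient solve as in the question ...
  return 0;
}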

ggael
  • If you mean 4 cores with hyperthreading, also use OMP_PLACES=cores (but this isn't implemented with g++ on Windows). – tim18 May 07 '17 at 12:04
  • To clarify, the example code in my question uses a 90000*90000 matrix. The line `int n = 300; // size of the image` may have been misleading; the actual number of unknowns is n*n (one per pixel). Is 90000*90000 still too small? – Leon May 07 '17 at 18:33
  • It may not be surprising that a task which runs for only 1 second, and has limited parallelism due to the sequential nature of CG, doesn't gain from parallelism. – tim18 May 08 '17 at 08:19
  • Right, `90000*90000` is large enough. Here, using your code, I get 0.75s without OpenMP and 0.45s with OpenMP. I did not disable turbo boost, so the sequential version runs at a higher clock rate. – ggael May 09 '17 at 09:25