
I want to use a parallel ConjugateGradient in Eigen 3.3.7 (gitlab) to solve Ax=b, but I found that the more threads I use, the higher the reported computational cost. I tested the code from this question and changed the matrix dimension from 90000 to 9000000. Here is the code (I named the file test-cg-parallel.cpp):

#include <cmath>        // M_PI
#include <ctime>        // clock(), CLOCKS_PER_SEC
#include <iostream>
#include <vector>
#include <Eigen/Dense>
#include <Eigen/Sparse>

using namespace std;
using namespace Eigen;

// Use RowMajor to make use of multi-threading
typedef SparseMatrix<double, RowMajor> SpMat;
typedef Triplet<double> T;

// Assemble sparse matrix from
// https://eigen.tuxfamily.org/dox/TutorialSparse_example_details.html
void insertCoefficient(int id, int i, int j, double w, vector<T>& coeffs,
                       VectorXd& b, const VectorXd& boundary)
{
  int n = int(boundary.size());
  int id1 = i+j*n;
  if(i==-1 || i==n) b(id) -= w * boundary(j); // constrained coefficient
  else  if(j==-1 || j==n) b(id) -= w * boundary(i); // constrained coefficient
  else  coeffs.push_back(T(id,id1,w));              // unknown coefficient
}

void buildProblem(vector<T>& coefficients, VectorXd& b, int n)
{
  b.setZero();
  ArrayXd boundary = ArrayXd::LinSpaced(n, 0,M_PI).sin().pow(2);
  for(int j=0; j<n; ++j)
    {
      for(int i=0; i<n; ++i)
        {
          int id = i+j*n;
          insertCoefficient(id, i-1,j, -1, coefficients, b, boundary);
          insertCoefficient(id, i+1,j, -1, coefficients, b, boundary);
          insertCoefficient(id, i,j-1, -1, coefficients, b, boundary);
          insertCoefficient(id, i,j+1, -1, coefficients, b, boundary);
          insertCoefficient(id, i,j,    4, coefficients, b, boundary);
        }
    }
}

int main()
{
  int n = 3000;  // size of the image
  int m = n*n;  // number of unknowns (=number of pixels)
  // Assembly:
  vector<T> coefficients;          // list of non-zero coefficients
  VectorXd b(m);                   // the right-hand-side vector resulting from the constraints
  buildProblem(coefficients, b, n);
  SpMat A(m,m);
  A.setFromTriplets(coefficients.begin(), coefficients.end());
  // Solving:
  // Use ConjugateGradient with Lower|Upper as the UpLo template parameter to make use of multi-threading
  clock_t time_start, time_end;
  time_start=clock();
  ConjugateGradient<SpMat, Lower|Upper> solver(A);
  VectorXd x = solver.solve(b);         // solve Ax = b for the given right-hand side

  time_end=clock();
   cout<<"time use:"<<1000*(time_end-time_start)/(double)CLOCKS_PER_SEC<<"ms"<<endl;
   return 0;
}

I compile the code with gcc 7.4.0 on an Intel Xeon E2186G CPU with 6 cores (12 threads); the compile and run details are as follows:

liu@liu-Precision-3630-Tower:~/test$ g++ test-cg-parallel.cpp -O3 -fopenmp -o cg
liu@liu-Precision-3630-Tower:~/test$ OMP_NUM_THREADS=1 ./cg

time use:747563ms

liu@liu-Precision-3630-Tower:~/test$ OMP_NUM_THREADS=4 ./cg

time use: 1.49821e+06ms

liu@liu-Precision-3630-Tower:~/test$ OMP_NUM_THREADS=8 ./cg

time use: 2.60207e+06ms

Can anyone give me some advice? Thanks a lot.

  • Your way of measuring time is not correct: `clock` seems to measure the CPU time accumulated by all threads here, not the wall-clock time. Prefer the OpenMP `omp_get_wtime` function or the C++ `std::chrono::steady_clock` (see the timing sketch after these comments). The actual wall-clock time decreases slightly with the number of threads. Note, however, that Eigen scales badly in this case. You could rather use [PLASMA](https://icl.utk.edu/plasma/overview/index.html), an alternative to LAPACK specifically designed to scale (as much as possible) on multi-core machines. – Jérôme Richard Mar 02 '20 at 09:25
  • @Jérôme Richard Thank you for your reply. I followed your suggestion and found that the efficiency increases slightly with the number of threads (from 1 to 6 threads). I wonder whether there is a way to improve the multi-threading efficiency of Eigen. I want to solve linear equations with a sparse matrix using iterative methods; is PLASMA applicable for me? – Richard LIU Mar 03 '20 at 02:47
  • Apparently, PLASMA does not support that. You can check the list of libraries that do support it [here](http://www.netlib.org/utk/people/JackDongarra/la-sw.html). I heard that [PETSc](https://www.mcs.anl.gov/petsc/) can do that, but I have never used it. Also, as far as I know, PETSc uses MPI internally, so it might be more difficult to use (although probably faster). You can start by looking [here](https://www.mcs.anl.gov/petsc/documentation/linearsolvertable.html) to see whether PETSc implements exactly what you want (a rough PETSc sketch also follows these comments). – Jérôme Richard Mar 06 '20 at 16:32
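To make the measurement issue from the first comment concrete, here is a minimal timing sketch. It is not part of the original question: the helper `timeSolve` is a name introduced only for illustration, it assumes a matrix `A` and right-hand side `b` assembled exactly as above, and it wraps the solve in both `std::chrono::steady_clock` and `omp_get_wtime` (the latter requires compiling with `-fopenmp`). It also prints `Eigen::nbThreads()` so you can check how many threads Eigen actually uses.

    // Sketch only: wall-clock timing of the solve, assuming A and b are
    // assembled exactly as in the question above.
    #include <chrono>
    #include <iostream>
    #include <omp.h>
    #include <Eigen/Sparse>

    using namespace Eigen;

    void timeSolve(const SparseMatrix<double, RowMajor>& A, const VectorXd& b)
    {
      auto   t0 = std::chrono::steady_clock::now();
      double w0 = omp_get_wtime();                       // OpenMP wall-clock timer

      ConjugateGradient<SparseMatrix<double, RowMajor>, Lower|Upper> solver(A);
      VectorXd x = solver.solve(b);

      double w1 = omp_get_wtime();
      auto   t1 = std::chrono::steady_clock::now();

      std::cout << "threads used by Eigen: " << nbThreads() << "\n"
                << "steady_clock : "
                << std::chrono::duration<double, std::milli>(t1 - t0).count() << " ms\n"
                << "omp_get_wtime: " << 1000.0 * (w1 - w0) << " ms\n";
    }

With `clock()` the reported number grows with the thread count because CPU time is summed over all threads, which matches the output shown in the question; the two wall-clock timers above should instead show the slight improvement mentioned in the comments.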
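Regarding the PETSc suggestion in the last comment, below is a rough, hypothetical sketch of what a conjugate-gradient solve looks like through PETSc's KSP interface, just to give an idea of the programming model; it is not taken from the discussion. For brevity it assembles a small 1D Laplacian instead of the 2D problem from the question, and it assumes a working PETSc installation (run the program with `mpirun` to use several processes).

    // Sketch only: CG solve of a small 1D Laplacian with PETSc's KSP interface.
    #include <petscksp.h>

    int main(int argc, char** argv)
    {
      PetscInitialize(&argc, &argv, NULL, NULL);

      PetscInt n = 1000;                 // size of the toy 1D Laplacian
      Mat A;
      Vec x, b;
      KSP ksp;

      // Assemble the tridiagonal matrix; each rank fills only the rows it owns.
      MatCreate(PETSC_COMM_WORLD, &A);
      MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
      MatSetFromOptions(A);
      MatSetUp(A);
      PetscInt rstart, rend;
      MatGetOwnershipRange(A, &rstart, &rend);
      for (PetscInt i = rstart; i < rend; ++i) {
        if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
        if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
        MatSetValue(A, i, i, 2.0, INSERT_VALUES);
      }
      MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
      MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

      // Right-hand side and solution vector.
      VecCreate(PETSC_COMM_WORLD, &b);
      VecSetSizes(b, PETSC_DECIDE, n);
      VecSetFromOptions(b);
      VecSet(b, 1.0);
      VecDuplicate(b, &x);

      // Conjugate-gradient solver; further options can be set on the command line.
      KSPCreate(PETSC_COMM_WORLD, &ksp);
      KSPSetOperators(ksp, A, A);
      KSPSetType(ksp, KSPCG);
      KSPSetFromOptions(ksp);
      KSPSolve(ksp, b, x);

      PetscInt its;
      KSPGetIterationNumber(ksp, &its);
      PetscPrintf(PETSC_COMM_WORLD, "CG converged in %d iterations\n", (int)its);

      KSPDestroy(&ksp);
      MatDestroy(&A);
      VecDestroy(&b);
      VecDestroy(&x);
      PetscFinalize();
      return 0;
    }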

0 Answers