
I have a sparse linear system Ax = b. In my application, A is a symmetric sparse matrix, typically around 2,500,000 x 2,500,000, with non-zeros on the main diagonal and on one other diagonal (plus its symmetric counterpart). This makes it 2-3 non-zeros per row/column.

To test my code, I am comparing MATLAB and Eigen on a 1,000,000 x 1,000,000 sparse matrix A. In MATLAB, I simply use x = A\b and it takes about 8 seconds. In Eigen, I have tried several solvers: SuperLU takes about 150 s, SimplicialCholesky about 300 s, and UmfPackLU about 490 s. These times are too long for me; on real data, the computation takes too long to be useful. Other solvers give completely different results than MATLAB, and iterative solvers took too long. SimplicialCholesky, SuperLU and UmfPackLU give similar results (they differ only in the decimal places), so I hope this counts as the same. Eigen code:

// prepare sparse matrix A
    typedef Eigen::Triplet<float> T;
    std::vector<T> tripletList; // I am leaving filling the triplet list out
    Eigen::SparseMatrix<float> A(k, k); // k is usually around 2500000, in the test case I described here it is 1000000
    A.setFromTriplets(tripletList.begin(), tripletList.end());
    A.makeCompressed();

// prepare vector b
    Eigen::Map<Eigen::VectorXf> b(b_data, k); // maps existing data (b_data is a float* of length k, filled with values)

// calculate A x = b and measure time - for SimplicialCholesky
    t1 = std::chrono::steady_clock::now();
    Eigen::SimplicialCholesky<Eigen::SparseMatrix<float>> solver_chol(A);
    Eigen::VectorXf x = solver_chol.solve(b);
    t2 = std::chrono::steady_clock::now();
    log_file << "SimplicialCholesky time: t2 - t1 = " << std::chrono::duration_cast<std::chrono::seconds>(t2 - t1).count() << " s \n";

// calculate A x = b and measure time - for SparseLU
    t1 = std::chrono::steady_clock::now();
    Eigen::SparseLU<Eigen::SparseMatrix<float>> solver_slu(A);
    x = solver_slu.solve(b);
    t2 = std::chrono::steady_clock::now();
    log_file << "SparseLU time: t2 - t1 = " << std::chrono::duration_cast<std::chrono::seconds>(t2 - t1).count() << " s \n";

// calculate A x = b and measure time - for UmfPackLU - here I had to convert to double.
    Eigen::SparseMatrix<double> Ad = A.cast<double>();
    Ad.makeCompressed();
    Eigen::VectorXd bd = b.cast<double>();
    t1 = std::chrono::steady_clock::now();
    Eigen::UmfPackLU<Eigen::SparseMatrix<double>> solver(Ad);
    Eigen::VectorXd xd = solver.solve(bd);
    t2 = std::chrono::steady_clock::now();
    log_file << "UmfPackLU time: t2 - t1 = " << std::chrono::duration_cast<std::chrono::seconds>(t2 - t1).count() << " s \n";

Perhaps I should mention that the calculation runs on all 8 cores, so when I measure the time, I get 8 values, which I sum up. Also, the calculation is (so far) wrapped in a .dll library built from a .cu file; it will be parallelized via CUDA in the next step. I measured the times for each method separately in order to avoid any overlap in the timing.

I have already tried some possible ways to speed up the calculation, such as setting the compiler to maximum optimizations, favoring speed.

Is there anything I can do to speed up the calculations using Eigen, so that it takes a similar time as MATLAB? Am I using the correct solver, given the size and sparsity of the matrix? Am I using the current solvers correctly? Do I need some additional setup, or to include some other libraries? If it is not possible, are there other libraries I could use?

I am working on Windows 10, 64bit machine. I have Visual Studio 2019. 

    [This Stack Overflow question](https://stackoverflow.com/q/6058139/5211833) is probably relevant. Long story short: MATLAB is rather expensive, because it pays a lot of engineers to optimise their library beyond what any other programming language can offer. I doubt you can beat MATLAB's speed on your own, using open-source libraries. I'd say your best bet is to somehow use a mathematical/computational trick which exploits the specific structure/pattern of your sparse matrix, such as symmetry, positive-definiteness etc. – Adriaan Sep 08 '20 at 12:42
    “compiler - maximum optimizations, favor speed“ — Try different optimization levels, sometimes -O2 is faster than -O3. Also turn on architecture-specific options, in GCC this is -march, and can make a big difference. – Cris Luengo Sep 08 '20 at 13:18
    “when I watch the time, I get 8 times, which I sum up” — This makes no sense. You’re basically multiplying your timing by 8? Do you do this for MATLAB too? Your code shows you use `steady_clock`, which measures wall time. This is the fair value to compare. – Cris Luengo Sep 08 '20 at 13:22
  • Thank you. My aim is not really to beat Matlab, I am more looking for a way to improve my solution in C++/CUDA. The times are way too slow and from what I have read, they could be faster than I currently have. I am a beginner and I feel I might miss something because of lack of experience. – Dalecanka Sep 08 '20 at 13:29
  • I'm sure CUDA is a good tool for this. But providing a full solution of a sparse solver is too much for here! Check cuSparse and its friends. – Ander Biguri Sep 08 '20 at 13:52
  • I talk about timing, because you say MATLAB is 20x as fast as your solution. But if you fudged timing, we’re actually only 2.5x off. 20x indicates you’ve picked the wrong algorithm. 2.5x is something related to optimizations, quality of implementation, etc. It’s a different problem altogether. – Cris Luengo Sep 08 '20 at 14:19
  • Also, if you plan to do a CUDA implementation, you should start there. Which algorithm is most efficient in a CUDA implementation might be unrelated to which is most efficient in a CPU implementation. – Cris Luengo Sep 08 '20 at 14:25
  • Your timings for C++ include the construction of the solver, if it does any copying, that will be expensive. How have you compiled it? You have not shown any multi-threading in C++ (I don't think you can achieve this through compiler flags alone), the Matlab execution engine will run these types of operations across multiple threads (maybe cores). Does the `::makeCompressed()` affect the performance? Are the linear solvers using the similar tolerances? – Mansoor Sep 08 '20 at 22:29
  • Your matrix is block-tridiagonal (with pure diagonal blocks), solving this "by hand" should be relatively straight-forward (and does not even require to store the matrix as a sparse matrix). If you use generic sparse solvers, they will probably do some crazy pivoting which introduces a lot of fill-in. Matlab probably somehow detects and exploits the special structure of the matrix. – chtz Sep 09 '20 at 01:06
  • It *might* already help a bit, if you use `SimplicialLDLT<SparseMatrix<float>, Lower, NaturalOrdering<> >`, but I did not try this. – chtz Sep 09 '20 at 01:12
  • There’s [a very simple algorithm](https://en.m.wikipedia.org/wiki/Tridiagonal_matrix_algorithm) to solve a system of equations represented as a tridiagonal matrix. – Cris Luengo Sep 12 '20 at 14:59
  • Dear all, thank you all for your comments. It took me some time to respond, sorry, but I went through all your suggestions. I ended up with two findings: solving via the CUDA solver is the fastest, but copying to and from the GPU took a lot of time, so other options were better. The best result I got was with the Eigen library and the ConjugateGradient solver, so I went with this. However, I think it still can be done much more efficiently using CUDA in some more clever way than I did. – Dalecanka May 25 '21 at 08:03

1 Answer


I have tried many linear solvers recently for my spectral collocation solver, and I found that Armadillo is the fastest one for solving dense Ax = b, based on the OpenBLAS library. Eigen 3.3 is very slow even with `setNbThreads`; I still can't find the reason. If you want to solve it with CUDA or OpenMP, I strongly suggest you use the Paralution library. It works fine for my problem. Regards

http://www.paralution.com/

ztdep
  • Please include a link to the Paralution library. I had never heard of it, and thought it was a typo! – Cris Luengo Sep 12 '20 at 14:52
  • “armadillo is the fast one that solves dense Ax=b based on openblas library” this is irrelevant: OP has a very sparse matrix, and could likely not fit a dense matrix of the same size in memory. – Cris Luengo Sep 12 '20 at 14:54
  • Hi, thank you all for your comments. I mentioned my final outcome in the comment in the main post. Regarding Paralution library, I struggled with implementing it in my code. :( – Dalecanka May 25 '21 at 08:09