I am trying to use Eigen to efficiently assemble the stiffness matrix for non-linear finite element computations.
From my finite element discretization I can extract my sparsity pattern exactly. Therefore I can just use:
mat.reserve(nnz);
mat.setFromTriplets(TripletList.begin(), TripletList.end());
as proposed in http://eigen.tuxfamily.org/dox/group__SparseQuickRefPage.html.
My questions that arise here are:

1. Due to the non-linear nature of the problem I have to refill my matrix very often. Should I therefore store all contributions in a triplet list again and reuse mat.setFromTriplets(...) again and again?
2. If I reuse mat.setFromTriplets(...), can I somehow exploit the fact that I always evaluate my element matrices for the assembly in the same order, so that the indices in the triplet list never change and only the values do? The "search in memory" could then be circumvented, since I could store the place where each value has to go in an extra array.
3. If mat.coeffRef(i,j) is faster, can I maybe exploit the aforementioned fact? (A sketch of the coeffRef variant I mean follows after the setup code below.)
4. One extra question (lower priority): Is it possible to store and assemble 3 matrices with the same sparsity pattern efficiently, i.e. if I have to do this in a loop? For example a matrix wrapper where I get the matrices as M1 = mat[0], M2 = mat[1], M3 = mat[2], where mat[i] returns the i-th matrix and M1, M2 and M3 are e.g.

   SparseMatrix<double> M1(1000,1000)

   A rough sketch of this follows right below.
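What I picture for this wrapper is roughly the following (just an illustration with made-up names, nothing I have implemented):

#include <array>
#include <cstddef>
#include <Eigen/Sparse>

// hypothetical wrapper: three matrices of the same size that all get their
// sparsity pattern from the same triplet list and are accessed as mat[0], mat[1], mat[2]
struct MatrixTriple {
    std::array<Eigen::SparseMatrix<double>, 3> mats;
    MatrixTriple(int rows, int cols)
        : mats{{Eigen::SparseMatrix<double>(rows, cols),
                Eigen::SparseMatrix<double>(rows, cols),
                Eigen::SparseMatrix<double>(rows, cols)}} {}
    Eigen::SparseMatrix<double>& operator[](std::size_t i) { return mats[i]; }
};

M1, M2 and M3 would then simply be references obtained via mat[0], mat[1], mat[2].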
The general setup is the following (for questions 1-3 only M1 appears):
std::vector< Eigen::Triplet<double> > tripletListA; // triplets differ only in the values and not in the indices
std::vector< Eigen::Triplet<double> > tripletListB;
std::vector< Eigen::Triplet<double> > tripletListC;
SparseMatrix<double> M1(1000,1000);
SparseMatrix<double> M2(1000,1000);
SparseMatrix<double> M3(1000,1000);
//Reserve space in triplets
tripletListA.reserve(nnz);
tripletListB.reserve(nnz);
tripletListC.reserve(nnz);
//Reserve space in matrices
M1.reserve(nnz);
M2.reserve(nnz);
M3.reserve(nnz);
//fill the triplet lists with zeros (only the indices matter here) and set the matrices once to allocate the sparsity pattern
M1.setFromTriplets(tripletListA.begin(), tripletListA.end());
M2.setFromTriplets(tripletListB.begin(), tripletListB.end());
M3.setFromTriplets(tripletListC.begin(), tripletListC.end());
for (int i=0; i<1000; i++) {
//Fill triplets
M1.setFromTriplets(tripletListA.begin(), tripletListA.end()); //or use coeffRef?
M2.setFromTriplets(tripletListB.begin(), tripletListB.end());
M3.setFromTriplets(tripletListC.begin(), tripletListC.end());
//solve
//update
}
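For question 3, this is roughly how I picture the coeffRef variant (the element routine and the dof indices here are placeholders, not my actual FE code):

#include <array>
#include <Eigen/Dense>
#include <Eigen/Sparse>

// hypothetical scatter of a 2x2 element matrix into M via coeffRef;
// every coeffRef call performs a binary search within the target inner vector
void addElement(Eigen::SparseMatrix<double>& M,
                const std::array<int, 2>& dofs,
                const Eigen::Matrix2d& ke) {
    for (int a = 0; a < 2; ++a)
        for (int b = 0; b < 2; ++b)
            M.coeffRef(dofs[a], dofs[b]) += ke(a, b);
}

Since the sparsity pattern is already allocated by the initial setFromTriplets call, the coeffRef calls should only have to search for existing entries and never insert new ones.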
Thank you and regards, Alex
UPDATE:
Thank you for your answers. Initially the order in which I access the nonzeros is quite arbitrary. But since I am interested in an iterative scheme, I am thinking about documenting this random ordering and constructing an operator which takes care of it. This operator can be constructed (at least in my mind) from the initially built triplet list.
SparseMatrix<double> mat(rows,cols);
std::vector<double> valuevector(nnz);
//Initially construction
std::vector< Eigen::Triplet<double> > tripletList;
//naive fill of tripletList
//Sorting of entries and identifying double entries in tripletList from col and row values
//generating from this information operator P
for (int i=0; i<1000; i++)
{
//naive refill of tripletList
valuevector = P*tripletList.value(); //constructing the vector in the efficient ordering from the values of the triplets (the tripletList.value() call does not make sense for a std::vector, but I hope it is clear what I have in mind)
int idx = 0; //running index into valuevector
for (int k = 0; k < mat.outerSize(); ++k)
    for (SparseMatrix<double>::InnerIterator it(mat, k); it; ++it)
        it.valueRef() = valuevector[idx++]; //write the values in the storage order of mat
}
I think of the operator P just as a matrix with ones and zeros at the appropriate places. The question remains whether this is actually a more efficient procedure. A sketch of how I would build P follows below.
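Concretely, I would build P from the documented target nonzero of each triplet roughly like this (a sketch; targetNonzero is a made-up name for the mapping from triplet index to nonzero index):

#include <vector>
#include <Eigen/Sparse>

// P has one row per nonzero of mat and one column per triplet entry;
// P(k, l) = 1 if triplet l writes to nonzero k, so duplicate triplets are
// summed automatically in the product P * tripletValues
Eigen::SparseMatrix<double> buildP(int nnz, const std::vector<int>& targetNonzero) {
    std::vector<Eigen::Triplet<double>> ones;
    ones.reserve(targetNonzero.size());
    for (int l = 0; l < static_cast<int>(targetNonzero.size()); ++l)
        ones.emplace_back(targetNonzero[l], l, 1.0);
    Eigen::SparseMatrix<double> P(nnz, static_cast<int>(targetNonzero.size()));
    P.setFromTriplets(ones.begin(), ones.end());
    return P;
}

With the triplet values stored in a VectorXd in the naive fill order, valuevector = P * tripletValues would then give the values already in the storage order of mat, ready to be copied into mat.coeffs().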
UPDATE-2: Benchmark:
I tried to turn my ideas into a code snippet. I first generate a random triplet list. This list is constructed to get a sparsity of 95%, and additionally some values in the list are duplicated to mimic duplicate triplet entries which write to the same position in the sparse matrix. These values are then inserted based on different concepts. The first one is the setFromTriplets approach; the second and third try to exploit the known structure. The second and third approach document the ordering of the triplet list. This information is then exploited to write the values directly into the raw mat1.coeffs() vector.
#include <iostream>
#include <Eigen/Sparse>
#include <random>
#include <fstream>
#include <chrono>
using namespace std::chrono;
using namespace Eigen;
using namespace std;
typedef Eigen::Triplet<double> T;
void findDuplicates(vector<pair<int, int> > &dummypair, Ref<VectorXi> multiplicity) {
    // count, for each distinct (col,row) pair in the sorted list, how many duplicates follow it
    int pairCount = 0;
    pair<int, int> currentPair;
    for (int i = 0; i < multiplicity.size(); ++i) {
        currentPair = dummypair[pairCount];
        while (pairCount + multiplicity[i] < static_cast<int>(dummypair.size()) &&
               currentPair == dummypair[pairCount + multiplicity[i]]) {
            multiplicity[i]++;
        }
        pairCount += multiplicity[i];
    }
}
typedef Matrix<duration<double, std::milli>, Dynamic, Dynamic> MatrixXtime;
int main() {
//init random generators
std::default_random_engine gen;
std::uniform_real_distribution<double> dist(0.0, 1.0);
int sizesForTest = 5;
int measures = 6;
MatrixXtime timeArray(sizesForTest, measures);
cout << "TripletTime NestetTime LNestedTime " << endl;
for (int m = 0; m < sizesForTest; ++m) {
int rows = pow(10, m + 1);
int cols = rows;
std::uniform_int_distribution<int> distentryrow(0, rows - 1);
std::uniform_int_distribution<int> distentrycol(0, cols - 1);
std::vector<T> tripletList;
SparseMatrix<double> mat1(rows, cols);
// SparseMatrix<double> mat2(rows,cols);
// SparseMatrix<double> mat3(rows,cols);
//generate sparsity pattern of the matrix with ~5% fill-in (95% sparsity)
tripletList.emplace_back(3, 0, 15); //seed one entry at (3,0); a second entry at (3,0) is added after the loop
for (int i = 0; i < rows; ++i)
for (int j = 0; j < cols; ++j) {
auto value = dist(gen); //generate random number
auto value2 = dist(gen); //generate random number
auto value3 = dist(gen); //generate random number
if (value < 0.05) {
auto rowindex = distentryrow(gen);
auto colindex = distentrycol(gen);
tripletList.emplace_back(rowindex, colindex, value); //if below the threshold, insert it
//duplicate roughly every third entry to mimic entries which appear more than once
if (value2 < 0.3333333333333333333333)
    tripletList.emplace_back(rowindex, colindex, value);
//triple roughly every fourth entry to mimic entries which appear more than once
if (value3 < 0.25)
    tripletList.emplace_back(rowindex, colindex, value);
}
}
tripletList.emplace_back(3, 0, 9); //second entry at (3,0), a guaranteed duplicate
int numberOfValues = tripletList.size();
//initially set all matrices from triplet to allocate space and sparsity pattern
mat1.setFromTriplets(tripletList.begin(), tripletList.end());
// mat2.setFromTriplets(tripletList.begin(), tripletList.end());
// mat3.setFromTriplets(tripletList.begin(), tripletList.end());
int nnz = mat1.nonZeros();
//reset all entries back to zero to fill in later
mat1.coeffs().setZero();
// mat2.coeffs().setZero();
// mat3.coeffs().setZero();
//document the sorting of the entries for repeated insertion
VectorXi internalIndex(numberOfValues);
vector<pair<int, int> > dummypair(numberOfValues);
VectorXd valuelist(numberOfValues);
for (int l = 0; l < numberOfValues; ++l) {
valuelist(l) = tripletList[l].value();
}
//init internalindex and dummy pair
internalIndex = Eigen::VectorXi::LinSpaced(numberOfValues, 0.0, numberOfValues - 1);
for (int i = 0; i < numberOfValues; ++i) {
dummypair[i].first = tripletList[i].col();
dummypair[i].second = tripletList[i].row();
}
auto start = high_resolution_clock::now();
// sort the vector internalIndex based on the dummypair
sort(internalIndex.begin(), internalIndex.end(), [&](int i, int j) {
return dummypair[i].first < dummypair[j].first ||
(dummypair[i].first == dummypair[j].first && dummypair[i].second < dummypair[j].second);
});
auto stop = high_resolution_clock::now();
timeArray(m, 3) = (stop - start) / 1000;
start = high_resolution_clock::now();
sort(dummypair.begin(), dummypair.end());
stop = high_resolution_clock::now();
timeArray(m, 4) = (stop - start) / 1000;
start = high_resolution_clock::now();
VectorXi dublicatecount(nnz);
dublicatecount.setOnes();
findDuplicates(dummypair, dublicatecount);
stop = high_resolution_clock::now();
timeArray(m, 5) = (stop - start) / 1000;
dummypair.clear();
//calculate, for each nonzero, the list of triplet indices writing to it
//i.e. listofentires[k] is the VectorXi containing the indices of the triplets whose values should be added to nonzero k
int indextriplet = 0;
int multiplicity = 0;
vector<VectorXi> listofentires(mat1.nonZeros());
for (int k = 0; k < mat1.nonZeros(); ++k) {
multiplicity = dublicatecount[k];
listofentires[k] = internalIndex.segment(indextriplet, multiplicity);
indextriplet += multiplicity;
}
//========================================
//Here the nonlinear analysis would start; everything beforehand is preprocessing
//Test1 from triplets
start = high_resolution_clock::now();
mat1.setFromTriplets(tripletList.begin(), tripletList.end());
stop = high_resolution_clock::now();
timeArray(m, 0) = (stop - start) / 1000;
mat1.coeffs().setZero();
//Test2 use internalIndex but calculate listofentires on the fly
indextriplet = 0;
start = high_resolution_clock::now();
for (int k = 0; k < mat1.nonZeros(); ++k) {
multiplicity = dublicatecount[k];
mat1.coeffs()[k] += valuelist(internalIndex.segment(indextriplet, multiplicity)).sum();
indextriplet += multiplicity;
}
stop = high_resolution_clock::now();
timeArray(m, 1) = (stop - start) / 1000;
mat1.coeffs().setZero();
//Test3 directly use listofentires
start = high_resolution_clock::now();
for (int k = 0; k < mat1.nonZeros(); ++k)
mat1.coeffs()[k] += valuelist(listofentires[k]).sum();
stop = high_resolution_clock::now();
timeArray(m, 2) = (stop - start) / 1000;
std::ofstream file("test.txt");
if (file.is_open()) {
file << mat1 << '\n';
}
cout << "Size: " << rows << ": ";
for (int n = 0; n < measures; ++n)
cout << timeArray(m, n).count() << " ";
cout << endl;
}
return 0;
}
If I run this example on my i5-6600K 3.5 GHz with 16 GB RAM, I end up with the following results, which are the times in seconds.
Size     Triplet   Nested     LessNested  Sort_intIndex  Sort_dum_pair  findDuplicates
10       1e-06     1e-06      2e-06       1e-06          1e-06          1e-06
100      2.8e-05   4e-06      1.4e-05     5e-05          4.2e-05        1e-05
1000     0.003     0.000416   0.001489    0.01012        0.00627        0.000635
10000    0.426     0.093911   0.48912     1.5389         0.780676       0.061881
100000   337.799   99.0801    37.3656     292.397        87.4488        0.79996
The first three columns give the computation times of the different approaches, and columns 4 to 6 give the times of the different preprocessing steps.
For 100000 rows and columns my RAM fills up relatively fast, so the last table row should be taken with care; here the fastest method changes from 2 to 3.
My questions here are: Is this approach going in the right direction to improve the efficiency, or is it a completely wrong direction, since e.g. for a size of 10000 an assembly time of 0.48 s seems a bit high?
Additionally, the preprocessing steps get expensive very fast; is there a better way to construct the ordering of the matrix? And as a last question, is the benchmarking done in the correct way?
Thanks for your time, Alex