
I am trying to increase the performance of a sparse matrix-vector product using OpenMP in C++. I store the sparse matrix in COO format, that is, a struct of 3 arrays with one entry per nonzero of the matrix: for each index I can find the row index, column index and value of that nonzero entry. In addition, I can specify the number of threads the function should use via

export OMP_NUM_THREADS=n

where n is the number of threads I would like to use. Currently, my code is as follows:

  void ompMatvec(const Vector& x, Vector& y) const {

    std::string envName = "OMP_NUM_THREADS";
    std::string thread_count = getenv(envName.c_str());
    int thread_count2 = atoi(thread_count.c_str());
    size_type k;
    #pragma omp parallel num_threads(thread_count2)
    #pragma omp for
      for (k = 0; k < arrayData.size(); ++k) {
        y(rowIndices[k]) += arrayData[k] * x(colIndices[k]);
      }
  }

However, when I measure the performance I find that my speed up is not too high. I am comparing the parallelized function above with:

  void matvec(const Vector& x, Vector& y) const {
    for (size_type k = 0; k < arrayData.size(); ++k) {
      y(rowIndices[k]) += arrayData[k] * x(colIndices[k]);
    }
  }

I would like to mention that I have created a Vector class with a member function .numRows() which essentially returns the length of the vector. I am also running the code on 4 cores. Is there an implementation change that could increase performance using the OpenMP API? Or is performance limited by the number of cores my program runs on?

Any and all recommendations are greatly appreciated. Thank you!

Update: An attempt to avoid the race condition above:

  void ompMatvec(const Vector& x, Vector& y) const {

    std::string envName = "OMP_NUM_THREADS";
    std::string thread_count = getenv(envName.c_str());
    int thread_count2 = atoi(thread_count.c_str());
    size_type k;
    #pragma omp parallel num_threads(thread_count2) \
    default(none) private(k)
    Vector y2(y.numRows());
    #pragma omp for
      for (k = 0; k < arrayData.size(); ++k) {
        y2(rowIndices[k]) += arrayData[k] * x(colIndices[k]);
      }
    #pragma omp critical
      for (k = 0; k < y.numRows(); ++k) {
        y(k) += y2(k);
      }
  }
  • Assuming that there are duplicates within `rowIndices`, the code is outright wrong (race condition). There are certain approaches to parallelizing the problem, but I believe it will depend vastly on the sparsity of the data and your system how they perform. For any performance question you should always give your specific performance observations ("speed up is not too high" is terribly vague), measurement methods, hardware specification, compiler versions and options as well as a proper [mcve]. – Zulan May 15 '17 at 08:47
  • @Zulan Okay then, I guess I will restart from the basics. Could you explain to me why there will be a race condition? – KangHoon You May 15 '17 at 09:06
  • `y(rowIndices[k]) +=` happens concurrently on multiple threads: http://stackoverflow.com/questions/34510/what-is-a-race-condition – Zulan May 15 '17 at 09:10
  • @Zulan I tried to introduce a private variable so that each thread can alter its own copy, and in the critical section their values are merged. Does this solve the race condition? – KangHoon You May 15 '17 at 09:18
  • Yes, this is a reduction and solves the race condition. – Zulan May 15 '17 at 09:58
  • Do you need to write your own code here? Intel's MKL is now gratis (https://software.intel.com/en-us/articles/free-mkl) and can handle many different sparse matrix formats: https://software.intel.com/en-us/node/471374 (FWIW, I work for Intel, but not on MKL) – Jim Cownie May 16 '17 at 08:54
  • Your last attempt is not parallel at all, as the scope of the `parallel` region is the structured block that follows it. In that case the structured block is the single statement `Vector y2(y.numRows());`. In other news, parsing `OMP_NUM_THREADS` is done wrong as the standard allows it to be a list of values. Use `omp_get_max_threads()` instead (see the sketch after these comments). – Hristo Iliev May 16 '17 at 13:03
  • Also, sparse matrix-vector multiplication is a memory-bound problem and has poor scalability on CPUs with small caches and narrow memory interfaces, which includes many mobile and low-end desktop CPUs. Such problems scale with the memory bandwidth, e.g., with the number of sockets given modern CPUs with their own memory channels. – Hristo Iliev May 16 '17 at 13:08
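
A minimal sketch of the `omp_get_max_threads()` approach suggested above, assuming the poster's Vector class and member arrays (note the race on y remains; the answer below addresses it):

  void ompMatvec(const Vector& x, Vector& y) const {
    // omp_get_max_threads() already honors OMP_NUM_THREADS, including
    // list values, so no getenv/atoi parsing is needed (requires <omp.h>).
    int nthreads = omp_get_max_threads();
    #pragma omp parallel for num_threads(nthreads)
    for (size_type k = 0; k < arrayData.size(); ++k) {
      // NOTE: still racy when rowIndices contains duplicates.
      y(rowIndices[k]) += arrayData[k] * x(colIndices[k]);
    }
  }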

1 Answer


As explained by @Zulan in his comments, your first implementation of ompMatvec will yield race conditions and simply compute garbage.

As far as I understand, there are two ways you can parallelize the COO matvec with OpenMP.
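
For reference, the snippets below assume a COO struct along these lines, reconstructed from the field accesses (the struct definition is not shown in the original answer; the column count n is an assumption):

/* Assumed layout, inferred from the uses of m, nnz, row_idx, col_idx
   and val below; not part of the original post. */
typedef struct {
  int m;          /* number of rows */
  int n;          /* number of columns (assumed) */
  int nnz;        /* number of stored nonzeros */
  int *row_idx;   /* row index of each nonzero */
  int *col_idx;   /* column index of each nonzero */
  double *val;    /* value of each nonzero */
} SparseMatrixCOO;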

The first one is to use an atomic operation to force synchronizations. A possible implementation of the COO matvec is then given by:

void dcoomv_atomic(SparseMatrixCOO *A, double *x, double *y) {
  for (int i=0; i<A->m; i++) y[i] = 0.;
  #pragma omp parallel for
  for (int i=0; i<A->nnz; i++) {
    #pragma omp atomic
    y[A->row_idx[i]] += A->val[i] * x[A->col_idx[i]];
  }
}

This, unlike your first implementation, will work. However, the cost of the synchronization is very significant: in my case, running this code with 16 threads gives no speedup but rather a 7X slowdown compared to the serial implementation. So this approach is useless in practice.
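
If you want to reproduce such measurements, a minimal wall-clock harness looks roughly like this (the repetition count and output format are illustrative, not from the original benchmark):

/* needs <stdio.h> and <omp.h>; assumes A, x, y are already set up */
double t0 = omp_get_wtime();
for (int rep = 0; rep < 100; rep++)
  dcoomv_atomic(A, x, y);            /* repeat to average out timer noise */
double t1 = omp_get_wtime();
printf("atomic matvec: %.3e s per call\n", (t1 - t0) / 100);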

Second, as mentioned by @Zulan in his comments, you can do a reduction. OpenMP 4.5+ supports array reduction for C/C++. An implementation of the COO matvec with array reduction is given as follows:

void dcoomv_builtin_array_reduction(SparseMatrixCOO *A, double *x, double *y) {
  for (int i=0; i<A->m; i++) y[i] = 0.;
  #pragma omp parallel for reduction(+:y[:A->m])
  for (int i=0; i<A->nnz; i++) {
    y[A->row_idx[i]] += A->val[i] * x[A->col_idx[i]];
  }
}

This code compiles fine. However, I get a segmentation fault during execution, so clearly I'm doing something wrong. If someone knows how I should actually proceed to use the built-in array reduction feature of OpenMP, please let me know. Note that I compiled with the -fopenmp flag using the Intel C compiler icc 2021.6.0, which does support OpenMP 4.5 and a subset of OpenMP 5.0.
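
One workaround sometimes suggested for reductions over pointer-typed array sections is to hoist the pointer and the section length into local variables before the pragma. Whether this fixes the crash is compiler-dependent, so treat it as an untested sketch:

void dcoomv_builtin_array_reduction_v2(SparseMatrixCOO *A, double *x, double *y) {
  int m = A->m, nnz = A->nnz;  /* hoist the section length out of the clause */
  double *yr = y;              /* local alias as the reduction target */
  for (int i = 0; i < m; i++) yr[i] = 0.;
  #pragma omp parallel for reduction(+:yr[:m])
  for (int i = 0; i < nnz; i++) {
    yr[A->row_idx[i]] += A->val[i] * x[A->col_idx[i]];
  }
}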

Since I struggled to make OpenMP's built-in array reduction work, I implemented array reduction from scratch as follows:

void dcoomv_array_reduction_from_scratch(SparseMatrixCOO *A, double *x, double *y) {
  for (int i=0; i<A->m; i++) y[i] = 0.;
  double *YP;
  #pragma omp parallel
  {
    int P = omp_get_num_threads();
    int p = omp_get_thread_num();
    #pragma omp single
    {
      /* one private slice of y per thread; the implicit barrier after
         single guarantees YP is visible to all threads */
      YP = (double*)mkl_malloc(A->m * P * sizeof(double), sizeof(double));
      for (int i=0; i<A->m*P; i++) YP[i] = 0.;
    }
    /* each thread accumulates into its own slice, so no races */
    #pragma omp for
    for (int i=0; i<A->nnz; i++) {
      YP[p * A->m + A->row_idx[i]] += A->val[i] * x[A->col_idx[i]];
    }
    /* merge the per-thread slices back into y */
    #pragma omp for
    for (int i=0; i<A->m; i++) {
      for (int q=0; q<P; q++) {
        y[i] += YP[A->m * q + i];
      }
    }
  }
  mkl_free(YP);
}

This code does work. However, it is also inefficient: with 16 threads, instead of a speedup over the serial implementation, I get about a 3X slowdown. Presumably the extra P*m scratch buffer and the final merge loop just add memory traffic to what is, as @Hristo Iliev notes in the comments, an already memory-bound kernel.

Like you, I found this question interesting because, judging from my own benchmarking experience, it seemed as if the MKL function mkl_sparse_d_mv was multithreaded for the CSR and BSR storage formats, but not for COO and CSC.
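
For context, mkl_sparse_d_mv belongs to MKL's inspector-executor API, and the call sequence for a COO matrix looks roughly like this (error checking omitted; parameters other than the handle are illustrative):

/* needs <mkl_spblas.h>; assumes LP64 builds where MKL_INT == int */
sparse_matrix_t M;
mkl_sparse_d_create_coo(&M, SPARSE_INDEX_BASE_ZERO,
                        A->m, A->n, A->nnz,
                        A->row_idx, A->col_idx, A->val);
struct matrix_descr descr = { .type = SPARSE_MATRIX_TYPE_GENERAL };
/* y = 1.0 * M * x + 0.0 * y */
mkl_sparse_d_mv(SPARSE_OPERATION_NON_TRANSPOSE, 1.0, M, descr, x, 0.0, y);
mkl_sparse_d_destroy(M);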

However, after having played around enough, I understand that multithreading the COO (and CSC) matvec is not straightforward, or at least does not easily translate into speedups, even at the cost of extra memory allocations. Perhaps this is why the MKL developers decided not to parallelize mkl_sparse_d_mv for the COO and CSC sparse matrix formats.