
I have been searching the web and cannot seem to find the answer I'm looking for. I have a particular problem.

I'm editing this to simplify the problem, in the hope that it is more readable and understandable.

Let's say I have 5000 20x20 symmetric, dense matrices. I would like to create a kernel in CUDA that will have each thread responsible for calculating the eigenvalues of one of the symmetric matrices.

Sample code for the CUDA kernel would be great, if possible.

Any and all help/suggestions would be appreciated!

Thanks,

Johnathan

  • Jacobi is iterative, too. Perhaps the issue is your initial guess. If your guess is good, you'll need fewer iterations to converge to a solution. Are you only interested in eigenvalues, or do you need eigenvectors? – duffymo Jun 30 '16 at 00:47
  • I only need the eigenvalues. There are papers written about performing Jacobi in parallel. That is what I was referring to above. Sorry for the confusion. To better explain my situation, I perform other calculations beforehand which deal with getting certain features of each point. I then calculate eigenvalues based on these features. The iterative part is only to calculate those eigenvalues. I hope this helps; let me know! – Johnathan Jun 30 '16 at 00:49
  • I'm not sure I understand what "points" means. Is that a synonym for the number of rows/columns in the square matrix? Bigger matrices mean more calculations - nothing will change that. – duffymo Jun 30 '16 at 00:51
  • By points, I mean I am working on a 500 x 3 matrix: 500 is the number of points (rows) and 3 is the dimension (columns). Essentially, I want to scale it up to a >100,000 x 3 matrix or more. – Johnathan Jun 30 '16 at 00:53
  • Your comment confuses me. All the matrices I've ever calculated eigenvalues for are from physics; all were square. I would read this: https://math.stackexchange.com/questions/583938/do-non-square-matrices-have-eigenvalues. You might need a better algorithm suited for non-square matrices. Think SVD. – duffymo Jun 30 '16 at 00:56
  • Here's a chapter on SVD: https://www.math.washington.edu/~greenbau/Math_554/Course_Notes/ch1.5.pdf – duffymo Jun 30 '16 at 00:57
  • To better explain my earlier statement, the features I find help me generate a symmetric matrix for each point. I calculate the eigenvalues of that symmetric matrix for each point. I currently have a function from cuSolver (cusolverDnSgesvd) that calculates the SVD for me. It would be great if I could move that to the GPU, but it is a host-only function, though it does launch kernels under the hood. I hope I explained that alright. – Johnathan Jun 30 '16 at 01:16
  • Nope, the more you explain the murkier it is. I can't help you. Good luck. – duffymo Jun 30 '16 at 01:31
  • To anyone else, let's say I have a 500 x 3 matrix of points. This means it is 500 3-D points. Each of the 500 points has a symmetric NxN matrix associated with them. I would like help with a CUDA kernel that will have each thread (500 total GPU threads in this case) find the N eigenvalues for each of the 500 points in parallel. Any help is appreciated! – Johnathan Jun 30 '16 at 02:50
  • @Johnathan: Please edit all this into your question. As written, it is long and rambling and very difficult to follow. And it would probably also help if you gave an idea of the value of N – talonmies Jun 30 '16 at 05:43
  • It would be fairly easy to modify the code presented [here](http://people.sc.fsu.edu/~jburkardt/cpp_src/jacobi_eigenvalue/jacobi_eigenvalue.html) to run in a thread-parallel way on a CUDA GPU. – Robert Crovella Jul 01 '16 at 07:34
  • Thank you Robert. I will try that and respond later if I can successfully convert it. – Johnathan Jul 03 '16 at 01:01

1 Answer


> I would like to create a kernel in CUDA that will have each thread responsible for calculating the eigenvalues of one of the symmetric matrices.

It's questionable to me whether this would be the fastest approach, but it might be for very small matrices. Even in this situation, there might be some data storage optimizations that could be made (interleaving global data across threads), but this would complicate things.

As stated, that request maps onto an "embarrassingly parallel" algorithm, where each thread works on a fully independent problem. We need only find suitable single-threaded "donor code". After a quick Google search, I came across the jacobi_eigenvalue code linked in the source comments below. It is fairly straightforward to modify that code to run in this thread-independent way: we need only borrow 3 routines (jacobi_eigenvalue, r8mat_diag_get_vector and r8mat_identity) and suitably decorate them with __host__ __device__ for use on the GPU, while making no other changes.

The code in question appears to be GNU LGPL licensed by J. Burkardt at Florida State University. With this in view, and following conventional wisdom, I have not included any significant amount of that code in this answer, but you should be able to reconstruct my results using the instructions given here.

NOTE: I'm not sure what the legal ramifications are of using this code, which claims to be GNU LGPL licensed. You should be sure to adhere to any necessary requirements if you elect to use this code or portions of it. My primary purpose in using it here is to demonstrate the concept of a relatively trivial "embarrassingly parallel" extension of a single-threaded solver.

It should be trivial to reconstruct my full code by following the link given in the code comments below and copy-pasting the 3 indicated functions into the places indicated in the remaining code skeleton. But this doesn't change any of the previously mentioned notices/disclaimers. Use it at your own risk.

Again, making no other changes might not be the best idea from a performance standpoint, but it requires a trivial amount of effort and can serve as a possibly useful starting point. Some possible optimizations:

  1. seek out a data interleaving strategy so that adjacent threads are more likely to be reading adjacent data
  2. eliminate the new and delete operations from the thread code and replace them with a fixed allocation (this is easy to do; see the sketch after this list)
  3. remove unnecessary code, for example the code that computes and sorts the eigenvectors, if that data is unneeded
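
To illustrate optimization 2 (a minimal sketch, not a drop-in replacement): the donor routine allocates two scratch vectors (bw and zw) with in-kernel new and frees them with delete at the end. Assuming a compile-time upper bound on the matrix dimension (MAX_N below is a hypothetical constant, not part of the original code), those allocations can become fixed local arrays:

const int MAX_N = 20;  // hypothetical compile-time bound; requires n <= MAX_N

__host__ __device__
void jacobi_eigenvalue ( int n, double a[], int it_max, double v[],
  double d[], int &it_num, int &rot_num )
{
  double bw[MAX_N];  // replaces: bw = new double[n];
  double zw[MAX_N];  // replaces: zw = new double[n];
  // ... rest of the routine unchanged, with the trailing
  // "delete [] bw; delete [] zw;" lines removed ...
}

This avoids in-kernel heap traffic entirely and also removes the need for the device heap sizing done in main below.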

In any event, with the above decorated donor code, we need only wrap a trivial kernel (je) around it to launch each thread operating on a separate data set (i.e. matrix); each thread produces its own set of eigenvalues (and eigenvectors, for this particular code base).

I've crafted it to work with just 3 threads and 3 4x4 matrices for test purposes, but it should be trivial to extend it to however many matrices/threads you wish.

For brevity of presentation, I've dispensed with the usual CUDA error checking, but I recommend you use it, or at least run your code with cuda-memcheck, if you make any modifications.
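
If you add error checking, one common pattern is a macro around cudaGetLastError (a sketch; adapt to taste):

#define cudaCheckErrors(msg) \
  do { \
    cudaError_t __err = cudaGetLastError(); \
    if (__err != cudaSuccess) { \
      fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
          msg, cudaGetErrorString(__err), __FILE__, __LINE__); \
      exit(1); \
    } \
  } while (0)

You would then call, for example, cudaCheckErrors("kernel launch failed"); immediately after the kernel launch, and similarly after each cudaMalloc/cudaMemcpy call.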

I've also built the code to adjust the device heap size upward to accommodate the in-kernel new operations, depending on the number of matrices (i.e. threads) and the matrix dimensions. If you implemented the 2nd optimization mentioned above, you could probably remove this.

t1177.cu:

#include <stdio.h>
#include <iostream>
const int num_mat = 3; // total number of matrices = total number of threads
const int N = 4;   // square symmetric matrix dimension
const int nTPB = 256;  // threads per block

// test symmetric matrices

  double a1[N*N] = {
      4.0,  -30.0,    60.0,   -35.0, 
    -30.0,  300.0,  -675.0,   420.0, 
     60.0, -675.0,  1620.0, -1050.0, 
    -35.0,  420.0, -1050.0,   700.0 };

  double a2[N*N] = {
    4.0, 0.0, 0.0, 0.0, 
    0.0, 1.0, 0.0, 0.0, 
    0.0, 0.0, 3.0, 0.0, 
    0.0, 0.0, 0.0, 2.0 };

  double a3[N*N] = {
    -2.0,   1.0,   0.0,   0.0,
     1.0,  -2.0,   1.0,   0.0,
     0.0,   1.0,  -2.0,   1.0,
     0.0,   0.0,   1.0,  -2.0 }; 


/* ---------------------------------------------------------------- */
//
// the following functions come from here:
//
// https://people.sc.fsu.edu/~jburkardt/cpp_src/jacobi_eigenvalue/jacobi_eigenvalue.cpp
//
// attributed to j. burkardt, FSU
// they are unmodified except to add __host__ __device__ decorations
//
//****************************************************************************80
__host__ __device__
void r8mat_diag_get_vector ( int n, double a[], double v[] )
/* PASTE IN THE CODE HERE, FROM THE ABOVE LINK, FOR THIS FUNCTION */
//****************************************************************************80
__host__ __device__
void r8mat_identity ( int n, double a[] )
/* PASTE IN THE CODE HERE, FROM THE ABOVE LINK, FOR THIS FUNCTION */
//****************************************************************************80
__host__ __device__
void jacobi_eigenvalue ( int n, double a[], int it_max, double v[], 
  double d[], int &it_num, int &rot_num )
/* PASTE IN THE CODE HERE, FROM THE ABOVE LINK, FOR THIS FUNCTION */

// end of FSU code
/* ---------------------------------------------------------------- */

__global__ void je(int num_matr, int n, double *a, int it_max, double *v, double *d){

  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  int it_num;
  int rot_num;
  if (idx < num_matr){
    jacobi_eigenvalue(n, a+(idx*n*n), it_max, v+(idx*n*n), d+(idx*n), it_num, rot_num);
  }
}

void initialize_matrix(int mat_id, int n, double *mat, double *v){

  for (int i = 0; i < n*n; i++) *(v+(mat_id*n*n)+i) = mat[i];
}

void print_vec(int vec_id, int n, double *d){

  std::cout << "matrix " << vec_id << " eigenvalues: " << std::endl;
  for (int i = 0; i < n; i++) std::cout << i << ": " << *(d+(n*vec_id)+i) << std::endl;
  std::cout << std::endl;
}
int main(){
// make sure device heap has enough space for in-kernel new allocations
  const int heapsize = num_mat*N*sizeof(double)*2;
  const int chunks = heapsize/(8192*1024) + 1;
  cudaError_t cudaStatus = cudaDeviceSetLimit(cudaLimitMallocHeapSize, (8192*1024) * chunks);
  if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "set device heap limit failed!");
  }
  const int max_iter = 1000;
  double *h_a, *d_a, *h_v, *d_v, *h_d, *d_d;
  h_a = (double *)malloc(num_mat*N*N*sizeof(double));
  h_v = (double *)malloc(num_mat*N*N*sizeof(double));
  h_d = (double *)malloc(num_mat*  N*sizeof(double));
  cudaMalloc(&d_a, num_mat*N*N*sizeof(double));
  cudaMalloc(&d_v, num_mat*N*N*sizeof(double));
  cudaMalloc(&d_d, num_mat*  N*sizeof(double));
  memset(h_a, 0, num_mat*N*N*sizeof(double));
  memset(h_v, 0, num_mat*N*N*sizeof(double));
  memset(h_d, 0, num_mat*  N*sizeof(double));
  initialize_matrix(0, N, a1, h_a);
  initialize_matrix(1, N, a2, h_a);
  initialize_matrix(2, N, a3, h_a);
  cudaMemcpy(d_a, h_a, num_mat*N*N*sizeof(double), cudaMemcpyHostToDevice);
  cudaMemcpy(d_v, h_v, num_mat*N*N*sizeof(double), cudaMemcpyHostToDevice);
  cudaMemcpy(d_d, h_d, num_mat*  N*sizeof(double), cudaMemcpyHostToDevice);
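  // launch one thread per matrix; the grid is sized to cover num_mat threads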
  je<<<(num_mat+nTPB-1)/nTPB, nTPB>>>(num_mat, N, d_a, max_iter, d_v, d_d);
  cudaMemcpy(h_d, d_d, num_mat*N*sizeof(double), cudaMemcpyDeviceToHost);
  print_vec(0, N, h_d);
  print_vec(1, N, h_d);
  print_vec(2, N, h_d);
  return 0;
}

compile and sample run:

$ nvcc -o t1177 t1177.cu
$ cuda-memcheck ./t1177
========= CUDA-MEMCHECK
matrix 0 eigenvalues:
0: 0.166643
1: 1.47805
2: 37.1015
3: 2585.25

matrix 1 eigenvalues:
0: 1
1: 2
2: 3
3: 4

matrix 2 eigenvalues:
0: -3.61803
1: -2.61803
2: -1.38197
3: -0.381966

========= ERROR SUMMARY: 0 errors
$

The output seems plausible to me, mostly matching the sample output provided with the FSU code linked above.
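
To scale this to the case in the question (5000 matrices of size 20x20), in principle only the constants and the initialization need to change; a sketch (all_mats below is a hypothetical host buffer holding the 5000 input matrices contiguously, num_mat*N*N doubles in total):

const int num_mat = 5000;  // one thread per matrix
const int N = 20;          // matrix dimension

// ... then, in main, in place of the three initialize_matrix calls:
for (int m = 0; m < num_mat; m++)
  initialize_matrix(m, N, all_mats + m*N*N, h_a);

The grid size calculation ((num_mat+nTPB-1)/nTPB) and the device heap sizing already scale with num_mat, so the rest of the program should not need changes.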

Robert Crovella