
Matlab still can't work with sparse matrices on a CUDA GPU, and there are no toolboxes for it either (Jacket has been discontinued). That's why I am using CUSP, integrated into Matlab through a MEX file. However, my tool has two problems:

  • It is VERY unstable for big equation systems (actually starting from only 100 elements),
  • It is tens or hundreds of times slower than the Matlab CPU alternative.

I'm solving A*x = b, where A is a sparse symmetric matrix and b is a vector.

Hardware specs: Intel i7 3630QM, GT640M 2G, 8 GB DDR3. Software: Windows 8 64 bit, Matlab R2012b 64 bit, CUDA 5.0 64 bit, CUSP 0.3.1, Windows SDK v7.0, VS2010 compiler.

MEX code:

#include<cusp/csr_matrix.h>
#include <cusp/krylov/bicgstab.h>
#include <matrix.h>
#include <mex.h> 
#include <time.h>

void mexFunction(int nlhs,mxArray *plhs[],int nrhs,const mxArray *prhs[])
{
        double t1 =  clock();
          // data from Matlab       
        double *b = mxGetPr(prhs[1]);
        double *A = mxGetPr(prhs[0]);
        int n = mxGetM(prhs[0]);
        mwIndex *ir = mxGetIr(prhs[0]);
        mwIndex *jc = mxGetJc(prhs[0]);
        int N = jc[n];          // number of nonzeros
        t1 = clock() - t1;

        double t2 =  clock();
          // initialize matrix A in CSR format; Matlab stores sparse matrices in
          // CSC (jc = column pointers, ir = row indices), but for a symmetric
          // matrix CSC and CSR coincide, so jc/ir can be used directly
        cusp::csr_matrix<int,float,cusp::device_memory> Ag(n, n, N);
        thrust::copy(jc, jc + n + 1, Ag.row_offsets.begin());    // mwIndex -> int
        thrust::copy(ir, ir + N,     Ag.column_indices.begin());
        thrust::copy(A,  A  + N,     Ag.values.begin());         // double -> float (precision loss!)
          // initialization of vector b
        cusp::array1d<float, cusp::device_memory> bg (b, b+n);
        cusp::array1d<float, cusp::device_memory> xg (n, 0);
        t2 = clock() - t2;

        double t3 =  clock();
          // BiCGStab solve for x with relative residual tolerance 1e-3,
          // at most 5000 iterations, and preconditioner M
          // this is by far the slowest part
        cusp::verbose_monitor<float> monitor(bg, 5000, 1e-3);
        cusp::identity_operator<float, cusp::device_memory> M(n, n);
        cusp::krylov::bicgstab(Ag, xg, bg, monitor, M);        
        t3 = clock() - t3;

        double t4 =  clock();     
          // copy solution vector back from device into Matlab array T
        mxArray *T = mxCreateDoubleMatrix(n, 1, mxREAL);
        double *x  = mxGetPr(T);
        thrust::copy(xg.begin(), xg.end(), x);
        t4 = clock() - t4;

          // gather execution times (clock ticks; milliseconds on Windows) into Matlab array times
        mxArray *times=mxCreateDoubleMatrix(5, 1, mxREAL);
        double *timesb=mxGetPr(times);
        timesb[0]=t1; timesb[1]=t2; timesb[2]=t3; timesb[3]=t4; timesb[4]=monitor.iteration_count();

          // sending data back to Matlab
        plhs[0] = times; 
        plhs[1] = T;
} 

Compile this code into a MEX file (ex.cu) from Matlab using these commands (change the second command for 32 bit if necessary):

>> !nvcc -c -arch sm_20 ex.cu -Xcompiler -fPIC -I "C:\Program Files\MATLAB\R2012b\extern\include" -I "C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\include"
>> mex ex.obj -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\lib\x64" -lcudart

Sample matrices, vectors and compiled 64 bit MEX function: http://www.failai.lt/3fqkhvoslxyt/sampleData.7z.htm

Use:

tic; [times,x]=ex(K',F); toc;   %K has to be transposed for CSR

where times holds the separate execution times (the last element is the number of iterations the BiCGStab monitor used) and x is the solution of K*x=F.

Results (http://www.failai.lt/rupaliln7kfb/results.7z.htm):

  • K_int_6, F_int_6 - ok
  • K_11, F_11 - x(1) wrong, others ok
  • K_100000, F_100000 - x(1) wrong, the others start out OK but drift further and further from the correct result.
  • K_100000, F_100000 - execution takes 0.6 s on the GPU (MEX) versus 0.014 s on the CPU (tic;xcpu=K\F;toc;).

Could you look at the code, maybe try the MEX function, report your results, and suggest how to improve it? Do you know of any alternatives that enable sparse computations on a GPU? I hope it will be useful for everyone until Matlab adds support for sparse matrices on the GPU :)

  • This probably isn't your entire problem but look here: `mxArray *T = mxCreateDoubleMatrix(n, 1, mxREAL); double *x = new double[n]; x = mxGetPr(T);`. I think this is incorrect. You don't need to allocate space for x, you can just leave it as a pointer. I also think this will cause a memory leak since you aren't using `mxCalloc` or explicitly `free`ing it. But you shouldn't be allocating space for it anyway since it's already been allocated with `mxCreateDoubleMatrix`. – JustinBlaber Apr 09 '13 at 17:12
  • You are right. x shouldn't be initialized separately :) However, it doesn't change the problems. – Aurimas Šimkus Apr 09 '13 at 17:31
  • Could you comment out the computational parts and only leave the memory transfers between the CPU RAM and GPU RAM to see how much time the transferring of the data costs? If that doesn't account for a large chunk of the time then we can be assured the computation is the bottleneck. – JustinBlaber Apr 09 '13 at 18:01
  • With computation: t1=0, t2=8, t3=642, t4=2, whole(tic/toc)=0.657. With cusp::krylov::bicgstab(Ag, xg, bg, monitor, M); (the computation) commented out: t1=0, t2=7, t3=1, t4=3, whole(tic/toc)=0.016. So the computation is the bottleneck. However, it is not the worst problem at the moment. Firstly, the computation results should be correct for all sizes of the system, but they aren't correct for now. – Aurimas Šimkus Apr 09 '13 at 18:37
  • Well it appears the memory transfers (0.016s) are taking longer than just computing the problem on the cpu (0.014s) itself so it seems futile to even continue any further. Honestly with times these low I don't really see the point of using cuda. But anyway, I noticed your preconditioner is just the identity matrix, is this the best preconditioner to use for your matrix? Could you give an example of your matrix for like a 10x10 case? – JustinBlaber Apr 09 '13 at 18:38
  • The real idea is to compute FEM in the time domain, so the computation part on the GPU would be done maybe 100 or 1000 times and only after that would the results be returned to Matlab. I've just attached the results file, it's at http://www.failai.lt/rupaliln7kfb/results.7z.htm – Aurimas Šimkus Apr 09 '13 at 18:59
  • For 6x6, results on CPU are (150,88.8364,61.7447,49.8237,44.7565,43.0078), GPU: (149.8132,88.8364,61.744,49.8234,44.7544,43.0057). For 11x11, CPU: (150,138.7861,128.6942,119.6097,111.4294,104.0605,97.4169,91.4279,58.2982,46.8002,43.3421), GPU: (5.287,138.7996,128.7208,119.6478,111.4766,104.1161,97.4726,91.484,58.3094,46.7202,43.2098). For 100000x100000 it's similar to 11x11 (the first element is computed incorrectly), and further element results get worse and worse. You can see more detailed results in the files linked from the main post. – Aurimas Šimkus Apr 09 '13 at 19:01
  • Working on other preconditioners... With `cusp::precond::diagonal M(Ag);` the first x value is corrected, but the remaining error is the same (huge), though using 0.01 accuracy and 103 iterations out of the 5000 allowed. Continuing the investigation of preconditioners... – Aurimas Šimkus Apr 09 '13 at 19:13
  • Aurimas, you shouldn't be picking preconditioners at random. They need to approximate the matrix you're trying to invert. You said the matrix is symmetric, does it also happen to have any other structure such as positive definiteness or possibly a toeplitz structure? – JustinBlaber Apr 09 '13 at 20:54
  • Matrices are symmetric and positive definite, but not necessarily Toeplitz (just in these examples). I checked all preconditioners. It seems `scaled_bridson_ainv M(Ag, .1, 3);` works best (my matrix can have at most 3 nonzeros in a row in the tests). Maybe it gives faster and slightly more accurate results, but the problem persists - going from the first element of the result to the last, the values decrease too fast. If x(100000) should be ~43, it is ~0. The bigger the size, the bigger the problem, but I didn't see anything below 0 (the values should go from 150 down to 43). Thanks for your help :) – Aurimas Šimkus Apr 09 '13 at 21:38
  • Aurimas, you should be using the conjugate gradient (CG) method and not BICG. This is a very important consideration... – JustinBlaber Apr 09 '13 at 22:06
  • I tried it. The results are not far from BiCGStab's. Actually, I have just realized that the precision in the MONITOR doesn't directly mean the floating-point precision of the result, but the stopping precision for the iterations... So if I increase the number of elements, I have to increase the precision... I think so. But if it's too tight, I don't see the effect of increasing it. Moreover, with tight precision (<1e-5), the computation consumes ~20 s for 30000x30000, compared with 0.004 s on the CPU. Rhetorical q. - is CUSP comparable to Matlab? Maybe my task isn't adequate... but come on, it's a common sparse task, so CUSP should do it! – Aurimas Šimkus Apr 10 '13 at 01:08
  • Have you tried ViennaCL? It works great for BiCGStab, GMRES, and CG on the GPU, and it already has a MATLAB interface. http://viennacl.sourceforge.net/viennacl-about.html – johnish Apr 10 '13 at 02:09
  • Thanks, I didn't know about this tool. Actually, it (the interface) crashes in Matlab as far as I tried. But I checked their test files... They compare ViennaCL with the PCG tool in Matlab, which implements the CG iterative method (I didn't know). I tried it. Quite similar results to my CUSP function. It needs a small tolerance value and lots of iterations if you want an accurate result - which means a lot of time. It is much slower and less accurate compared to mldivide... which, it seems, is using direct methods on my matrices http://www.mathworks.se/help/matlab/math/sparse-matrix-operations.html#f6-9169 – Aurimas Šimkus Apr 10 '13 at 14:40

1 Answer


Take a look at the Matlab File Exchange: a CUSP sparse class for GPUs, with support for single precision and real/complex data: http://www.mathworks.com/matlabcentral/fileexchange/44423-gpu-sparse-accumarray-non-uniform-grid

Sparse matrix-vector multiply is overloaded with CUSP.