
Recently, when I used cuSPARSE and cuBLAS from CUDA Toolkit 6.5 to do sparse matrix multiplication, I found that cuSPARSE is much slower than cuBLAS in all cases!

In all my experiments I used cusparseScsrmm from cuSPARSE and cublasSgemm from cuBLAS. In the sparse matrix, half of the elements are zero. The GPU is an NVIDIA Titan Black, and all timings were obtained with NVIDIA's nvvp profiler. Below are some of the results:

Experiment A:

  1. sparse matrix size: 192x2400
  2. dense matrix size: 2400x256
  3. cusparse time: 1.4ms
  4. cublas time: 0.21ms

Experiment B:

  1. sparse matrix size: 192x75
  2. dense matrix size: 75x1024
  3. cusparse time: 0.27ms
  4. cublas time: 0.04ms

These results seem very odd to me. Since cuSPARSE is designed specifically for sparse matrix operations, how can it be even slower than cuBLAS? If that is the case, there is no point in using cuSPARSE at all. Could you please explain these results? Also, could you suggest any other way to speed up sparse matrix multiplication?
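
For reference, the comparison I am timing essentially boils down to the two calls sketched below. This is only a sketch: handles, the matrix descriptor and the device buffers (d_*) are created, filled and freed elsewhere, error checking is omitted, and the buffer names are placeholders. For Experiment A the dimensions would be m = 192, k = 2400, n = 256.

    #include <cublas_v2.h>
    #include <cusparse_v2.h>

    /* A is m x k (CSR and dense copies), B is k x n, C is m x n;
       all dense arrays are in column-major order.
       Each library call is timed separately in nvvp. */
    void compare_csrmm_vs_gemm(cusparseHandle_t sp, cublasHandle_t bl,
                               cusparseMatDescr_t descrA,
                               int m, int n, int k, int nnz,
                               const float *d_csrVal, const int *d_csrRowPtr,
                               const int *d_csrColInd,   /* A in CSR form */
                               const float *d_Adense,    /* A as dense    */
                               const float *d_B, float *d_C)
    {
        const float alpha = 1.0f, beta = 0.0f;

        /* C = A * B with A stored as CSR (the cuSPARSE timing) */
        cusparseScsrmm(sp, CUSPARSE_OPERATION_NON_TRANSPOSE,
                       m, n, k, nnz, &alpha, descrA,
                       d_csrVal, d_csrRowPtr, d_csrColInd,
                       d_B, k,           /* ldb = k */
                       &beta, d_C, m);   /* ldc = m */

        /* C = A * B with A stored densely (the cuBLAS timing) */
        cublasSgemm(bl, CUBLAS_OP_N, CUBLAS_OP_N,
                    m, n, k, &alpha,
                    d_Adense, m,         /* lda = m */
                    d_B, k,              /* ldb = k */
                    &beta, d_C, m);      /* ldc = m */
    }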


1 Answer

I don't think you can classify a matrix with half of its elements equal to zero as "sparse": the timings you found are reasonable (actually, the sparse algorithm is behaving quite well here!).

Sparse algorithms are efficient only for matrices where the vast majority of the elements are zeros (for example, matrices arising from finite element problems).

This holds true for CPUs, not only for GPUs: there is a significant overhead in treating a matrix as sparse, and sparse algorithms only pay off when... most of the elements are zeros (typically ten or fewer non-zeros per row, with matrix dimensions in the thousands to hundreds of thousands, or even millions).

There are other matrix structures with efficient dedicated algorithms that you can try if they apply to your problem, e.g. band matrices. I don't know whether they have been ported to cuBLAS, though.

About the overheads

Dense linear algebra algorithms can perform near-optimally because processors have been designed to solve such systems as efficiently as possible. Consider the DGEMM operation (matrix-matrix multiply): for large matrices (i.e., matrices not fitting in any cache of the system) it lets you use the processor at >95% of its theoretical peak floating-point performance. How?

  • prefetching
  • optimal cache usage
  • vectorization (SSE, AVX)
  • pipelining
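
As a toy illustration of the cache-blocking idea (nothing like a production BLAS kernel, but it shows the structure that makes prefetching and vectorization possible):

    #include <stddef.h>

    #define BS 64   /* block size chosen so a few BS x BS tiles fit in cache */

    /* Toy cache-blocked C += A*B for square row-major matrices, n divisible by BS.
       The innermost loop walks contiguous memory with unit stride, which the
       compiler can prefetch and vectorize (SSE/AVX); the tiles keep the working
       set in cache so each loaded element is reused many times. */
    void blocked_gemm(size_t n, const float *A, const float *B, float *C)
    {
        for (size_t ii = 0; ii < n; ii += BS)
            for (size_t kk = 0; kk < n; kk += BS)
                for (size_t jj = 0; jj < n; jj += BS)
                    for (size_t i = ii; i < ii + BS; ++i)
                        for (size_t k = kk; k < kk + BS; ++k) {
                            const float a = A[i * n + k];
                            for (size_t j = jj; j < jj + BS; ++j)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }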

In a sparse LA algorithm only the non-zero elements and their corresponding indices are stored in memory, so memory accesses are indirect. As a result, the sparse algorithm cannot exploit the hardware at the same level of optimization: I don't know specific figures for this context, but 10 to 20% of peak wouldn't be unusual.
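
To make the "indirect access" point concrete, here is a rough single-threaded sketch of a CSR matrix-vector product (illustrative only, not the actual library kernel):

    /* y = A*x with A (n_rows rows) stored in CSR format: val and col_ind hold
       only the non-zero entries, row_ptr[i]..row_ptr[i+1] delimits row i.
       Note the gather x[col_ind[j]]: the access pattern is data-dependent,
       which defeats prefetching and straightforward vectorization. */
    void csr_spmv(int n_rows, const int *row_ptr, const int *col_ind,
                  const float *val, const float *x, float *y)
    {
        for (int i = 0; i < n_rows; ++i) {
            float sum = 0.0f;
            for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
                sum += val[j] * x[col_ind[j]];   /* indirect load */
            y[i] = sum;
        }
    }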

The gain, of course, is that operations on zeros (on elements that are not stored) are simply not performed, resulting in orders of magnitude fewer operations and much less required storage.
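
In your case even the storage gain essentially vanishes. A quick back-of-the-envelope check with the Experiment A sizes from the question (4-byte floats, 4-byte indices, assuming CSR format):

    #include <stdio.h>

    int main(void)
    {
        /* Experiment A: 192 x 2400 matrix, about 50% non-zeros */
        const long m = 192, n = 2400;
        const long nnz = m * n / 2;

        const long dense_bytes = m * n * (long)sizeof(float);
        const long csr_bytes   = nnz * (long)(sizeof(float) + sizeof(int)) /* values + column indices */
                               + (m + 1) * (long)sizeof(int);              /* row pointers            */

        /* Prints dense = 1843200 bytes, CSR = 1843972 bytes: at 50% density the
           CSR representation is actually slightly LARGER than the dense one,
           so there is no memory-traffic saving to offset the indirection. */
        printf("dense: %ld bytes, CSR: %ld bytes\n", dense_bytes, csr_bytes);
        return 0;
    }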

There are further overheads from integer logic and conditionals, but modern CPUs are rather good at overlapping integer and FP operations and at speculative execution. Unfortunately, these can also prevent vectorization, adding yet more overhead relative to the dense case.

What about GPUs?

Dense LA algorithms are an optimal fit for GPUs, just as they are for CPUs: in this case you make optimal use of:

  • coalescing
  • shared memory
  • memory access patterns

Again, the indirect access to matrix elements in a sparse LA algorithm prevents exploiting the same level of optimization.
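
As a rough GPU-side illustration (again a naive sketch, not cuSPARSE's actual kernel): in a simple one-row-per-thread CSR kernel, the loads go through col_ind, so neighbouring threads in a warp hit unrelated addresses and coalescing is largely lost, whereas a dense kernel can arrange fully coalesced, shared-memory-tiled accesses.

    /* Naive one-row-per-thread y = A*x kernel for a CSR matrix (illustrative only).
       The load x[col_ind[j]] is a gather: threads of the same warp read scattered
       addresses, so these accesses cannot be coalesced the way the regular,
       consecutive loads of a dense kernel can. */
    __global__ void csr_spmv_kernel(int n_rows, const int *row_ptr,
                                    const int *col_ind, const float *val,
                                    const float *x, float *y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n_rows) {
            float sum = 0.0f;
            for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
                sum += val[j] * x[col_ind[j]];   /* uncoalesced gather */
            y[row] = sum;
        }
    }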

References

I can't remember which one I used when I encountered sparse problems... I think it was PSBLAS: http://people.uniroma2.it/salvatore.filippone/psblas/

But here you will be overwhelmed by them: http://www.netlib.org/utk/people/JackDongarra/la-sw.html

Sigi
  • Thanks a lot for your answer! Btw, could you give me a brief description of the "overhead" of treating a matrix as sparse that you mentioned above? Some related references would also be much appreciated! – ROBOT AI May 19 '15 at 05:40
  • I added some information to the answer to address your further questions. – Sigi May 21 '15 at 10:13
  • Pretty detailed answer! Thank you so much for your kind help! I still have two questions. The first is to confirm that for my problem, MKL's sparse matrix multiplication function will also be slower than its dense counterpart on the CPU, right? The second is: although strictly speaking my problem is not large-scale and truly sparse, is there any BLAS library or approach that could still speed up the matrix multiplication? By the way, in my problem the sparse matrix has no special structure and the non-zero entries are distributed randomly. Thanks again! – ROBOT AI May 21 '15 at 13:56
  • Yes, I guess you will observe a slowdown with MKL too. About the other question, I don't know; it depends on the problem: there is A LOT of literature on methods for improving the solution of sparse problems. You would probably do better to ask in a more mathematically-oriented forum. – Sigi May 22 '15 at 10:34
  • Thanks a lot for your answer! Actually, I now want to find BLAS libraries or approaches that can take advantage of the sparsity in my problem, so that the matrix multiplication is faster than using the dense functions of BLAS libraries. For now, it seems my best choice is still to use the dense functions, right? – ROBOT AI May 22 '15 at 13:26