
I am currently implementing Stochastic Gradient Descent on a GPU using CUDA, Thrust and cuBLAS.

In my initial implementation I used plain CUDA to perform matrix-vector operations, and now I'm trying to optimize this by using cuBLAS for such operations instead.
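For context, a minimal sketch of what swapping a hand-written matrix-vector kernel for cuBLAS could look like (function and variable names are illustrative, error checking omitted, matrix assumed column-major as cuBLAS expects):

```cuda
#include <cublas_v2.h>

// Compute y = A * x for a rows x cols matrix A stored column-major on the
// device, using cuBLAS SGEMV instead of a hand-written mat-vec kernel.
void matvec_cublas(cublasHandle_t handle,
                   const float* d_A, const float* d_x, float* d_y,
                   int rows, int cols)
{
    const float alpha = 1.0f;
    const float beta  = 0.0f;
    // lda is the leading dimension; for a dense column-major matrix it is rows.
    cublasSgemv(handle, CUBLAS_OP_N,
                rows, cols,
                &alpha, d_A, rows,
                d_x, 1,
                &beta, d_y, 1);
}
```

The handle should be created once with `cublasCreate` and reused across SGD iterations, since handle creation is comparatively expensive.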

What I'm observing currently is that for matrices of size rows x cols, with a small number of columns, plain CUDA consistently outperforms cuBLAS, apparently regardless of the number of rows. With a large number of columns, however, the cuBLAS implementation wins out.

So I was wondering: are there any rules of thumb/guidelines about the minimal matrix/vector dimensions beyond which BLAS or cuBLAS will outperform plain C/CUDA, or is this completely dependent on the application/BLAS function?

Bar
  • Related question: http://stackoverflow.com/q/26417475/209882 – Bar Feb 05 '16 at 16:15
  • Note that BLAS2 (matrix-vector) operations tend to be limited by memory throughput. If possible, you would want to use BLAS3 operations. There are many different BLAS2 operations, each with their own performance characteristics (which may further differ by GPU architecture), so your question seems too broad. Check whether any of the batched operations are applicable to your use case, as they offer better performance for small matrices which otherwise only use a portion of the machine resources. – njuffa Feb 05 '16 at 16:16
  • You do exactly what was in the question you linked to - benchmark for your problem size domain and hardware and use that data to drive your heuristics. I am *very* tempted to close this as duplicate of that question. – talonmies Feb 07 '16 at 06:35
  • @talonmies I was wondering if anyone already had experience with this. I have run the benchmarks and posted as an answer, I hope that is OK. – Bar Feb 23 '16 at 11:49
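As a concrete illustration of the batched-operation suggestion in the comments above, many small independent matrix products can be done in a single call to `cublasSgemmBatched` rather than one under-occupied launch per matrix (a sketch; array names are illustrative, and the pointer arrays must live in device memory):

```cuda
#include <cublas_v2.h>

// Multiply batchCount small (m x k) matrices by (k x n) matrices in one call.
// d_Aarray, d_Barray, d_Carray are device arrays of device pointers, one
// pointer per matrix in the batch; all matrices are column-major.
void small_gemms(cublasHandle_t handle,
                 const float* const* d_Aarray, const float* const* d_Barray,
                 float* const* d_Carray,
                 int m, int n, int k, int batchCount)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                       m, n, k,
                       &alpha, d_Aarray, m,
                       d_Barray, k,
                       &beta, d_Carray, m,
                       batchCount);
}
```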

1 Answer


I have run a few benchmarks, which I will post here. The results are for a linear regression task running 10 iterations of SGD on datasets with 10,000 rows. The implementation and more results are available here: https://github.com/thvasilo/cuda-sgd-sese-project

Runtimes for 10-100 features/columns:

[plot: runtime vs. number of columns (10–100), plain CUDA vs. cuBLAS]

So for my implementation, the crossover point at which plain CUDA becomes slower is at 50 columns. There is a jump in runtime at 100 features for cuBLAS, but that could be an artifact; these experiments were run only once, and the differences are not that large anyway.
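One way to act on such a measurement is a simple size-based dispatch, with the crossover taken from benchmarks like these. A sketch (the 50-column threshold is specific to this implementation and GPU, and `my_matvec_kernel`/`blocks` stand in for the hand-written kernel and its launch configuration):

```cuda
// Crossover measured for this implementation/GPU; re-benchmark on new hardware.
const int CUBLAS_MIN_COLS = 50;

void matvec(cublasHandle_t handle, const float* d_A, const float* d_x,
            float* d_y, int rows, int cols)
{
    if (cols < CUBLAS_MIN_COLS) {
        // The hand-written kernel wins for narrow matrices.
        my_matvec_kernel<<<blocks(rows), 256>>>(d_A, d_x, d_y, rows, cols);
    } else {
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemv(handle, CUBLAS_OP_N, rows, cols,
                    &alpha, d_A, rows, d_x, 1, &beta, d_y, 1);
    }
}
```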

When running with more columns, the BLAS level-2 implementation consistently performs better:

[plot: runtime vs. number of columns for larger column counts]

Bar