I am currently implementing Stochastic Gradient Descent on a GPU using CUDA, Thrust and cuBLAS.
In my initial implementation I used plain CUDA kernels to perform matrix-vector operations, and I'm now trying to optimize this by using cuBLAS for these operations instead.
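For concreteness, the kind of cuBLAS call I mean is a single-precision matrix-vector multiply. This is only a minimal sketch (the function name `gemv_cublas` and the device pointers `d_A`, `d_x`, `d_y` are placeholders; cuBLAS assumes column-major storage):

```cuda
#include <cublas_v2.h>

// Sketch: compute y = A * x for a rows x cols matrix A stored
// column-major on the device. d_A, d_x, d_y are device pointers
// allocated and filled elsewhere; handle comes from cublasCreate.
void gemv_cublas(cublasHandle_t handle,
                 const float *d_A, const float *d_x, float *d_y,
                 int rows, int cols)
{
    const float alpha = 1.0f, beta = 0.0f;
    // CUBLAS_OP_N: no transpose; lda = rows for column-major layout
    cublasSgemv(handle, CUBLAS_OP_N, rows, cols,
                &alpha, d_A, rows, d_x, 1, &beta, d_y, 1);
}
```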
What I'm currently observing is that for matrices of size rows x cols, plain CUDA consistently outperforms cuBLAS when the number of columns is small, apparently regardless of the number of rows. When the number of columns is large, however, the cuBLAS implementation wins out.
So I was wondering: are there any rules of thumb/guidelines for the minimal matrix/vector dimensions beyond which BLAS or cuBLAS will outperform plain C/CUDA, or is this entirely dependent on the application/BLAS function?