I have a test application that performs matrix multiplication, and I tried to offload it to the GPU with NVBLAS.
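#include <armadillo>
#include <cstdlib>   // atoi
#include <iostream>
using namespace arma;
using namespace std;
int main(int argc, char *argv[]) {
    // Expect four arguments: matrix dimensions m, k, n and the repeat count t.
    if (argc < 5) {
        cerr << "usage: " << argv[0] << " m k n t" << endl;
        return 1;
    }
    int m = atoi(argv[1]);
    int k = atoi(argv[2]);
    int n = atoi(argv[3]);
    int t = atoi(argv[4]);
    cout << "m::" << m << "::k::" << k << "::n::" << n << endl;
    mat A = randu<mat>(m, k);
    mat B = randu<mat>(k, n);
    mat C;
    C.zeros(m, n);
    cout << "norm c::" << norm(C, "fro") << endl;
    // Armadillo exposes timing through the wall_clock class.
    wall_clock timer;
    timer.tic();
    for (int i = 0; i < t; i++) {
        C = A * B;  // dispatched to the linked BLAS dgemm
    }
    cout << "time taken ::" << timer.toc() / t << endl;  // average seconds per multiply
    cout << "norm c::" << norm(C, "fro") << endl;
    return 0;
}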
I compiled the code as follows.
CPU
g++ testmm.cpp -I$ARMADILLO_INCLUDE_DIR -lopenblas -L$OPENBLAS_ROOT/lib/ --std=c++11 -o a.cpu.out
GPU
g++ testmm.cpp -I$ARMADILLO_INCLUDE_DIR -lopenblas -L$OPENBLAS_ROOT/lib/ --std=c++11 -lnvblas -L$CUDATOOLKIT_HOME/lib64 -o a.cuda.out
When I run a.cpu.out and a.cuda.out with 4096 4096 4096 as the matrix dimensions, both take about the same time, around 11 seconds. I am not seeing any reduction in time with a.cuda.out. In nvblas.conf, I left everything at the defaults except for (a) changing the path to the OpenBLAS library and (b) enabling auto-pinned memory. The nvblas.log file says it is using "Devices 0" and contains no other output. nvidia-smi does not show any increase in GPU activity, and nvprof shows a number of cudaMalloc, cudaMemcpy, and device capability query calls, but no gemm call appears.
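For reference, a minimal sketch of what an nvblas.conf with the two changes described above might look like. The OpenBLAS path is a hypothetical placeholder, and the log file name is assumed from the nvblas.log mentioned above; this is not a copy of the actual file.

```
# nvblas.conf (sketch, not the actual file)

# Log file written by NVBLAS
NVBLAS_LOGFILE nvblas.log

# CPU BLAS used as fallback; the path below is a hypothetical placeholder
NVBLAS_CPU_BLAS_LIB /path/to/openblas/lib/libopenblas.so

# Enable automatic pinning of host memory
NVBLAS_AUTOPIN_MEM_ENABLED
```

NVBLAS reads this file from the current directory by default, or from the location given in the NVBLAS_CONFIG_FILE environment variable.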
Running ldd on a.cuda.out shows that it is linked against nvblas, cublas, cudart, and the CPU OpenBLAS library. Am I making a mistake somewhere?