Are your matrices already on the GPU?
If not, CUBLAS might transfer them for you (known as thunking), which adds overhead.
Also, GPUs do not really shine for such small computations: the GPU will probably be slower than the CPU, because you still have to transfer the result back.
If you can, use bigger matrices.
Otherwise you might want to use streams (cudaStream_t) to start multiple parallel computations on the GPU.
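Here is a minimal sketch of the stream idea with the CUDA runtime API. The kernel `scale`, the number of streams and the buffer size are made up for illustration; with CUBLAS you would bind the handle to a stream (cublasSetStream in the v2 API) before each call instead of launching your own kernel. Whether the work really overlaps depends on the device and on how many resources each kernel needs.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel for illustration: multiplies every element by a factor.
__global__ void scale(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int kStreams = 4;   // assumed number of independent jobs
    const int n = 1 << 10;    // assumed (small) problem size per job
    cudaStream_t streams[kStreams];
    float* buffers[kStreams];

    // One stream and one device buffer per independent job
    for (int s = 0; s < kStreams; ++s)
    {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buffers[s], n * sizeof(float));
        cudaMemsetAsync(buffers[s], 0, n * sizeof(float), streams[s]);
    }

    // Work queued in different streams may run concurrently on the GPU
    for (int s = 0; s < kStreams; ++s)
        scale<<<(n + 255) / 256, 256, 0, streams[s]>>>(buffers[s], n, 2.0f);

    // Wait for all streams, then report any error that occurred
    cudaError_t result = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(result));

    for (int s = 0; s < kStreams; ++s)
    {
        cudaFree(buffers[s]);
        cudaStreamDestroy(streams[s]);
    }
    return result == cudaSuccess ? 0 : 1;
}
```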
If you want to measure the execution time of a kernel in CUDA, you need to enclose it (or anything else that computes on the GPU) between two events, like this when using the CUDA runtime API:
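cudaEvent_t start, stop;
// Events must be created before they can be recorded
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0); // takes the event itself (not a pointer) and a stream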
struct timeval cpuStart, cpuEnd;
gettimeofday(&cpuStart, 0); // get start time on CPU
// Do something with CUDA on the GPU, e.g. call kernels, transfer memory, ...
gettimeofday(&cpuEnd, 0); // get end time on CPU
double seconds = cpuEnd.tv_sec - cpuStart.tv_sec;
double microseconds = cpuEnd.tv_usec - cpuStart.tv_usec;
double cpuDuration = (seconds * 1.0e6 + microseconds) / 1.0e3; // in milliseconds
cudaEventRecord(stop, 0);
// Wait until the stop event occurred
cudaError_t eventResult;
do
{
    eventResult = cudaEventQuery(stop);
}
while (eventResult == cudaErrorNotReady);
// Assert there was no error; check the CUDA Toolkit Reference for further info
assert(cudaSuccess == eventResult); // requires #include <assert.h> or <cassert>
// Retrieve the time
float gpuDuration = 0.0; // in milliseconds
cudaEventElapsedTime(&gpuDuration, start, stop);
// Release the event objects
cudaEventDestroy(stop);
cudaEventDestroy(start);
You might want to check the error code of every CUDA call (at least with an assert): a call may report an error caused by an earlier, asynchronous call, and tracking that down otherwise costs hours of debugging.
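One common way to do that is a small wrapper macro. The sketch below assumes the runtime API; `CUDA_CHECK` is a name I made up, not something CUDA provides. It uses cudaEventSynchronize instead of the polling loop, which blocks until the event has occurred.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper macro (not part of CUDA): print the error and exit
// if a runtime API call does not return cudaSuccess.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

int main()
{
    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    CUDA_CHECK(cudaEventRecord(start, 0));
    // ... GPU work would go here ...
    CUDA_CHECK(cudaEventRecord(stop, 0));
    CUDA_CHECK(cudaEventSynchronize(stop)); // block until stop has occurred

    float gpuDuration = 0.0f; // in milliseconds
    CUDA_CHECK(cudaEventElapsedTime(&gpuDuration, start, stop));
    std::printf("GPU time: %f ms\n", gpuDuration);

    CUDA_CHECK(cudaEventDestroy(stop));
    CUDA_CHECK(cudaEventDestroy(start));
    return 0;
}
```

Kernel launches do not return an error code themselves, so after a launch you would check cudaGetLastError() in the same way.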
(Note: I mostly use the CUDA driver API, so this might not work out of the box. Sorry for that.)
EDIT: Just saw that you want to measure the call itself, not the duration of the kernel.
You can do that by simply measuring the time on the CPU for the call - see the updated code above.
This works only on Linux and other POSIX systems, because gettimeofday is not available on Windows (AFAIK).
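If you need the CPU-side measurement to be portable, std::chrono is one alternative to gettimeofday. This is only a sketch and assumes a C++11 compiler; `measureCpuMilliseconds` is a hypothetical helper name, not part of CUDA or the standard library.

```cpp
#include <chrono>

// Hypothetical helper: runs `work` once and returns the wall-clock time it
// took on the CPU, in milliseconds. steady_clock is monotonic, so the result
// is not affected by system clock adjustments.
template <typename Work>
double measureCpuMilliseconds(Work work)
{
    const auto begin = std::chrono::steady_clock::now();
    work(); // e.g. a lambda wrapping the CUBLAS call or kernel launch
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - begin).count();
}
```

You would pass the CUDA call as a lambda, e.g. measureCpuMilliseconds([&] { /* call goes here */ }). Since kernel launches are asynchronous, this measures only the call itself; add cudaDeviceSynchronize() inside the lambda if the GPU work should be included.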