I use CUDA 5.0, and I want to compare matrix multiplication in C and cuBLAS. I already wrote a program in which matrix multiplication in C and cuBLAS both gave correct answers.
Now I want to compare their performance. For the C implementation I timed the multiplication with clock(). For cuBLAS I found that cutil no longer exists in CUDA 5.0, so I used cudaEvent instead. Both implementations use the same matrices. In C, I measured only the time spent in the matrix multiplication itself, while for cuBLAS I measured from createhandle until destroyhandle.
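For reference, the CPU side of this measurement can be sketched roughly as follows. This is only a minimal sketch, not my actual program: the names matmul_cpu and time_matmul_ms, the row-major layout, and the repetition count are assumptions made for illustration. It repeats the multiplication several times and averages, because a single run this short may be close to the resolution of clock().

```c
#include <time.h>

/* Hypothetical helper: naive row-major product c = a * b for n x n matrices. */
void matmul_cpu(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* Hypothetical helper: average CPU time of one multiplication in
 * milliseconds, measured with clock() over `reps` repetitions.
 * Only the multiplication is inside the timed region; allocation
 * and initialization of the matrices are left to the caller. */
double time_matmul_ms(const float *a, const float *b, float *c, int n, int reps)
{
    clock_t start = clock();
    for (int r = 0; r < reps; ++r)
        matmul_cpu(a, b, c, n);
    clock_t stop = clock();

    /* clock() counts processor time in ticks of 1/CLOCKS_PER_SEC seconds. */
    return 1000.0 * (double)(stop - start) / CLOCKS_PER_SEC / reps;
}
```

A caller would allocate and fill the three n x n arrays first, then call time_matmul_ms(a, b, c, n, reps) and print the returned value. Note that clock() reports processor time used by the program, not wall-clock time.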
I got this result:
The C version takes just 0.08 ms, while cuBLAS takes 59 ms. When I then used clock() to time cuBLAS as well, cuBLAS came out faster than C. I don't know whether my timing method is correct. Why do cudaEvent and clock() give different answers?
I used cuBLAS and cudaEvent just as described in NVIDIA's documentation. I'm really puzzled about how to measure the time correctly.