While working through some of NVIDIA's basic CUDA examples, I copied some code to test the speedup of matrix multiplication on the GPU versus the CPU.
After 30 minutes of looking at the results and seeing my CPU (yes, CPU) doing the computation about 1000 times faster than my GPU, I realised that the timing was not working correctly. A snippet of the code looks like this (the code is from NVIDIA):
// Create timers and a variable to hold the measured time
cudaEvent_t start;
cudaEvent_t stop;
float elapsedTime;
// Start timer for the GPU kernel
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
matrixMultKernel<<<grid, block>>>(a_d, b_d, c_d, N);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsedTime, start, stop);
// Print time and do other things
cudaEventRecord(start, 0);
matrixMultCPU(a_h, b_h, d_, N);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsedTime, start, stop);
// Print time
This code works fine on a Linux machine (I copied the same code as the person next to me, and he was getting correct timings), but on a Windows 8 machine with Visual Studio 2013 the timing of the CPU part (the second half of the snippet) did not work: it always reported ~0.003 ms.
Why is this happening? I already fixed it by switching to <time.h>
(removing the cudaEventRecord()
calls and using standard C timing), so I am not asking how to fix it, but rather why it happens.
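
For reference, this is roughly the <time.h>-based timing I replaced the second pair of event calls with (the variable names here are just for illustration):

#include <stdio.h>
#include <time.h>

clock_t cpuStart = clock();
matrixMultCPU(a_h, b_h, d_, N);
clock_t cpuStop = clock();

// clock() returns ticks; convert to milliseconds
double cpuTimeMs = 1000.0 * (double)(cpuStop - cpuStart) / CLOCKS_PER_SEC;
printf("CPU matrix multiply took %f ms\n", cpuTimeMs);

Timed this way, the CPU part reports sensible numbers on both machines.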