
While working through some basic CUDA examples from NVIDIA, I copied some code to test the speedup from CPU to GPU computing for matrix multiplication.

After 30 minutes of looking at the results and seeing my CPU (yes, the CPU) compute 1000 times faster than my GPU, I realised that the timing was not working correctly. A snippet of the code looks like this (this is code from NVIDIA):

//Create timers
cudaEvent_t start;
cudaEvent_t stop;
float simpleKernelTime;
float optimisedKernelTime;
float elapsedTime; // written by the cudaEventElapsedTime() calls below

//start timer
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);

matrixMultKernel<<<grid, block >>>(a_d, b_d, c_d, N);

cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsedTime, start, stop);

// Print time and do other things

cudaEventRecord(start, 0);

matrixMultCPU(a_h, b_h, d_, N);

cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsedTime, start, stop);

// Print time

This code works fine on a Linux machine (I copied the same code as the person next to me, and he was getting good timings), but on a Windows 8 machine with Visual Studio 2013 the timing of the CPU part (the second half of the snippet) was not working (it always gave ~0.003 ms).

Why is this happening? I fixed it using <time.h> (removing the cudaEventRecord() calls and using standard C timing approaches), so I am not asking how to fix it, but rather why it happens.

Brian Tompsett - 汤莱恩
Ander Biguri
  • @buttifulbuttefly nononono, I removed `cudaEventRecord` calls and used standard C timing. – Ander Biguri Jun 25 '15 at 09:36
  • About the close vote: this code *is working* and I know how to make it work properly. I am not asking for code debugging help; it's a good theoretical question, I believe. – Ander Biguri Jun 25 '15 at 09:54
  • On Linux and with the Windows TCC driver, work is submitted directly from the driver into the GPU push buffer. With the Windows WDDM driver, work is submitted into a software queue. When this queue overflows, the work is submitted to the WDDM kernel-mode driver in a command buffer, and the driver submits the full command buffer to the GPU. If you add the call cudaEventQuery(0) after cudaEventRecord(start...) you should see behavior closer to Linux, as this will flush the queue. That said, do not use cudaEventRecord or clock to time CPU code. Use the platform's high-precision timer. – Greg Smith Jun 25 '15 at 20:28

1 Answer


From what I understand, CUDA events are not designed to measure CPU-only (host-only) time per se, but rather kernel execution and CUDA API calls. From the CUDA C Programming Guide, section 3.2.5.6, "Events" (emphasis mine):

The runtime also provides a way to closely monitor the device's progress, as well as perform accurate timing, by letting the application asynchronously record events at any point in the program and query when these events are completed.

I am also surprised that you get any time at all (kernel launches are asynchronous), as your code is missing cudaEventSynchronize():

cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsedTime, start, stop);

See also How to Implement Performance Metrics in CUDA C/C++.

For CPU-only time measurement see this thread.

EDIT:

To get the correct time for matrixMultCPU() you need to add synchronization on the start event:

cudaEventRecord(start, 0);
cudaEventSynchronize(start);
Community
Grzegorz Szpetkowski
  • Oops, my mistake! I do use `cudaEventSynchronize(stop);` in my original code. However, `cudaEventRecord()` works for CPU code on other systems/compilers (not sure why). I mean, the code is written by NVIDIA, not me, and I have seen it time properly on other OSes/compilers; it is just not working on my system. – Ander Biguri Jun 25 '15 at 09:49
  • Maybe my question should have been formulated the other way around, then: why does `cudaEventRecord()` work for non-GPU code timing on Linux with `nvcc`? – Ander Biguri Jun 25 '15 at 09:53
  • @AnderBiguri: I edited my answer. See if that works for you. Probably the GNU/Linux implementation synchronizes `start` implicitly. – Grzegorz Szpetkowski Jun 25 '15 at 10:12
  • I see, thanks. I will leave this open for a while because I am kind of interested in why this is happening. – Ander Biguri Jun 25 '15 at 10:37