I have a program that, when profiled with nvprof, reports that ~98% of the execution time is spent in cudaDeviceSynchronize. In thinking about how to optimize the following code, I'm brought back here to try to confirm my understanding of the need for cudaDeviceSynchronize.

The general layout of my program is thus:

Copy input array to GPU.
program<<<1,1>>>(inputs)
Copy outputs back to host.
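
In host code that amounts to roughly the following (a sketch with made-up names and sizes, not my actual code):

#include <cuda_runtime.h>

__global__ void program(double *inputs);   // master kernel, shown below

void run(double *hostIn, double *hostOut, size_t n)
{
    double *devBuf;
    cudaMalloc(&devBuf, n * sizeof(double));

    // Copy input array to GPU.
    cudaMemcpy(devBuf, hostIn, n * sizeof(double), cudaMemcpyHostToDevice);

    // A single master thread drives the whole run on the device.
    program<<<1, 1>>>(devBuf);

    // Copy outputs back to host; this cudaMemcpy is issued to the default
    // stream, so it waits for the kernel to finish before it starts.
    cudaMemcpy(hostOut, devBuf, n * sizeof(double), cudaMemcpyDeviceToHost);

    cudaFree(devBuf);
}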

Thus, my program kernel is a master thread that basically looks like this:

for (int i = 0; i < 10000; i++)
{
    calcKs(inputs);
    takeStep(inputs);
}

The calcKs function is one of the most egregious abusers of cudaDeviceSynchronize and looks like this:

//Calculate k1's
//Calc fluxes for r = 1->(ml-1), then for r = 0, then calc K's
zeroTemps();
calcFlux<<< numBlocks, numThreads >>>(concs, temp2);        //temp2 calculated from concs
cudaDeviceSynchronize();
calcMonomerFlux(temp2, temp1);                              //temp1 calculated from temp2
cudaDeviceSynchronize();
calcK<<< numBlocks, numThreads >>>(k1s, temp2);             //k1s calculated from temp2
cudaDeviceSynchronize(); 

where the arrays temp2, temp1 and k1s are each calculated from one another's results. My understanding was that cudaDeviceSynchronize was essential because I need temp2 to be completely calculated before temp1 is calculated, and likewise for temp1 and k1s.

I feel like I've critically misunderstood the function of cudaDeviceSynchronize after reading this post: When to call cudaDeviceSynchronize?. I'm not sure how pertinent the comments there are to my situation, however, as all of my program is running on the device and there's no CPU-GPU interaction until the final memory copy back to the host, so I don't get the implicit serialization caused by the memcpy.

1 Answer

CUDA activities (kernel calls, memcopies, etc.) issued to the same stream will be serialized.

When you don't use streams at all in your application, everything you are doing is in the default stream.

Therefore, in your case, there is no functional difference between:

calcFlux<<< numBlocks, numThreads >>>(concs, temp2);        //temp2 calculated from concs
cudaDeviceSynchronize();
calcMonomerFlux(temp2, temp1);                              //temp1 calculated from temp2

and:

calcFlux<<< numBlocks, numThreads >>>(concs, temp2);        //temp2 calculated from concs
calcMonomerFlux(temp2, temp1);                              //temp1 calculated from temp2

You don't show what calcMonomerFlux is, but assuming it uses data from temp2 and does its calculations on the host, it must be using cudaMemcpy to grab the temp2 data before it actually uses it. Since that cudaMemcpy is issued to the same stream as the preceding kernel call (calcFlux), it is serialized, i.e. it will not begin until calcFlux is done. Whatever code in calcMonomerFlux depends on the temp2 data presumably executes after the cudaMemcpy, which is a blocking operation, so it will not begin executing until the cudaMemcpy is done.

Even if calcMonomerFlux contains kernels that operate on temp2 data, the argument is the same. Those kernels are presumably issued to the same stream (default stream) as calcFlux, and therefore will not begin until calcFlux is complete.
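
For illustration, if calcMonomerFlux were a host-side wrapper along these lines (purely hypothetical names and sizes, since you haven't shown the real code), every operation it issues lands in the default stream and is therefore already ordered after calcFlux:

#include <cuda_runtime.h>

#define ML 256                                  // assumed array length

// Hypothetical kernel standing in for whatever calcMonomerFlux launches.
__global__ void monomerKernel(double *temp2, double *temp1) { /* ... */ }

void calcMonomerFlux(double *temp2, double *temp1)
{
    // If it copies temp2 to the host first: the cudaMemcpy goes into the
    // default stream, so it cannot start until the preceding calcFlux kernel
    // has finished, and it blocks the host thread until the copy is done.
    double host2[ML];
    cudaMemcpy(host2, temp2, ML * sizeof(double), cudaMemcpyDeviceToHost);

    // ... host-side work on host2 ...

    // If instead it launches a kernel: that kernel is also issued to the
    // default stream, so it likewise waits for calcFlux to complete.
    monomerKernel<<<1, ML>>>(temp2, temp1);
}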

So the cudaDeviceSynchronize() call is almost certainly not needed.

Having said that, cudaDeviceSynchronize() by itself should not consume a tremendous amount of overhead. The reason most of your execution time is being attributed to cudaDeviceSynchronize() is that, from the host thread's perspective, this sequence:

calcFlux<<< numBlocks, numThreads >>>(concs, temp2);        //temp2 calculated from concs
cudaDeviceSynchronize();

spends almost all its time in the cudaDeviceSynchronize() call. The kernel call is asynchronous, meaning it launches the kernel and then immediately returns control to the host thread, allowing the host thread to continue. Therefore the overhead in the host thread for a kernel call may be as low as a few microseconds. But the cudaDeviceSynchronize() call will block the host thread until the preceding kernel call completes. The longer your kernel executes, the more time the host thread spends waiting at the cudaDeviceSynchronize() call. So nearly all your host thread execution time appears to be spent on these calls.
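
You can see this attribution with an ordinary host timer. In a sketch like the following (a deliberately slow, made-up kernel; the numbers are only illustrative), essentially all of the measured wall time shows up on the cudaDeviceSynchronize() line rather than on the launch:

#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void busyKernel(float *data, int n)
{
    // Deliberately long-running so the timing difference is obvious.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 100000; k++)
            data[i] = data[i] * 1.0000001f + 0.5f;
}

int main()
{
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    auto t0 = std::chrono::high_resolution_clock::now();
    busyKernel<<<(n + 255) / 256, 256>>>(d, n);   // returns almost immediately
    auto t1 = std::chrono::high_resolution_clock::now();
    cudaDeviceSynchronize();                      // host thread waits here
    auto t2 = std::chrono::high_resolution_clock::now();

    printf("launch: %lld us\n", (long long)
           std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
    printf("sync:   %lld us\n", (long long)
           std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count());

    cudaFree(d);
    return 0;
}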

For properly written single-threaded, single (default) stream CUDA codes, cudaDeviceSynchronize() is almost never needed in the host thread. It may be useful for certain types of debugging/error checking, and it may be useful when you have a kernel executing and want to see its printout (printf) before your application terminates.
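
For example, a debugging pattern along these lines (just a sketch) uses cudaDeviceSynchronize() both to surface asynchronous kernel execution errors and to flush in-kernel printf output before the program exits:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void debugKernel()
{
    // Device-side printf output is buffered and is flushed at, among other
    // points, cudaDeviceSynchronize().
    printf("hello from thread %d\n", threadIdx.x);
}

int main()
{
    debugKernel<<<1, 4>>>();

    // Catches errors detected at launch time (bad configuration, etc.)...
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch error: %s\n", cudaGetErrorString(err));

    // ...and blocks until the kernel finishes, reporting any asynchronous
    // execution errors and flushing the device-side printf buffer.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("kernel error: %s\n", cudaGetErrorString(err));

    return 0;
}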

Robert Crovella
  • It's worth adding that `cudaDeviceSynchronize()` is also required when using CUDA features that give you a single pointer to data, such as Unified Memory. If you want to use a managed pointer to access data on the host after altering it on the device, you must synchronize first. – Jez Aug 15 '14 at 19:37
  • Good point. Unified Memory is one example, and zero-copy memory is another. – Robert Crovella Aug 15 '14 at 19:39
  • Here's the thing, and I was not clear enough about it: this code is all running on the GPU. I'll edit to make it clearer. – Hair of Slytherin Aug 15 '14 at 20:14
  • Yes, in that case, the `cudaDeviceSynchronize()` calls are probably necessary. – Robert Crovella Aug 15 '14 at 21:02
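
Since the launches in the question happen inside the program<<<1,1>>> kernel, this is CUDA dynamic parallelism, and a device-side cudaDeviceSynchronize() between the child launches is what guarantees the parent thread sees a child grid's results before using them. A rough sketch of that situation follows (kernel bodies omitted; parameters and launch configuration assumed; it requires compute capability 3.5+ and relocatable device code, and device-side cudaDeviceSynchronize() has been deprecated in newer CUDA releases):

__global__ void calcFlux(double *concs, double *temp2) { /* ... */ }
__global__ void calcK(double *k1s, double *temp2)      { /* ... */ }
__device__ void zeroTemps()                            { /* ... */ }
__device__ void calcMonomerFlux(double *temp2, double *temp1) { /* ... */ }

__global__ void program(double *concs, double *temp1, double *temp2, double *k1s)
{
    const int numBlocks = 64, numThreads = 256;   // assumed launch configuration

    for (int i = 0; i < 10000; i++)
    {
        zeroTemps();
        // Child kernel launches are asynchronous with respect to the launching
        // (parent) thread, so the parent must synchronize before it uses the
        // child grid's output.
        calcFlux<<<numBlocks, numThreads>>>(concs, temp2);
        cudaDeviceSynchronize();                  // temp2 now complete
        calcMonomerFlux(temp2, temp1);            // runs in the parent thread
        calcK<<<numBlocks, numThreads>>>(k1s, temp2);
        cudaDeviceSynchronize();                  // k1s complete before the next step
        // takeStep(...) would follow here
    }
}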