I have a program that, when profiled with nvprof, reports that ~98% of the execution time is spent in cudaDeviceSynchronize. While thinking about how to optimize the following code, I've come back here to try to confirm my understanding of the need for cudaDeviceSynchronize.
The general layout of my program is thus:
Copy input array to GPU.
program<<<1,1>>>(inputs)
Copy outputs back to host.
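In code, the host side is roughly the following (a sketch only; the array names, sizes, and types here are placeholders, not my actual code):

    float *d_inputs;
    cudaMalloc(&d_inputs, N * sizeof(float));

    // 1. Copy input array to GPU
    cudaMemcpy(d_inputs, h_inputs, N * sizeof(float), cudaMemcpyHostToDevice);

    // 2. Launch the whole simulation as a single master thread
    program<<<1, 1>>>(d_inputs);

    // 3. Copy outputs back to host (blocks until program finishes)
    cudaMemcpy(h_outputs, d_inputs, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_inputs);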
Thus, my program kernel is a master thread that basically looks like this:
for (int i = 0; i < 10000; i++)
{
calcKs(inputs);
takeStep(inputs);
}
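For context, that loop lives inside the device-side master kernel itself, so calcKs and takeStep are device-side functions that launch child kernels (dynamic parallelism). A sketch, with the real arguments omitted:

    __global__ void program(float *inputs)
    {
        // Single master thread (launched as program<<<1,1>>>)
        for (int i = 0; i < 10000; i++)
        {
            calcKs(inputs);    // launches child kernels, shown below
            takeStep(inputs);
        }
    }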
The calcKs function is one of the most egregious abusers of cudaDeviceSynchronize and looks like this:
//Calculate k1's
//Calc fluxes for r = 1->(ml-1), then for r = 0, then calc K's
zeroTemps();
calcFlux<<< numBlocks, numThreads >>>(concs, temp2); //temp2 calculated from concs
cudaDeviceSynchronize();
calcMonomerFlux(temp2, temp1); //temp1 calculated from temp2
cudaDeviceSynchronize();
calcK<<< numBlocks, numThreads >>>(k1s, temp2); //k1s calculated from temp2
cudaDeviceSynchronize();
where the arrays temp2, temp1 and k1s are each calculated from one another's results. My understanding was that cudaDeviceSynchronize was essential because I need temp2 to be completely calculated before temp1 is calculated, and the same for temp1 and k1s.
I feel like I've critically misunderstood the function of cudaDeviceSynchronize after reading this post: When to call cudaDeviceSynchronize?. I'm not sure how pertinent the comments there are to my situation, though, since all of my program runs on the device and there is no CPU-GPU interaction until the final memory copy back to the host; hence I don't get the implicit serialization caused by the memcpy.
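To illustrate what I mean by that implicit serialization, in a purely host-driven setup (variable names here are illustrative, not from my code) a blocking cudaMemcpy issued on the default stream already waits for the preceding kernel, so no explicit sync is needed:

    someKernel<<<numBlocks, numThreads>>>(d_data);
    // No cudaDeviceSynchronize() needed here: the blocking memcpy below
    // does not begin until someKernel has finished (default stream ordering)
    cudaMemcpy(h_data, d_data, numBytes, cudaMemcpyDeviceToHost);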