I'm profiling my CUDA application, and I've come across something that I don't understand about the "Control Flow Divergence" metric that is present in the Visual Profiler.
According to the User Guide:
Control flow divergence gives the percentage of thread instructions that were not executed by all threads in the warp, hence causing divergence.
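To make sure I'm reading that definition correctly: as I understand it, the metric should be nonzero whenever only some lanes of a warp execute an instruction. A hypothetical kernel (names are mine, not from my actual code) where I would expect that to happen:

```cuda
// Sketch of a clearly divergent kernel, as I understand the definition:
// lanes where in[i] < 0 execute the branch body, the rest are masked off,
// so the branch-body instructions are "not executed by all threads in the warp".
__global__ void divergent(const int *in, int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int v = in[i];
    if (v < 0) {   // lanes disagree here -> warp diverges
        v = -v;    // executed only by lanes where v < 0
    }
    out[i] = v;
}
```

In my kernel below, however, no lane ever takes the branch, which is why the nonzero metric surprises me.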
I've got the following code in my CUDA kernel:
int var;
var = tex2D(texture, x, y); // texture fetch
if (var < 0) {
    var *= -1;
    results[(blockIdx.x * blockDim.x) + threadIdx.x] = var; // write to global memory
}
Here's what happens: not a single thread enters the branch (I verified the values in global memory), yet the profiler reports a control flow divergence of 34%. If I insert a printf into that same branch, the value jumps to 43% (and, oddly, the execution time increases as well), even though nothing appears on stdout. Does this mean the metric takes into account all of the kernel's instructions, even ones not executed by any thread (and therefore not actually causing warp divergence)?
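Concretely, the printf variant is the same branch with one added line (the exact format string doesn't matter; this one is just an example):

```cuda
if (var < 0) {
    var *= -1;
    printf("negative value\n"); // never reached, yet the reported divergence rises to 43%
    results[(blockIdx.x * blockDim.x) + threadIdx.x] = var;
}
```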
In both cases the Divergent Branches metric is 0%.