Is it efficient to use boolean algebra and a passed argument to implement two versions of a similar kernel in CUDA?

Question

I have the following CUDA kernel, in which a computationally expensive value is computed and then used in two operations.

Occasionally, I would like to run myKernel without operationOne. I know that branching in kernel code is generally discouraged, but if all threads take the same branch, is there still a substantial inefficiency? In other words, is the following a bad idea?

__global__ void myKernel(bool doOpOne, ...) {
    // usefulValue is computed

    if (doOpOne) {
        // perform operation one
    }
    // perform operation two
}

score 5 · Accepted Answer · edited May 23 '17 at 12:07

It's not necessarily a bad idea. Based on what you have shown and your stipulations, there shouldn't be any significant warp divergence that I can see.
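
To illustrate the distinction, here is a minimal sketch contrasting a branch on a kernel argument with a branch on the thread index. The kernel names, the helper runBranchDemo, and the values written are all hypothetical and exist only for illustration. For bodies this small the compiler may well use predication instead of a real branch, so treat the second kernel as a picture of where divergence can come from, not as a measurement.

```cuda
#include <cuda_runtime.h>

// Uniform branch: `flag` is a kernel argument, so every thread in a warp
// evaluates it identically and the whole warp follows one path.
__global__ void uniformBranch(bool flag, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = flag ? 1 : 2;
    }
}

// Potentially divergent branch: the condition depends on the thread index,
// so even and odd lanes of the same warp want different paths, which the
// hardware may have to execute one after the other.
__global__ void divergentBranch(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (threadIdx.x % 2 == 0) {
            out[i] = 1;
        } else {
            out[i] = 2;
        }
    }
}

// Hypothetical host helper: runs one of the two kernels and copies the
// result back into hostOut (error checking omitted for brevity).
void runBranchDemo(bool divergent, bool flag, int *hostOut, int n) {
    int *devOut = nullptr;
    cudaMalloc(&devOut, n * sizeof(int));
    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    if (divergent) {
        divergentBranch<<<blocks, threads>>>(devOut, n);
    } else {
        uniformBranch<<<blocks, threads>>>(flag, devOut, n);
    }
    cudaMemcpy(hostOut, devOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(devOut);
}
```

Your doOpOne case corresponds to the first kernel: the condition is the same for every thread in the launch.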

However, you might also consider templating the kernel on the flag, as described here (I believe your question is very nearly a duplicate of that one, so perhaps this one should be marked as such). For a simple case like this, with only two options, templating lets the compiler generate one specialized kernel per option, so neither version contains a run-time branch on doOpOne.
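
A minimal sketch of the templated approach might look like the following. The parameter list (in, outOne, outTwo, n), the arithmetic standing in for usefulValue and the two operations, and the host wrapper launchMyKernel are assumptions made up for illustration, since the original kernel body is not shown.

```cuda
#include <cuda_runtime.h>

// doOpOne is a compile-time template parameter, so the compiler generates
// two separate kernels: myKernel<true> and myKernel<false>.
template <bool doOpOne>
__global__ void myKernel(const float *in, float *outOne, float *outTwo, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Placeholder for the expensive computation of usefulValue.
    float usefulValue = in[i] * in[i];

    // The condition is a compile-time constant; in myKernel<false> this
    // block is dead code and can be dropped by the compiler.
    if (doOpOne) {
        outOne[i] = usefulValue + 1.0f;  // placeholder for operation one
    }

    outTwo[i] = usefulValue * 2.0f;      // placeholder for operation two
}

// Hypothetical host wrapper: a single branch on the host selects which
// instantiation to launch. All pointers are device pointers.
void launchMyKernel(bool doOpOne, const float *dIn,
                    float *dOutOne, float *dOutTwo, int n) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    if (doOpOne) {
        myKernel<true><<<blocks, threads>>>(dIn, dOutOne, dOutTwo, n);
    } else {
        myKernel<false><<<blocks, threads>>>(dIn, dOutOne, dOutTwo, n);
    }
}
```

The run-time decision still exists, but it moves to the host and is made once per launch instead of once per thread. The cost is that both instantiations are compiled, which grows quickly if you add more boolean options.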
