The GPU runs threads in groups of 32, called warps. Whenever threads in the same warp take different paths through the code, the GPU has to run the warp multiple times, once for each code path, with the threads that are not on the current path masked off.
To deal with this issue, called warp divergence, you want to arrange your threads so that the threads in a given warp go through as few different code paths as possible. When you have done that, you pretty much just have to bite the bullet and accept the loss in performance caused by any remaining warp divergence. In some cases, there might not be anything you can do to arrange your threads. If so, and if the different code paths are a big part of your kernel or overall workload, the task may not be a good fit for the GPU.
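To make "arranging your threads" concrete, here is a minimal host-side sketch in plain C++ (not a CUDA kernel). It assumes the simplified cost model used in this answer, where each distinct code path in a warp costs one pass over that warp. The names `interleaved_path`, `warp_aligned_path` and `total_passes` are hypothetical and exist only for illustration.

```cpp
#include <cstddef>
#include <set>
#include <vector>

constexpr std::size_t kWarpSize = 32;

// Hypothetical mapping A: neighbouring threads alternate between two paths,
// so every warp contains both paths.
int interleaved_path(std::size_t tid) {
    return static_cast<int>(tid % 2);
}

// Hypothetical mapping B: the same two paths, but assigned per warp,
// so all 32 threads of a warp agree on the path.
int warp_aligned_path(std::size_t tid) {
    return static_cast<int>((tid / kWarpSize) % 2);
}

// Toy cost model (an assumption, not a hardware measurement):
// a warp needs one pass per distinct path taken by its threads.
// Returns the sum of passes over all warps.
template <typename PathFn>
std::size_t total_passes(std::size_t num_threads, PathFn path_of) {
    std::size_t passes = 0;
    for (std::size_t base = 0; base < num_threads; base += kWarpSize) {
        std::set<int> paths;  // distinct paths seen in this warp
        for (std::size_t tid = base;
             tid < base + kWarpSize && tid < num_threads; ++tid) {
            paths.insert(path_of(tid));
        }
        passes += paths.size();
    }
    return passes;
}
```

With 128 threads (4 warps), `total_passes(128, interleaved_path)` gives 8 passes, while `total_passes(128, warp_aligned_path)` gives 4. The work is the same; only the assignment of threads to paths has changed.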
It doesn't matter how you implement the different code paths: if-else, switch, predication (in PTX or SASS), branch tables or anything else. If it comes down to the threads in a warp running in different paths, you take a hit on performance.
It also doesn't matter how many threads go through each path, just the total number of different paths in the warp.
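As a rough illustration of that last point, the sketch below counts passes for a single warp under the same simplified model (one pass per distinct path, ignoring how long each path is). It is plain host-side C++, and `split_warp` and `passes_for_warp` are made-up helper names, not part of any CUDA API.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

// Builds one warp in which the first `n_on_path_one` threads take path 1
// and the remaining threads take path 0.
std::array<int, 32> split_warp(std::size_t n_on_path_one) {
    std::array<int, 32> paths{};  // all threads start on path 0
    std::fill_n(paths.begin(), std::min<std::size_t>(n_on_path_one, 32), 1);
    return paths;
}

// Toy model (an assumption): the number of passes equals the number of
// distinct path ids in the warp, regardless of how many threads take each.
std::size_t passes_for_warp(std::array<int, 32> paths) {
    std::sort(paths.begin(), paths.end());
    auto last = std::unique(paths.begin(), paths.end());
    return static_cast<std::size_t>(last - paths.begin());
}
```

In this model a 1/31 split and a 16/16 split both cost 2 passes: `passes_for_warp(split_warp(1))` and `passes_for_warp(split_warp(16))` return the same value, and only a warp where all threads agree (`split_warp(0)` or `split_warp(32)`) gets away with 1.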
Here is another answer on this that goes into a bit more detail.