5

I understand that branching in CUDA is not recommended as it can adversely affect performance. In my work, I find myself having to implement large switch statements that contain upward of a few dozen cases.

Does anyone have any idea how badly this will affect performance? (The official documentation isn't very specific.) Also, does anyone have a more efficient way of handling this portion?

Bart
gamerx
  • 1
    It depends on whether every thread in a warp typically takes a different path through the switch, or whether you have e.g. one switch case that is much more common than the rest. The pattern will determine how much warp divergence you get. – Paul R Jun 25 '12 at 08:28
  • 1
    As far as I know, switches are usually mapped to short jump instructions, not to a sequence of ifs. In C++, a switch is usually more efficient than multiple if statements. I think you need to investigate how CUDA maps these instructions in this case before making a final suggestion. – geek Jun 25 '12 at 08:33
  • 1
    What is inside each `case` statement? Is it a one-liner? Or several lines of code? Posting an example will probably get you more specific answers. – Pedro Jun 25 '12 at 09:01
  • @Pedro In each case, there will be several lines of code. Bitwise operations will be done to check the values of some bitstrings, and some variables will be updated. – gamerx Jun 25 '12 at 09:18

2 Answers

9

The GPU runs threads in groups of 32, called warps. Whenever different threads in a warp go through different paths in the code, the GPU has to run the entire warp multiple times, once for each code path.

To deal with this issue, called warp divergence, you want to arrange your threads so that the threads in a given warp go through as few different code paths as possible. When you have done that, you pretty much just have to bite the bullet and accept the loss in performance caused by any remaining warp divergence. In some cases, there might not be anything you can do to arrange your threads. If so, and if the different code paths are a big part of your kernel or overall workload, the task may not be a good fit for the GPU.
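One possible way to arrange the threads is to group the work items by the value they switch on before launching the kernel, so that neighbouring threads (and therefore most warps) take the same case. The following is only a sketch under assumptions: the kernel `process`, the helper `group_by_case`, the four opcodes and their bit operations are hypothetical stand-ins, and the sort itself costs time that has to be weighed against the divergence it removes.

```cuda
#include <cstdio>
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

// Hypothetical kernel: one work item per thread, dispatched on its opcode.
__global__ void process(const int* opcode, unsigned* value, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    switch (opcode[i]) {
        case 0:  value[i] |= 0x1u;     break;
        case 1:  value[i] &= ~0x2u;    break;
        case 2:  value[i] ^= 0x4u;     break;
        default: value[i] = ~value[i]; break;
    }
}

// Reorder the items so that equal opcodes are contiguous. The values move
// together with their opcodes; the original item order is lost, so carry an
// index along as well if the order matters.
void group_by_case(thrust::device_vector<int>& opcode,
                   thrust::device_vector<unsigned>& value)
{
    thrust::sort_by_key(opcode.begin(), opcode.end(), value.begin());
}

int main()
{
    const int n = 1 << 16;
    std::vector<int> h_op(n);
    for (int i = 0; i < n; ++i) h_op[i] = i % 4;  // worst case: neighbours differ

    thrust::device_vector<int> d_op(h_op.begin(), h_op.end());
    thrust::device_vector<unsigned> d_val(n, 0u);

    group_by_case(d_op, d_val);

    process<<<(n + 255) / 256, 256>>>(thrust::raw_pointer_cast(d_op.data()),
                                      thrust::raw_pointer_cast(d_val.data()), n);
    cudaDeviceSynchronize();

    int first = d_op[0];
    printf("first opcode after grouping: %d\n", first);
    return 0;
}
```

After grouping, only the warps that straddle a boundary between two opcodes still diverge.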

It doesn't matter how you implement the different code paths. if-else, switch, predication (in PTX or SASS), branch tables or anything else -- if it comes down to the threads in a warp running in different paths, you get a hit on performance.

It also doesn't matter how many threads go through each path, just the total number of different paths in the warp.
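To get a rough idea of how much divergence a given data set would cause, one could count the number of different cases present in each warp. The sketch below does this with `__ballot_sync`, which assumes CUDA 9 or later; the names `count_paths`, `paths_per_warp`, `kNumCases` and `kBlock` are hypothetical, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kNumCases = 4;   // hypothetical: number of cases in the switch
constexpr int kBlock    = 128; // must be a multiple of the warp size (32)

// Writes, for each warp, how many different cases its lanes would take.
__global__ void count_paths(const int* selector, int* paths, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // No early return: every lane must reach __ballot_sync with the full mask.
    int sel = (i < n) ? selector[i] : -1;
    int distinct = 0;
    for (int c = 0; c < kNumCases; ++c)
        if (__ballot_sync(0xffffffffu, sel == c) != 0u) ++distinct;
    if ((threadIdx.x & 31) == 0) paths[i / 32] = distinct;
}

// Host wrapper (assumes a non-empty input): one entry per warp that holds data.
std::vector<int> paths_per_warp(const std::vector<int>& selector)
{
    int n      = static_cast<int>(selector.size());
    int blocks = (n + kBlock - 1) / kBlock;
    int warps  = blocks * kBlock / 32;

    int* d_sel   = nullptr;
    int* d_paths = nullptr;
    cudaMalloc(&d_sel, n * sizeof(int));
    cudaMalloc(&d_paths, warps * sizeof(int));
    cudaMemcpy(d_sel, selector.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    count_paths<<<blocks, kBlock>>>(d_sel, d_paths, n);

    std::vector<int> out(warps);
    cudaMemcpy(out.data(), d_paths, warps * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_sel);
    cudaFree(d_paths);
    out.resize((n + 31) / 32);
    return out;
}

int main()
{
    std::vector<int> selector(64);
    for (int i = 0; i < 32; ++i)  selector[i] = 2;      // first warp: one case
    for (int i = 32; i < 64; ++i) selector[i] = i % 4;  // second warp: all cases

    std::vector<int> paths = paths_per_warp(selector);
    for (size_t w = 0; w < paths.size(); ++w)
        printf("warp %zu: %d path(s)\n", w, paths[w]);
    return 0;
}
```

A warp that reports one path runs the switch body once; a warp that reports four runs it roughly four times, regardless of how the 32 lanes are split between the cases.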

Here is another answer on this that goes into a bit more detail.

Community
Roger Dahl
2

A good way to avoid multiple switches is to implement a function table and select a function from the table by an index based on your switch condition. CUDA allows you to use function pointers to __device__ functions in kernels.
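A minimal sketch of such a table might look like the following. The three operations, the `op_fn` type and the helper `run_op` are hypothetical placeholders, and the table is built inside the kernel because the address of a `__device__` function has to be taken in device code. Note that this only changes how the dispatch is written: threads in a warp that pick different table entries still diverge.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

typedef unsigned (*op_fn)(unsigned);

// Hypothetical per-case operations.
__device__ unsigned set_bit0(unsigned v)   { return v | 0x1u; }
__device__ unsigned clear_bit1(unsigned v) { return v & ~0x2u; }
__device__ unsigned flip_bit2(unsigned v)  { return v ^ 0x4u; }

__global__ void apply(const int* opcode, unsigned* value, int n)
{
    // Table built in device code, where taking these addresses is valid.
    const op_fn table[] = { set_bit0, clear_bit1, flip_bit2 };
    const int table_size = sizeof(table) / sizeof(table[0]);

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int op = opcode[i];
    if (op >= 0 && op < table_size)      // ignore out-of-range opcodes
        value[i] = table[op](value[i]);  // indirect call replaces the switch
}

// Host helper for a single item; error checking omitted for brevity.
unsigned run_op(int op, unsigned v)
{
    int* d_op = nullptr;
    unsigned* d_v = nullptr;
    cudaMalloc(&d_op, sizeof(int));
    cudaMalloc(&d_v, sizeof(unsigned));
    cudaMemcpy(d_op, &op, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_v, &v, sizeof(unsigned), cudaMemcpyHostToDevice);

    apply<<<1, 32>>>(d_op, d_v, 1);

    cudaMemcpy(&v, d_v, sizeof(unsigned), cudaMemcpyDeviceToHost);
    cudaFree(d_op);
    cudaFree(d_v);
    return v;
}

int main()
{
    printf("op 0 on 0x0 -> 0x%x\n", run_op(0, 0x0u));
    printf("op 2 on 0x4 -> 0x%x\n", run_op(2, 0x4u));
    return 0;
}
```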

geek
  • Thanks, that sounds like exactly what I need! – gamerx Jun 25 '12 at 09:47
  • Stupid question perhaps, but what effect would this have on performance? This would only circumvent the need to write a large switch-case statement, unless I'm missing the point... – Bart Jun 25 '12 at 10:17
  • 8
    How does this help to reduce warp divergence and associated performance penalties ? – Paul R Jun 25 '12 at 12:23
  • @PaulR: It doesn't. Choosing the path is simply reduced to a function call. If the algorithm is constructed so that the threads in a warp need to call different functions, the divergence is unavoidable. Divergence can only be avoided by reducing the algorithm to a non-branching form. – geek Jun 25 '12 at 12:32
  • 2
    Compilers have many, many implementation options that they can employ in translating switch statements to executable code: everything from the machine equivalent of a series of if/then/else clauses to a computed branch or lookup table with range checking, or a combination of more than one of those strategies. So before anyone can speculate as to how CUDA hardware will handle your switch statements, we need to know more about what the switch statements are trying to accomplish. – ArchaeaSoftware Jun 25 '12 at 15:41
  • @ArchaeaSoftware Where might I find these compiler options? – gamerx Jun 26 '12 at 02:18
  • 3
    He means the compiler may choose to implement a switch statement in many ways. He's not talking about compiler command line options. I don't think that the suggested solution of using a function table is really going to help -- you still need the branching. Besides, you could implement the function table with a class hierarchy with inheritance and virtual functions -- then the compiler would do it all for you. That doesn't mean it would be faster though. – harrism Jun 26 '12 at 05:26
  • @geek Can you give me examples ? – Omer Belal Jul 03 '17 at 13:31