
So, I have code that uses Kinetic Monte Carlo (KMC) on a lattice in order to simulate something. I am using CUDA to run this code on my GPU (although I believe the same question applies to OpenCL as well).

This means that I divide my lattice into little sub-lattices and each thread operates on one of them. Since I am doing KMC, each thread runs code like this:

    while (condition == true) {
        // grab a sample u from U[0,1]
        for (i = 0; i < 100; i++) {
            // do some stuff here to generate A
            if (A > u) {
                // do more stuff here, which could include updates to global memory
                break;
            }
        }
    }

A is different for different threads and so is u, and 100 is just an arbitrary number; in the real code this could be 1000 or even 10000.
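For concreteness, here is a rough CUDA sketch of what each thread does. The RNG setup and the rate A are placeholders, not my real code:

    #include <curand_kernel.h>

    // Rough sketch only: each thread owns one sub-lattice and runs its own
    // rejection-style KMC loop. The rate A below is a stand-in; in the real
    // code it comes from the state of the sub-lattice.
    __global__ void kmc_sketch(curandState *states, float *lattice, int steps)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        curandState rng = states[tid];            // initialized elsewhere with curand_init

        for (int s = 0; s < steps; ++s) {         // the "while (condition)" above
            float u = curand_uniform(&rng);       // grab a sample u from U[0,1]
            for (int i = 0; i < 100; ++i) {
                float A = (i + 1) / 100.0f;       // placeholder for the real rate
                if (A > u) {                      // the branch I am asking about
                    lattice[tid] += A;            // stand-in for the global-memory update
                    break;
                }
            }
        }
        states[tid] = rng;
    }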

So, won't we have branch divergence when the time comes for a thread to pass through that if? How badly can this affect performance? I know that the answer depends on the code inside the if-clause but how will this scale as I add more and more threads?

Any reference on how I can estimate losses/gains in performance would also be welcome.

Thanks!

Kostis
  • Will different threads have a different number of loop iterations (the 100/1000/10000 you cite)? – Brendan Wood Jun 11 '12 at 13:49
  • @Brendan Wood: No, they will all have the same number, but as soon as a thread enters the if block, that thread will break out of the loop regardless of the value of i. Oh, and then that thread will start again from the start. Perhaps I should edit my code sample to reflect this. – Kostis Jun 11 '12 at 14:18

1 Answer


The GPU runs threads in groups of 32, called warps. Divergence can only happen within a warp. So, if you are able to arrange your threads in such a way that the if condition evaluates the same way across the entire warp, there is no divergence.
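To make that concrete, here is a minimal, hypothetical sketch (not from the original answer): the first branch condition is uniform within each warp, so it does not diverge, while the second differs between adjacent lanes of the same warp and does:

    // Minimal sketch, assuming a 1-D block whose size is a multiple of 32.
    __global__ void divergence_demo(float *out)
    {
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        int lane = threadIdx.x % 32;   // position within the warp
        int warp = threadIdx.x / 32;   // warp index within the block

        // Warp-uniform condition: all 32 threads of a warp take the same path,
        // so there is no divergence even though different warps take different paths.
        if (warp % 2 == 0)
            out[tid] = 1.0f;
        else
            out[tid] = 2.0f;

        // Per-lane condition: odd and even lanes of the same warp take different
        // paths, so the warp executes both branches one after the other.
        if (lane % 2 == 0)
            out[tid] += 10.0f;
        else
            out[tid] += 20.0f;
    }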

When there is divergence in an if, conceptually, the GPU simply ignores the results and memory requests from threads in which the if condition was false.

So, say that the if evaluates to true for 10 of the threads in a particular warp. While inside that if, the potential compute performance of the warp is reduced from 100% to 10 / 32 * 100 ≈ 31%, as the 22 threads that got disabled by the if could have been doing work but are now just taking up room in the warp.

Once exiting the if, the disabled threads are enabled again, and the warp proceeds with a 100% potential compute performance.
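If you want to observe this from inside a kernel, here is a hedged sketch (the __activemask intrinsic requires CUDA 9 or later, so it postdates the original answer): counting the enabled lanes inside the divergent branch reproduces the 10-out-of-32 figure from the example above.

    // Hypothetical sketch; launch a single warp to keep it simple:
    //     divergence_count<<<1, 32>>>(d_count);
    __global__ void divergence_count(int *count)
    {
        int lane = threadIdx.x % 32;

        if (lane < 10) {                        // only 10 of the 32 lanes enter
            unsigned mask = __activemask();     // bitmask of lanes enabled right now
            if (lane == 0)
                *count = __popc(mask);          // writes 10: 10/32 ≈ 31% of the warp
        }
    }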

An if-else behaves in much the same way. When the warp gets to the else, the threads that were enabled in the if become disabled, and the ones that were disabled become enabled.

In a for loop that runs a different number of times for each thread in the warp, threads are disabled as they finish their own iteration counts, but the warp as a whole must keep looping until the thread with the highest iteration count is done.
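A small sketch of that situation (the per-thread trip-count array n_iters is made up for illustration):

    // Sketch: each thread loops a different number of times.
    __global__ void ragged_loop(float *data, const int *n_iters)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int n   = n_iters[tid];                 // differs from thread to thread

        float acc = 0.0f;
        for (int i = 0; i < n; ++i)             // the warp as a whole keeps issuing
            acc += data[tid] * (float)i;        // iterations until max(n) over the warp
                                                // is reached; finished threads sit disabled
        data[tid] = acc;
    }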

When looking at potential memory throughput, things are a little bit more complicated. If an algorithm is memory bound, there might not be much or any performance lost due to warp divergence, because the number of memory transactions may be reduced. If each thread in the warp was reading from an entirely different location in global memory (a bad situation for a GPU), time would be saved for each of the disabled threads, as their memory transactions would not have to be performed. On the other hand, if the threads were reading from an array that had been optimized for access by the GPU, multiple threads would be sharing the results from a single transaction. In that case, the values that were meant for the disabled threads would be read from memory and then discarded, together with the computations the disabled threads could have done.
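The two access patterns described above might look like this in code (a hedged sketch; scatter_idx is a made-up indirection array that produces uncoalesced reads):

    // Sketch: a divergent branch containing two kinds of global reads.
    __global__ void memory_patterns(const float *in, const int *scatter_idx, float *out)
    {
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        int lane = threadIdx.x % 32;

        if (lane < 10) {                        // 10 of 32 lanes active
            // Uncoalesced: each active lane reads an unrelated address, so each
            // read is its own transaction and the disabled lanes genuinely save traffic.
            float a = in[scatter_idx[tid]];

            // Coalesced: adjacent lanes read adjacent addresses, so the transaction
            // covering the warp's range is issued anyway and the values meant for
            // the disabled lanes are fetched and discarded.
            float b = in[tid];

            out[tid] = a + b;
        }
    }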

So, now you probably have enough of an overview to be able to make pretty good judgement calls as to how much warp divergence is going to affect your performance. The worst case is when only a single thread in a warp is active. Then you get 1/32 = 3.125% of the potential compute-bound performance. The best case is 31/32 = 96.875%. For an if that is fully random, you get 50% on average. And as mentioned, memory-bound performance depends on the change in the number of required memory transactions.

Roger Dahl
  • Pretty good answer! Thanks! And yes, we are talking about a fully random if. I guess I should also work on load-balancing my code then, so that most threads enter the if at the same time. – Kostis Jun 11 '12 at 15:39
  • What's said here about the memory pipeline can also be said about the arithmetic pipeline. Also, note that everything said here is about a warp. And divergence can happen inside a block, but outside of some warps, or even more, outside of any warps. So if you have to have divergence, it's in your best interest to break it among warps and not among threads inside the same warp, as much as possible. – Íhor Mé Aug 17 '16 at 17:16
  • @ÍhorMé "divergence can happen inside a block, but outside of some warps, or even more, outside of any warps" In the context of CUDA, I have only seen divergence used to describe differences in program flow within a warp. – Roger Dahl Aug 17 '16 at 21:47
  • Every warp will have to execute the code that tests the divergence condition. Nevertheless, a full warp can stay solid, i.e. not diverge, after that. It's very important to know how warps are formed within the block (like [z][y][x] in C arrays) to be able to implement performant conditional code. – Íhor Mé Aug 19 '16 at 01:02
  • Also, an optimizing compiler can completely "unroll" some if/case logic that does not depend on the thread number or any other externalities, but is rather static. – Íhor Mé Aug 19 '16 at 23:01