2

I have read that comparisons and branching is slow on GPU. I would like to know how much. (I'm familier with OpenCL, but the question is general also for CUDA, AMP ... )

I would like to know it, before I start to port my code to GPU. In particular I'm interested in finding lowest value in neighborhood ( 4 or 9 nearest neighbors) of each point in 2D array. i.e. something like convolution, but instead of summing and multiplying I need comparisons and branching.

for example code like this ( NOTE: this example code is not yet optimized for GPU to be more readeable ... so partitioning to workgroups, prefeaching of local memory ... is missing )

for(int i=1;i<n-1;i++){ for(int j=1;j<n-1;j++){ // iterate over 2D array
     float hij  = h[i][j];
     int imin = 0,jmin = 0;
     float dh,dhmin=0;
     // find lowest neighboring element h[i+imin][j+jmin] of h[i][j]
     dh = h[i-1][j  ]-hij;  if( dh<dhmin ){  imin = -1; jmin =  0; dhmin = dh; }
     dh = h[i+1][j  ]-hij;  if( dh<dhmin ){  imin = +1; jmin =  0; dhmin = dh; }
     dh = h[i  ][j-1]-hij;  if( dh<dhmin ){  imin =  0; jmin = -1; dhmin = dh; }
     dh = h[i  ][j+1]-hij;  if( dh<dhmin ){  imin =  0; jmin = +1; dhmin = dh; }
     if( dhmin<-0.00001 ){ // if lower
       // ... Do something with hij, dhmin and save to h[i+imin][j+jmin] ...
     }
} }
  • Would it be worth to port to GPU despite a lot of if branching and comparison? ( i.e. if this 4-5 comparisons per elemet would be 10x slower than the same 4-5 comparisons on CPU it would be a bottleneck )
  • is there any optimization trick how to minizmize if branching and comparison slow down?

Which I used in this hydraulic errosion code: http://www.openprocessing.org/sketch/146982

Prokop Hapala
  • 2,424
  • 2
  • 30
  • 59
  • 1
    I would recommend having a look at @talonmies comment to the question [Is Branch Divergence Really So Bad?](http://stackoverflow.com/questions/17223640/is-branch-divergence-really-so-bad) which effectively summarizes the problematic point. – Vitality May 03 '14 at 21:08
  • 1
    For your case, go on, the ifs do not really divert so much. – DarkZeros May 04 '14 at 12:54

2 Answers2

5

Branching itself is not slow. Divergence is what gets you. GPUs compute multiple work items (typ. 16 or 32) in lock-step in "warps" or "wavefronts" and if different work items take different paths they all take all paths but gate writes based on which path they are on (using predicate flags)). So if your work items always (or mostly) branch the same way, you're good. If they don't the penalty can rob performance.

Dithermaster
  • 6,223
  • 1
  • 12
  • 20
1

If you need to do comparison and if the array length 'n' is really big then you can use reduction instead of sequential comparison. Reduction would do comparison in parallel in O (log n) time as opposed to O (n) when done sequentially.

When you access memory sequentially in a GPU thread, the memory accesses are sequential since consecutive blocks are accessed from the same bank. Instead, its good to use coalesced reads. You can find plethora of examples on this.

On GPUs, don't access global memory multiple times (as GPU memory management and caching work not exactly like a CPU). Instead, cache the global memory elements into thread's private variables / shared memory as much as possible.

Yogi Joshi
  • 786
  • 1
  • 6
  • 19
  • hi, yes, as I tried to explain here `( NOTE: this code is just java, it is not yet optimized for GPU )` I know about that memory magement (global/local memory, workgroups, tiles ... ) I just didn't write it here to make the code more readable. Here I' want to focus realy just on the cost of the 4-5 comparisons per element relative to 4-5 floaing point operations in normal convolution – Prokop Hapala May 03 '14 at 18:22