I would like to understand how to convert basic C/C++ loops into a CUDA kernel. To keep it simple:
for (int i = 0; i < MAXi; i++) {
    for (int j = 0; j < MAXj; j++) {
        // ...code that uses i and j...
    }
}
For every value of i, MAXj elements need to be computed. This may be very basic for some people, but I am really struggling with it. Let's say MAXj is around a million, MAXj = 1000000; that is the dimension where we want all the threads to work. So far I have only been successful at parallelizing the inner loop, using a global thread index like:
int tid = threadIdx.x + blockDim.x * blockIdx.x + blockDim.x * gridDim.x * blockIdx.y;
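For reference, the working inner-loop version looks roughly like this (the kernel name, the out/d_out array, the dummy body, and the grid/block sizes are just placeholders for my real code):

__global__ void innerLoopKernel(float *out, int i, int MAXj)
{
    // Global thread index over a 2D grid of 1D blocks, same formula as above.
    int tid = threadIdx.x + blockDim.x * blockIdx.x + blockDim.x * gridDim.x * blockIdx.y;
    if (tid < MAXj) {
        // ...code that uses i and tid (tid plays the role of j)...
        out[(size_t)i * MAXj + tid] = (float)(i + tid);  // dummy work
    }
}

// Host side (sketch): the outer loop stays on the CPU and launches the kernel once per i.
dim3 block(256);
dim3 grid(2048, 2);   // 2048 * 2 * 256 = 1,048,576 threads >= MAXj
for (int i = 0; i < MAXi; i++)
    innerLoopKernel<<<grid, block>>>(d_out, i, MAXj);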
Using 2D blocks, how can I parallelize this kind of nested loop? Such loops are very common in C, and it would be very useful to learn how to map them onto CUDA.
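To make the question concrete, is something along these lines the right direction? This is only my untested guess (out/d_out, the dummy body, and the block shape are placeholders):

__global__ void nestedLoopKernel(float *out, int MAXi, int MAXj)
{
    // One thread per (i, j) pair: x covers the inner loop, y covers the outer loop.
    int j = threadIdx.x + blockDim.x * blockIdx.x;
    int i = threadIdx.y + blockDim.y * blockIdx.y;
    if (i < MAXi && j < MAXj) {
        // ...code that uses i and j...
        out[(size_t)i * MAXj + j] = (float)(i + j);   // dummy work
    }
}

// Host side (sketch), rounding the grid up so every (i, j) pair is covered.
dim3 block(256, 1);   // placeholder shape; could also be 32x8, etc.
dim3 grid((MAXj + block.x - 1) / block.x, (MAXi + block.y - 1) / block.y);
nestedLoopKernel<<<grid, block>>>(d_out, MAXi, MAXj);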