I have the following nested for loop and I would like to port it to CUDA so that it runs on a GPU:

int current=0;
int ptr=0;

for (int i=0; i < Nbeams; i++){
  for(int j=0;j< NbeamletsPerbeam[i];j++){

     current = j + ptr;

     for(int k=0;k<Nmax;k++){
        ......
     }

     ptr+=NbeamletsPerbeam[i];
 }   
}

I would be very happy if anybody has an idea of how this can be done. We are talking about Nbeams=5, with NbeamletsPerBeam around 200 each.

This is what I currently have but I am not sure it is right...

 for (int i= blockIdx.x; i < d_params->Nbeams; i += gridDim.x){
            for (int j= threadIdx.y; j < d_beamletsPerBeam[i]; j+= blockDim.y){
                 currentBeamlet= j+k;
                 for (int ivoxel= threadIdx.x; ivoxel < totalVoxels; ivoxel += blockDim.x){
Manolete
  • Can you elaborate what the problem is? How many beamlets? Will it fit into per thread cache? – whoplisp Jul 03 '11 at 18:47
  • I would suggest reading a good CUDA tutorial; no-one is going to write your code for you! – Oliver Charlesworth Jul 03 '11 at 18:48
  • @Oli I do not expect someone else to write the code for me; I only want to know how to fit those loops into a GPU kernel. And yes, a good CUDA tutorial would be great, but I have not found any tutorial that covers translating nested for loops – Manolete Jul 03 '11 at 18:51
  • Nested loops? Matrix multiplication comes to mind. http://www.ncsa.illinois.edu/~kindr/projects/hpca/files/NCSA_GPU_tutorial_d3.pdf – whoplisp Jul 03 '11 at 19:03
  • It's not clear to me if you realize that by starting the threads on the GPU you already initiate a number of loops, e.g. one thread for each fragment. The value threadId tells you where you are in the image. – whoplisp Jul 03 '11 at 19:09
  • How big is Nmax? Currently you have parallelism of 5 * 200 * Nmax. If Nmax is small (< 10), then you don't have a ton of parallelism. If it is larger than that, you have a lot more and therefore more flexibility in how you parallelize it. – harrism Jul 04 '11 at 00:00
  • Nmax is around 600, what would you suggest to do instead? – Manolete Jul 04 '11 at 08:40
  • To be honest, I do not think I can achieve good parallelization with the way I have implemented it, since current is the main index used inside the loops; that is what concerns me the most... – Manolete Jul 04 '11 at 08:42

1 Answer

I would suggest this idea. But you might need to do some minor modifications based on your code.

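dim3 blocks(NoOfThreads, 1);   // threads per block
dim3 grid(Nbeams, 1);          // one block per beam

kernel<<<grid, blocks>>>();

__global__ void kernel()
{
   // Number of passes this block needs to cover all beamlets of its beam
   int noOfBlocks = (NbeamletsPerbeam[blockIdx.x] + blockDim.x - 1) / blockDim.x;

   for (int j = 0; j < noOfBlocks; j++) {
     // Beamlet handled by this thread in pass j
     int beamlet = threadIdx.x + j * blockDim.x;
     if (beamlet < NbeamletsPerbeam[blockIdx.x]) {
       current = beamlet + ptr;

       for (int k = 0; k < Nmax; k++) {
          ......
       }

       // Note: ptr is carried over from the serial loop. A running sum cannot be
       // accumulated across threads, so it has to become a precomputed per-beam offset.
       ptr += NbeamletsPerbeam[blockIdx.x];
     }
   }
}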

This should do the trick and give you better parallelization.

veda