
I'm new to CUDA, and I can't understand loop unrolling. I've written a piece of code to understand the technique:

__global__ void kernel(float *b, int size)
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    #pragma unroll
    for(int i=0;i<size;i++)
        b[i]=i;
}

Above is my kernel function. In main I call it like below:

#include <iostream>
#include <conio.h>          // for _getch()
#include <cuda_runtime.h>
using namespace std;

int main()
{
    float * a; //host array
    float * b; //device array
    int size=100;

    a=(float*)malloc(size*sizeof(float));
    cudaMalloc((void**)&b, size*sizeof(float));                   // allocate size floats, not size bytes
    cudaMemcpy(b, a, size*sizeof(float), cudaMemcpyHostToDevice); // copy counts are in bytes

    kernel<<<1,size>>>(b,size); //size=100

    cudaMemcpy(a, b, size*sizeof(float), cudaMemcpyDeviceToHost);

    for(int i=0;i<size;i++)
        cout<<a[i]<<"\t";

    _getch();

    return 0;
}

Does it mean I have size*size = 10000 threads running to execute the program? Are 100 of them created when the loop is unrolled?

Magzhan Abdibayev
  • No. It means you have called a CUDA kernel with one block and that one block has 100 active threads. You're passing `size` as the second function parameter to your kernel. In your kernel each of those 100 threads executes the `for` loop 100 times. I advise you to start learning CUDA from the basics and move gradually, rather than jumping to more advanced or less important material like loop unrolling. – Farzad Mar 09 '14 at 05:46
  • @Farzad, thank you, could you explain what `#pragma unroll` does? I think that I could execute the for loop without using the pragma? – Magzhan Abdibayev Mar 09 '14 at 06:03

1 Answer


No. It means you have called a CUDA kernel with one block and that one block has 100 active threads. You're passing size as the second function parameter to your kernel. In your kernel each of those 100 threads executes the for loop 100 times.
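(As a side note: in the kernel you posted, every one of those 100 threads writes the entire array and tid is never used. If the intent was for each thread to write exactly one element, a common pattern is the following sketch; it is not something from your question, just an illustration.)

__global__ void kernel_one_element_per_thread(float *b, int size)
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid < size)      // guard threads that land past the end of the array
        b[tid] = tid;    // each thread writes exactly one element, no loop needed
}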

#pragma unroll is a compiler optimization that can, for example, replace a piece of code like

for ( int i = 0; i < 5; i++ )
    b[i] = i;

with

b[0] = 0;
b[1] = 1;
b[2] = 2;
b[3] = 3;
b[4] = 4;

by putting the #pragma unroll directive right before the loop. The good thing about the unrolled version is that it puts less processing load on the processor. In the for-loop version, the processing, in addition to assigning each i to b[i], involves initializing i, evaluating i<5 six times, and incrementing i five times. In the unrolled case, it only involves filling in the contents of the b array (perhaps plus int i=5; if i is used later).

Another benefit of loop unrolling is the enhancement of instruction-level parallelism (ILP). In the unrolled version, there are possibly more operations the processor can push into the processing pipeline without worrying about the for-loop condition on every iteration.
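For instance, a small made-up kernel like the one below (the name and the factor of four are only for illustration) gives the compiler four independent loads per thread once the loop is unrolled, which it can issue back to back instead of spending instructions on loop bookkeeping:

__global__ void sum_four(const float *in, float *out)
{
    float acc = 0.0f;
    #pragma unroll
    for (int i = 0; i < 4; i++)          // trip count is a compile-time constant
        acc += in[4 * threadIdx.x + i];  // after unrolling: four independent loads
    out[threadIdx.x] = acc;
}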

Posts like this explain that runtime loop unrolling cannot happen for CUDA. In your case, the CUDA compiler has no clue that size is going to be 100, so compile-time loop unrolling will not occur either, and if you force unrolling you may end up hurting performance.

If you are sure that the size is 100 for all executions, you can unroll your loop like below:

#pragma unroll
for(int i=0;i<SIZE;i++)  //or simply for(int i=0;i<100;i++)
    b[i]=i;

where SIZE is known at compile time, for example via #define SIZE 100.
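Another way to make the trip count known at compile time, if a macro is inconvenient, is a template parameter; this is a general C++/CUDA technique, and the kernel name here is made up:

template <int N>
__global__ void kernel_fixed(float *b)
{
    #pragma unroll
    for (int i = 0; i < N; i++)   // N is a compile-time constant for each instantiation
        b[i] = i;
}

// the trip count is baked in at the call site:
// kernel_fixed<100><<<1, 100>>>(b);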

I also suggest adding proper CUDA error checking to your code (explained here).
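A typical pattern, along the lines of what that link describes (this is a common sketch, not the exact code from the linked answer), is to wrap every runtime API call in a checking macro:

#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// usage, for example:
// CUDA_CHECK(cudaMemcpy(a, b, size * sizeof(float), cudaMemcpyDeviceToHost));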

Farzad
    `#pragma unroll` is also covered in the [programming guide](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#pragma-unroll). – Robert Crovella Mar 09 '14 at 07:10
  • And it's mostly because it kills your parallel computing performance: warps of threads are not parallel when there is a branch condition that lets any thread in the block stray onto a different instruction path, and that defeats the SIMT architecture, which is one instruction being executed by all threads in a warp at the same time and place, i.e. in parallel. – yan bellavance Jul 17 '17 at 00:34
  • @RobertCrovella Why do we need to explicitly add this pragma? Can't the compiler recognize such loops by itself? – z0lupka May 26 '21 at 10:35
  • I didn't say anywhere that you had to explicitly add this pragma. I simply linked to a relevant section in the documentation, so that others might have a documentation reference. If you click on that link and read the very first sentence, you would have an answer to your question. – Robert Crovella May 26 '21 at 14:13