
I have the following code using loop unrolling:

#pragma unroll
for (int i=0;i<n;i++)
{
    ....
}

Here, if n is a defined constant, everything works fine. However, if n is a variable, performance drops dramatically: I noticed that roughly 3 times as many instructions are issued and executed. I guess I am looking for a way to do loop unrolling at run time; maybe that's just not feasible.

einpoklum
small_potato

1 Answer


CUDA is a compiled language. Loop unrolling is a compiler optimization. Runtime loop unrolling would imply some sort of runtime interpreter or dynamic code generation. That clearly can't happen.

It would make sense for the unrolled case to execute as many instructions as the naïve loop, or more, because the compiler replaces the loop with repetitions of the loop body. If the unrolled case executes fewer instructions, that implies that the compiler is pre-calculating some or all of the loop contents and replacing code with a constant result.

It all depends on what is contained in the loop.
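If the trip count only ever takes a handful of values, one possible workaround is to make it a template parameter and pick an instantiation at run time. Below is a minimal sketch of that idea, not the asker's actual code: the loop body (a plain sum) and the names `sum_fixed`, `sum_generic`, `sum_dispatch` and the `HD` macro are hypothetical and only there for illustration. The macro lets the same code build as host C++ or under nvcc.

```cpp
// Expands to CUDA qualifiers under nvcc, and to nothing in a plain C++ build.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical loop body: sums the first N elements of x.
// N is a template parameter, so the trip count is a compile-time constant
// and the compiler is free to unroll the loop completely.
template <int N>
HD float sum_fixed(const float* x)
{
    float acc = 0.0f;
#ifdef __CUDACC__
#pragma unroll
#endif
    for (int i = 0; i < N; ++i)
        acc += x[i];
    return acc;
}

// Fallback for trip counts that have no dedicated instantiation.
HD inline float sum_generic(const float* x, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc += x[i];
    return acc;
}

// Runtime dispatch: use a precompiled instantiation when one exists.
HD inline float sum_dispatch(const float* x, int n)
{
    switch (n) {
        case 4:  return sum_fixed<4>(x);
        case 8:  return sum_fixed<8>(x);
        case 16: return sum_fixed<16>(x);
        default: return sum_generic(x, n);
    }
}
```

Each instantiation adds code size, so this probably only pays off when the set of likely values of n is small and known in advance. Whether it actually recovers the constant-n performance still depends on what is in the loop.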

talonmies
  • Even more basic, constants are known at compile time, variables are not. – Albert Perrien Apr 01 '11 at 02:52
  • Loop unrolling can and does occur with the open64 compiler even when the trip count of the loop is not known at compile time, and it is often an optimization to do it, because it can increase instruction-level parallelism. The "mystery" here is the volume of instructions in the non-constant trip count case, and that must be due to code substitution rather than loop unrolling in the constant trip count case. – talonmies Apr 01 '11 at 06:40
  • I'll agree with you on the code substitution. I guess I missed out with an assumption that the statements in the loop were not independent. As you said, it all depends on what's in the loop. – Albert Perrien Apr 01 '11 at 18:05
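
For reference, unrolling with a trip count that is only known at run time usually means partial unrolling: a main loop that handles a fixed number of elements per iteration, plus a remainder loop. The sketch below does this by hand with a factor of 4, again on a hypothetical summation body (`sum_unroll4` and the `HD` macro are made-up names, not anything from the question). Under nvcc, `#pragma unroll 4` on the plain loop asks the compiler for a similar transformation.

```cpp
// Expands to CUDA qualifiers under nvcc, and to nothing in a plain C++ build.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical example: sum n floats, where n is known only at run time.
HD inline float sum_unroll4(const float* x, int n)
{
    // Four independent accumulators, so the four additions in one
    // iteration do not depend on each other.
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    int i = 0;

    // Main loop: four elements per iteration.
    for (; i + 3 < n; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }

    // Combine the partial sums, then handle the 0 to 3 leftover elements.
    float acc = (a0 + a1) + (a2 + a3);
    for (; i < n; ++i)
        acc += x[i];
    return acc;
}
```

The independent accumulators are what expose the instruction-level parallelism mentioned in the comment above. They also change the order of the floating-point additions, so the result can differ in the last bits from that of the simple loop.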