
I understand the concept of unrolling loops; however, can someone explain to me how to unroll a simple loop?

It would be great if you would show me a loop, and then an unrolled version of that loop with explanations of what is happening.

J-Rock
  • Have you read the Wikipedia page? There are some good `c` examples there. The basic idea is that an unrolled loop spends a larger share of its cycles reading/writing the data you care about, and less on the overhead of checking/updating the condition variable(s) governing the loop, and therefore gives you better performance: https://en.wikipedia.org/wiki/Loop_unrolling – yano Apr 13 '16 at 04:12
  • See also: http://stackoverflow.com/a/514289/5399734 – nalzok Apr 13 '16 at 05:22

3 Answers


I think it's important to clarify when loop unrolling is most effective: with dependency chains. A dependency chain is a series of operations where each calculation depends on the previous calculation. For example, the following loop has a dependency chain.

for(i=0; i<n; i++) sum += a[i];

Most modern processors can execute multiple operations out of order each clock cycle, which increases instruction throughput. However, out-of-order execution can't help inside a dependency chain: in the loop above, each addition has to wait for the result of the previous one, so the loop is bound by the latency of the addition operation.

We can unroll the loop above into two dependency chains like this:

sum1 = 0, sum2 = 0;
for(i=0; i<n/2; i++) sum1 += a[2*i], sum2 += a[2*i+1];
for(i=(n/2)*2; i<n; i++) sum += a[i]; // clean up for n odd
sum += sum1 + sum2;

Now an out-of-order processor can work on each chain independently and, depending on the processor, simultaneously.

In general you should unroll by an amount equal to the latency of the operation times the number of those operations that can be done per clock cycle. For example, an x86_64 processor can perform at least one SSE addition per clock cycle, and the SSE addition has a latency of 3, so you should unroll three times. A Haswell processor can do two FMA operations per clock cycle, and each FMA operation has a latency of 5, so you would need to unroll 10 times to get the maximum throughput.
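
For illustration, here is roughly what that rule looks like applied to the summation loop above with three accumulators (a sketch only; the function name is made up, and the right unroll factor depends on the add latency and throughput of your target CPU):

float sum_unrolled3(const float *a, int n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f;
    int i;
    // three independent accumulators -> three separate dependency chains
    for (i = 0; i + 2 < n; i += 3) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
    }
    // clean up the 0-2 leftover elements
    for (; i < n; i++) s0 += a[i];
    return s0 + s1 + s2;
}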

As far as compilers go, GCC does not unroll dependency chains (even with -funroll-loops); you have to unroll them yourself with GCC. Clang unrolls four times, which is generally pretty good (in some cases on Haswell and Broadwell you would need to unroll 10 times, and with Skylake 8 times).


Another reason to unroll is when the number of operations in a loop exceeds the number of instructions that can be pushed through per clock cycle. For example, in the following loop

for(i=0; i<n; i++) b[i] += 3.14159*a[i];

there is no dependency chain, so out-of-order execution is not a problem. But let's consider an instruction set that needs the following operations per iteration:

2 SIMD load
1 SIMD store
1 SIMD multiply
1 SIMD addition
1 scalar addition for the loop counter
1 conditional jump

Let's also assume the processor can push through five of these instructions per cycle. In this case there are seven instructions per iteration but only five can be done per cycle. Loop unrolling can then be used to amortize the cost of the scalar addition to the counter i and the conditional jump. For example, if you fully unrolled the loop, these instructions would not be necessary at all.
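
As a rough illustration, a 4x manual unroll of that loop might look like the sketch below (the function name is made up, and a real compiler would likely vectorize this as well); the counter update and the jump are now shared by four iterations:

void scale_add_unrolled4(double *b, const double *a, int n) {
    int i;
    // one counter update and one conditional jump now cover four elements
    for (i = 0; i + 3 < n; i += 4) {
        b[i]     += 3.14159 * a[i];
        b[i + 1] += 3.14159 * a[i + 1];
        b[i + 2] += 3.14159 * a[i + 2];
        b[i + 3] += 3.14159 * a[i + 3];
    }
    // clean up the leftover elements
    for (; i < n; i++) b[i] += 3.14159 * a[i];
}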

For amortizing the cost of the loop counter and jump, -funroll-loops works fine with GCC. It unrolls eight times, which means the counter addition and jump only have to be done once every eight iterations instead of every iteration.

Z boson

The process of unrolling loops utilizes an essential concept in computer science: the space-time tradeoff, where increasing the space used can often lead to decreasing the time of an algorithm.

Let's say we have a simple loop,

const int n = 1000;

for (int i = 0; i < n; ++i) {
    foo();
}

This compiles to assembly that looks something like this:

mov eax, 0

loop:

call foo
inc eax
cmp eax, 1000
jne loop

So the space-time trade-off is 5 lines of assembly for ~(4 * 1000) = ~4000 instructions executed.

So, let's try and unroll the loop a bit.

for (int i = 0; i < n; i += 10) {
    foo();
    foo();
    foo();
    foo();
    foo();
    foo();
    foo();
    foo();
    foo();
    foo();
}

And its assembly:

mov eax, 0

loop:

call foo
call foo
call foo
call foo
call foo
call foo
call foo
call foo
call foo
call foo
add eax, 10
cmp eax, 1000
jne loop

The space-time trade-off is 14 lines of assembly for ~(13 * 100) = ~1300 instructions executed (the 13 instructions inside the loop now run 100 times instead of 1000).

We can do a total unrolling, like this:

foo();
foo();
// ...
// 996 foo()'s
// ...
foo();
foo();

Which compiles to 1000 call instructions in assembly.

This gives a space-time trade-off of 1000 lines of assembly for 1000 instructions.

As you can see, the general trend is that to reduce the number of instructions executed by the CPU, we must increase the space required.

It is not efficient to totally unroll a loop, as the space required becomes extremely large. Partial unrolling gives huge benefits with greatly diminishing returns the more you unroll the loop.

While it's a good idea to understand loop unrolling, keep in mind that compilers are smart and will often do it for you.

Deus Sum
  • You should *not* assume your compiler will unroll the loop. In dependency chains GCC will not unroll the loop, and even if you use `-funroll-loops` it does not break the dependency chain. Manual loop unrolling is still useful in optimization. – Z boson Apr 13 '16 at 06:22
  • Nice answer. I wonder if *loop-unrolling* is a common feature that I can expect in most of the compiled languages or just in C & related. Also, I guess unrolling is not possible with interpreted languages, right? (sorry if I diverge a little in off-topic) – Kamafeather Sep 12 '18 at 14:56

Rolled (regular):

#define N 44

int main() {
    int A[N], B[N];
    int i;

    // fill A with stuff ...

    for(i = 0; i < N; i++) {
        B[i] = A[i] * (100 % (i + 1));  // i + 1 avoids a modulo by zero on the first iteration
    }

    // do stuff with B ...
}

Unrolled:

#define N 44

int main() {
    int A[N], B[N];
    int i;

    // fill A with stuff ...

    for(i = 0; i < N; i += 4) {
        B[i]   = A[i]   * (100 % (i + 1));
        B[i+1] = A[i+1] * (100 % (i + 2));
        B[i+2] = A[i+2] * (100 % (i + 3));
        B[i+3] = A[i+3] * (100 % (i + 4));
    }

    // do stuff with B ...
}

Unrolling can potentially increase performance at the cost of a larger program size. The performance increase can come from a reduction in branch penalties, cache misses, and the number of instructions executed. Some disadvantages are obvious, like an increase in the amount of code and a decrease in readability, and some are not so obvious.

Jonny Henly