6

Taken from GCC manual:

-funroll-loops
           Unroll loops whose number of iterations can be determined at compile time or upon entry to the loop.
           -funroll-loops implies -frerun-cse-after-loop.  This option makes code larger, and may or may not make it
           run faster.

According to my understanding, unrolling loops will get rid of branching instructions in the resulting code; I presume that is healthier for CPU pipelines.

But why would it "may not make it run faster"?

  • 11
    When there are no loops in your code? –  Jun 17 '13 at 23:01
  • 2
    When the loop body is large, the loop-control overhead is trivial, and the larger code size due to unrolling can cause instruction cache misses. – Daniel Fischer Jun 17 '13 at 23:05
  • Loop unrolling almost always results in slower code in most large applications. It screws with the cache. Of course only a profiler will tell you if it's true for your specific application. – Wiz Jun 17 '13 at 23:05
  • 1
    possible duplicate of [How do optimizing compilers decide when and how much to unroll a loop?](http://stackoverflow.com/questions/7691215/how-do-optimizing-compilers-decide-when-and-how-much-to-unroll-a-loop) – Jim Balter Jun 17 '13 at 23:05

6 Answers

8

First of all, it may not make any difference; if your condition is "simple" and executed many times, the branch predictor should quickly pick it up and correctly predict the branch until the end of the loop, making the "rolled" code run almost as fast as the unrolled code.

Also, on non-pipelined CPUs the cost of a branch is quite small, so such an optimization may not be relevant, and code size considerations may be much more important (e.g. when compiling for a microcontroller - remember that gcc targets range from AVR micros to supercomputers).

Another case where unrolling can't speed up a loop is when the loop body is much slower than the looping itself - if e.g. you have a syscall in the loop body, the loop overhead will be negligible compared to the system call.

As for when it may make your code run slower: making the code bigger can slow it down. If your code no longer fits in the cache/memory page/... you'll take a cache/page/... fault and the processor will have to wait for memory to fetch the code before executing it.
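To make the transformation concrete, here is a hand-written sketch of what full unrolling does for a fixed trip count. The `sum16_*` functions are hypothetical illustrations, not actual GCC output:

```c
/* Rolled: one compare-and-branch per element. */
long sum16_rolled(const long *a)
{
    long s = 0;
    for (int i = 0; i < 16; i++)
        s += a[i];
    return s;
}

/* Fully unrolled: no loop branches left, but roughly 16x the code size. */
long sum16_unrolled(const long *a)
{
    long s = 0;
    s += a[0];  s += a[1];  s += a[2];  s += a[3];
    s += a[4];  s += a[5];  s += a[6];  s += a[7];
    s += a[8];  s += a[9];  s += a[10]; s += a[11];
    s += a[12]; s += a[13]; s += a[14]; s += a[15];
    return s;
}
```

On a well-predicted loop both versions run at nearly the same speed; the unrolled one only wins if the removed branches were actually costing something, and it loses if the extra code evicts something else from the instruction cache.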

Matteo Italia
  • 123,740
  • 17
  • 206
  • 299
1

The answers so far are very good, but I'll add one thing that hasn't been touched on yet: eating up branch predictor slots. If your loop contains a branch, and it's not unrolled, it only consumes one branch predictor slot, so it won't evict other predictions the CPU has made in the outer loops, sister loops, or caller. However, if the loop body is duplicated many times via unrolling, each copy will contain a separate branch which consumes a predictor slot. This kind of performance hit is easily unnoticed because, like cache eviction issues, it will not be visible in most isolated, artificial measurements of the loop's performance. Instead, it will manifest as hurting the performance of other code.

As a great example, the fastest strlen on x86 (even better than the best asm I've seen) is an insanely unrolled loop that simply does:

if (!s[0]) return s-s0;
if (!s[1]) return s-s0+1;
if (!s[2]) return s-s0+2;
/* ... */
if (!s[31]) return s-s0+31;

However, this will tear through branch predictor slots, so for real-world purposes, some sort of vectorized approach is preferable.
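For the curious, here is a simplified sketch of the kind of word-at-a-time alternative meant here, using the classic SWAR zero-byte test. `strlen_word` is a hypothetical name, 32-bit words are assumed, and real implementations handle aliasing more carefully:

```c
#include <stdint.h>
#include <stddef.h>

/* Scan one 32-bit word per step instead of one unrolled branch per byte.
   (x - 0x01010101) & ~x & 0x80808080 is nonzero iff x contains a zero byte. */
size_t strlen_word(const char *s0)
{
    const char *s = s0;
    /* Go byte-by-byte until s is word-aligned. */
    while ((uintptr_t)s % sizeof(uint32_t) != 0) {
        if (*s == '\0')
            return (size_t)(s - s0);
        s++;
    }
    const uint32_t *w = (const uint32_t *)s; /* illustration only: real code
                                                avoids this aliasing cast */
    while (!((*w - 0x01010101u) & ~*w & 0x80808080u))
        w++;
    /* A zero byte is somewhere in *w; locate it. */
    s = (const char *)w;
    while (*s != '\0')
        s++;
    return (size_t)(s - s0);
}
```

Aligned word reads past the terminator never cross a page boundary, which is why real implementations get away with reading slightly beyond the string.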

R.. GitHub STOP HELPING ICE
  • 208,859
  • 35
  • 376
  • 711
1

I don't think it's common to fill an unrolled loop with conditional exits. That breaks most of the instruction scheduling which unrolling allows. What's more common is to check beforehand that the loop has at least n iterations remaining before entering into the unrolled section.

To achieve this the compiler may generate elaborate preamble and postamble to align the loop data for better vectorisation or better instruction scheduling, and to handle the remainder of the iterations which do not divide evenly into the unrolled section of the loop.

It can turn out (worst possible case) that the loop only runs zero or one time, or maybe twice in exceptional circumstances. Then only a small part of the loop would be executed, but many extra tests would be performed to get there. Worse, the alignment preamble might mean that different branch conditions occur in different calls, causing additional branch misprediction stalls.

These are all meant to cancel out over a large number of iterations, but for short loops this doesn't happen.

On top of this, you have the increased code size, where all of these unrolled loops together contribute to reducing icache efficiency.

And some architectures special-case very short loops to use their internal buffers without even referring to the cache.

And modern architectures have fairly extensive instruction reordering, even around memory accesses, which means that the compiler's reordering of the loop might offer no additional benefits even in the best case.
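A sketch of the preamble/postamble shape described above, on a hypothetical `scale` function; the aligned middle section stands in for wherever the compiler would place its unrolled or vectorised code:

```c
#include <stdint.h>
#include <stddef.h>

void scale(float *p, size_t n, float k)
{
    size_t i = 0;
    /* Preamble: peel iterations one at a time until p+i is 16-byte aligned. */
    while (i < n && ((uintptr_t)(p + i) & 15) != 0) {
        p[i] *= k;
        i++;
    }
    /* Main section: only entered while at least 4 iterations remain;
       this is where the unrolled/vectorised code would go. */
    for (; i + 4 <= n; i += 4) {
        p[i]     *= k;
        p[i + 1] *= k;
        p[i + 2] *= k;
        p[i + 3] *= k;
    }
    /* Postamble: the remainder that doesn't divide evenly by 4. */
    for (; i < n; i++)
        p[i] *= k;
}
```

Notice how a call with n = 1 or 2 pays for three loop structures' worth of tests while executing almost none of the "fast" code - exactly the short-loop pathology the answer describes.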

sh1
  • 4,324
  • 17
  • 30
0

For example, the unrolled loop body may become larger than the instruction cache. Fetching the code from memory instead is obviously slower.

milleniumbug
  • 15,379
  • 3
  • 47
  • 71
0

Say you have a loop with 25 instructions that iterates 1000 times. Fully unrolling it would produce 25,000 instructions, and the extra resources required to handle them could very well outweigh the pain caused by branching.

It is also important to note that many kinds of looping branches are very painless, as the CPU has gotten quite good at branch prediction for simpler situations. For instance, a loop of 8 iterations is probably more efficient unrolled, but even one of 50 is probably better left to the CPU. Note that the compiler is probably better at guessing which is superior than you are.
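If you do want to override the compiler's guess for one particular loop, GCC 8 and later accept a per-loop pragma - a hint, not a guarantee - shown here on a hypothetical dot product:

```c
/* Ask GCC to unroll this specific loop 8x; other loops are unaffected. */
long dot8(const int *a, const int *b)
{
    long s = 0;
    #pragma GCC unroll 8
    for (int i = 0; i < 8; i++)
        s += (long)a[i] * b[i];
    return s;
}
```

This is usually a better tool than the global `-funroll-loops`, since it limits the code-size cost to the one loop you have actually profiled.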

Guvante
  • 18,775
  • 1
  • 33
  • 64
-1

Unrolling loops should always make the code faster. The trade-off is between faster code and larger code footprint. Tight loops (relatively small amounts of code executed in the body of the loop) which are executed a significant number of times can benefit from unrolling by removing all the loop overhead, and allowing the pipelining to do its thing. Loops which go through many iterations may unroll to a large amount of extra code - faster but maybe unacceptably larger footprint for the performance gain. Loops with a lot going on in the body may not benefit significantly from unrolling - the loop overhead becomes small compared to everything else.

Zenilogix
  • 1,318
  • 1
  • 15
  • 31