13

Consider the following code:

 vector<double> v;
 // fill v
 const vector<double>::iterator end = v.end();
 for (vector<double>::iterator i = v.begin(); i != end; ++i) {
   // do stuff
 }

Are compilers like g++, clang++, and icc able to unroll loops like this? Unfortunately I do not know assembly well enough to verify from the output whether the loop gets unrolled or not. (And I only have access to g++.)

To me it seems that this will require more smartness than usual on the part of the compiler: first to deduce that the iterator is a random-access iterator, and then to figure out the number of times the loop is executed. Can compilers do this when optimization is enabled?

Thanks for your replies, and before some of you start lecturing about premature optimization, this is an exercise in curiosity.

san
  • 4,144
  • 6
  • 32
  • 50
  • Since `end` is a dynamic quantity, there's only so much unrolling you can do... – Kerrek SB Jul 17 '12 at 18:56
  • 2
    I'm going to guess that there is no unrolling to be done since the size cannot be predicted. –  Jul 17 '12 at 18:57
  • 4
    +1 for preemptive protection against accusations of premature optimization :-) – Aasmund Eldhuset Jul 17 '12 at 18:57
  • 3
    It's definitely possible to unroll without knowing the trip count. It just takes a bit of clean-up code... I've done it many times before. – Mysticial Jul 17 '12 at 18:58
  • @Kerrek SB I think `end` is not dynamic, I define it outside of the loop as a constant. – san Jul 17 '12 at 19:01
  • It *can* of course, but *will* it? I don't know, so I'm just going to try it. – harold Jul 17 '12 at 19:03
  • @harold Thanks for that. As I said, I don't know how to read the generated assembly to figure out how well it is doing. – san Jul 17 '12 at 19:05
  • 2
    @san You define it as `const`, not as a constant. Its value is determined at runtime, not at compile-time, so constness cannot be used to optimize/unroll the loop. – Jonathan Grynspan Jul 17 '12 at 19:07
  • @JonathanGrynspan I see. It is indeed not a compile time constant. What I hoped is that the compiler will now know that the position of the `end` does not change inside the body of the loop and use that information. This is as opposed to defining the loop with `for(..; i !=v.end(); ++i)` – san Jul 17 '12 at 19:12
  • 2
    MSVC did not unroll this loop in any of my tests (full optimization level, favor speed and fp:fast) – harold Jul 17 '12 at 19:18
  • @harold Could you add it as an answer so that I could upvote it. – san Jul 17 '12 at 19:20
  • @san : Just because VC++ doesn't, doesn't mean that no other compiler does. ;-] – ildjarn Jul 17 '12 at 19:38
  • @ildjarn A positive example will definitely be appreciated. – san Jul 17 '12 at 19:42
  • 1
    I tested with clang++ 3.1 and llvm-g++ 4.2 from Xcode 4.3.3, g++ 4.2 from Xcode 4.0, and g++ 4.7 from Homebrew on a pair of recent-ish Macs, with a variety of different settings (including -funroll-all-loops of course), and a loop that does nothing but copy *i to a volatile. The loop was never unrolled. However, that may well be because unrolling is a pessimization on x86_64 with that loop body, so it says nothing about whether the compiler _could_ unroll the loop if it were worth doing. – abarnert Jul 17 '12 at 19:52
  • @abarnert This really deserves to be in one of the answers. – san Jul 17 '12 at 19:55
  • 1
    @abarnert: I am definitely getting loop unrolling with `-funroll-loops` (x86-64 linux, g++ 4.4.4 and g++ 4.1.2). – jxh Jul 17 '12 at 20:11
  • @user315052 Could you update your answer with this? I have already upvoted so I can't anymore, but this is valuable information. – san Jul 17 '12 at 20:15
  • @jxh Are you getting loop unrolling in general, or did this particular loop get unrolled? – mip Nov 30 '14 at 03:18

4 Answers

8

I would propose that whether or not the compiler CAN unroll the loop, with modern pipelined architectures and caches there is little benefit in doing so unless your "do stuff" is trivial, and in many cases doing so would be a performance HIT instead of a boon.

If your "do stuff" is nontrivial, unrolling the loop creates multiple copies of that nontrivial code, which take extra time to load into the instruction cache, significantly slowing the first pass through the unrolled loop. At the same time, it evicts more code from the cache, code which may have been needed by the "do stuff" itself if it makes any function calls, and which would then need to be reloaded into the cache again.

Loop unrolling made a lot of sense on older cacheless, non-pipelined, non-branch-predicting architectures, where the goal was to reduce the overhead of the loop logic. Nowadays, with cache-based, pipelined, branch-predicting hardware, by the time you detect the i==end exit condition the CPU will have been pipelined well into the next loop iteration, speculatively executing the loop body again, at which point the processor simply throws out that final speculatively executed set of results. In such an architecture, loop unrolling makes very little sense. It would further bloat code for virtually no benefit.

phonetagger
  • 7,701
  • 3
  • 31
  • 55
  • 2
    +1, because this is a good explanation for why it usually won't unroll the loop, even if it can. But the OP is asking whether, if it were ever worth doing, the compiler would be able to do so, and I'm not sure how to completely answer that without actually dreaming up a case where it would be worth doing… – abarnert Jul 17 '12 at 19:55
  • @abarnert I think that copying to a `volatile` does qualify as 'trivial' and hence makes a great test case. I cannot think of anything shorter that won't get optimized away completely. – san Jul 17 '12 at 19:59
  • @san: Actually, it looks like copying _from_ a volatile is a better test case, given user315052's answer. And maybe putting inline asm inside the loop would be even more trivial? Oh well, we've got the answer proven, so no more effort seems needed. – abarnert Jul 17 '12 at 22:43
  • @abarnert It also could be that the deciding factor there was the compiler, 'cause number of instructions wise they should be quite equivalent. – san Jul 17 '12 at 22:55
  • @san: Maybe, but g++ 4.2 and 4.7 doing one thing while 4.1 and 4.4 do the other seems a bit odd. Not impossible, and of course apple-llvm-g++ 4.2 could easily be different than stock 4.2… – abarnert Jul 17 '12 at 23:14
  • @abarnert Actually I hadn't read your comment that you were able to coax unrolling by switching the source and destination. So your hunch is correct. – san Jul 17 '12 at 23:36
  • True, although loop unrolling still makes sense for simple, small (in terms of number of operations) loops, like when you add one vector (mathematics, not std::) to another. The compiler can then perform further optimizations like vectorization, which can make it possible to execute the whole loop in a single step or significantly reduce the number of steps. – mip Nov 30 '14 at 02:56
8

To me it seems that this will require more smartness than usual on the part of the compiler, first to deduce that the iterator is a random-access iterator, and then figure out the number of times the loop is executed.

The STL, being made up entirely of templates, has all of its code inline. So random-access iterators effectively reduce to pointers by the time the compiler begins to apply optimizations. One of the reasons the STL was created was to reduce the need for a programmer to outwit the compiler. You should rely on the STL to do the right thing until proven otherwise.

Of course, it is still up to you to choose the proper tool from the STL to use...

Edit: There was discussion about whether g++ does any loop unrolling. On the versions that I am using, loop unrolling is not part of -O, -O2, or -O3, and I get identical assembly for the latter two levels with the following code:

void foo (std::vector<int> &v) {
    volatile int c = 0;
    const std::vector<int>::const_iterator end = v.end();
    for (std::vector<int>::iterator i = v.begin(); i != end; ++i) {
        *i = c++;
    }
}

With the corresponding -O2 assembly:

_Z3fooRSt6vectorIiSaIiEE:
.LFB435:
        movq    8(%rdi), %rcx
        movq    (%rdi), %rax
        movl    $0, -4(%rsp)
        cmpq    %rax, %rcx
        je      .L4
        .p2align 4,,10
        .p2align 3
.L3:
        movl    -4(%rsp), %edx
        movl    %edx, (%rax)
        addq    $4, %rax
        addl    $1, %edx
        cmpq    %rax, %rcx
        movl    %edx, -4(%rsp)
        jne     .L3
.L4:
        rep
        ret

With the -funroll-loops option added, the function expands into something much much larger. But, the documentation warns about this option:

Unroll loops whose number of iterations can be determined at compile time or upon entry to the loop. -funroll-loops implies -frerun-cse-after-loop. It also turns on complete loop peeling (i.e. complete removal of loops with small constant number of iterations). This option makes code larger, and may or may not make it run faster.

As a further argument to dissuade you from unrolling loops yourself, I'll finish this answer with an illustration of applying Duff's Device to the foo function above:

void foo_duff (std::vector<int> &v) {
    volatile int c = 0;
    const std::vector<int>::const_iterator end = v.end();
    std::vector<int>::iterator i = v.begin();
    switch ((end - i) % 4) do {
    case 0: *i++ = c++;
    case 3: *i++ = c++;
    case 2: *i++ = c++;
    case 1: *i++ = c++;
    } while (i != end);
}

GCC has another loop optimization flag:

-ftree-loop-optimize
Perform loop optimizations on trees. This flag is enabled by default at -O and higher.

So, the -O option enables simple loop optimizations for the innermost loops, including complete loop unrolling (peeling) for loops with a fixed number of iterations. (Thanks to doc for pointing this out to me.)

jxh
  • 69,070
  • 8
  • 110
  • 193
  • Thanks for the detailed reply. One question: should the `const_iterator` be a `const` iterator? Doesn't the former mean that the dereferenced object is constant, but the iterator is not? – san Jul 17 '12 at 20:47
  • @san: I used `const_iterator` to show I did not intend to use `end` to modify anything. You could also make `end` `const` to indicate the variable itself will not change. But I think the compiler can infer that from the fact it is local, and not being used in the loop. – jxh Jul 17 '12 at 20:55
  • Right, I thought that the compiler figures it out the other way around from the fact that end is not supposed to be dereferenced. But I guess one could use some nasty tricks by dereferencing beyond the container, so the compiler cannot disallow it. – san Jul 17 '12 at 21:00
  • @san: No, I think you are right, the compiler can figure it out, so long as the local iterator is not passed to some other function by reference. But, I added the other `const`, just for you. – jxh Jul 17 '12 at 21:14
  • 1
    Aha. If I change my test code to copy from the volatile to *i, instead of the other way around, at least apple-gcc-4.2 actually does unroll the loop. Strange that would make a difference, but… that's a good illustration of the fact that it's harder to prove a negative ("the compiler can't do X because I can't come up with an example that makes it do X" doesn't work) than a positive ("the compiler can do X because I came up with an example that makes it do X"). :) – abarnert Jul 17 '12 at 22:45
  • g++ unrolls counted loops. I don't think the compiler is able to unroll an iterator-based loop. – mip Nov 30 '14 at 02:27
  • @doc: The answer mentions that it is not done by default, but will be done if passed the `-funroll-loops` option. – jxh Nov 30 '14 at 05:22
  • @jxh Iterator-based loops are not as easy to unroll as counted loops. If `v` were an argument of a function it would be completely impossible to unroll this loop. For a local object it may be possible, but would require some effort from the compiler. I don't know how it is now, but some time ago g++ was not able to unroll uncountable loops. – mip Nov 30 '14 at 05:39
  • @doc: In my answer, I already stated I tried it and it was unrolled. – jxh Nov 30 '14 at 06:43
  • @jxh Yes, you suspect it was unrolled, but it's not true loop unrolling. What you see is gcc trying to break the loop into chunks, but it's not truly unrolled. That's why you see a lot of additional code (also because with std::vector some of its internals could be "fake unrolled"). – mip Nov 30 '14 at 07:13
  • @doc: I know it was unrolled. It was just too long and boring to present in the answer. – jxh Nov 30 '14 at 08:05
  • @jxh no it wasn't. Take a look at my answer where I explain this. – mip Nov 30 '14 at 09:32
  • @abarnert compiler is not able to unroll these loops, unless they are working on local objects (still this requires some preconditions to unroll them). – mip Nov 30 '14 at 10:04
  • @jxh it's not loop unrolling. It's called loop peeling or loop splitting. – mip Nov 30 '14 at 10:49
  • @doc: [Wikipedia](http://en.wikipedia.org/wiki/Loop_unrolling) explains what *loop unrolling* means, and it doesn't mean what you seem to think it means. *Loop peeling/splitting* is used to reduce the number of iterations to be a multiple of K, and then the loop is unrolled from N iterations into N/K iterations. – jxh Nov 30 '14 at 10:54
  • @jxh and that's what you see in the disassembly. Loop peeling + loop unrolling (I admit that, dunno why, I always thought of loop unrolling as complete unrolling, even though I knew about Duff's device. Even though I had different thing in my mind than most people, still my point is valid, because compiler will in most cases choose to not unroll loop, since partial unrolling [with peeling and a lot of code] is probably too expensive). But it will unroll counted loop (with fixed condition). – mip Nov 30 '14 at 12:30
  • @doc: It is just optimizing the loop for a small count. The compiler will make a decision to do simple optimizations like that when it is a no-brainer. But put in a much larger fixed count, and it won't touch the loop. – jxh Nov 30 '14 at 12:47
  • @jxh this optimization is nothing else, but loop unrolling. Whether it's small loop or not, simple optimization or not, it is loop unrolling. – mip Nov 30 '14 at 13:16
  • @doc: As I already explained in a different comment, the outcome of the optimization on what you call a "counted loop" is an unrolled loop if the count is small, the optimization is not applied to any counted loop. Different optimization techniques can result in similar code outcomes. In this case, the optimizer may be removing an unnecessary branch instruction, and not really applying the *loop unrolling* optimization. This would explain why it is only happening for small loops. – jxh Nov 30 '14 at 20:01
  • @jxh It's not applied to larger loops because it is not beneficial to unroll those loops. Of course it's removing an unnecessary branch instruction, because that's what every loop essentially is: a branch instruction. – mip Nov 30 '14 at 21:02
  • @doc: You keep making this assertion that it would not be beneficial, but until I corrected you, you were not clear on what the optimization meant. It may seem like I am being stubborn, but it is clear you are not doing your homework. Try manually unrolling the large loop yourself and measure whether or not it is beneficial. – jxh Nov 30 '14 at 21:50
  • @jxh I've admitted that I made a mistake when it comes to terminology. But I see what's going on in disassembly and I can't admit that there's no loop unrolling if there is one. – mip Dec 01 '14 at 08:51
  • @jxh this assertion is obvious. Complete unrolling of small loops is beneficial, because generated code is by all means simpler and smaller. With larger loops it starts to become a tradeoff, because larger code not only occupies memory, but may cause cache misses. That's why `funroll-loops` is disabled by default and you have it written that "This option makes code larger, and may or may not make it run faster". With small loops it's not the case, compiler chooses to unroll them straight away with standard optimization flags. – mip Dec 01 '14 at 09:08
  • @doc: (1) You do not know if it would be beneficial or not, because you have not tried. I already explained `-funroll-loops` needs timing information to know the optimal amount of unrolling, but that doesn't mean no unrolling is more optimal. (2) I believe the documentation when it is saying `-funroll-loops` are not enabled by default, so that is why I assert the optimization you are observing is not an application of the loop unrolling optimization, even though the outcome of the optimization is an unrolled loop. – jxh Dec 01 '14 at 21:15
  • @jxh `"you do not know if it would be beneficial..."` Come on... It's obvious that 4 `addsd` instructions are faster than 4 `addsd` inside loop = `addsd` instruction + loop opcodes to repeat it 4 times (jump, counter initialization, incrementation and comparison). And it's not me, but compiler who decides to optimize code in such way. So, write to compiler building team that they get it wrong, by making such assumption. – mip Dec 02 '14 at 06:56
  • @doc: You should realize that there is nothing obvious about optimizing code. The only difference may be more code, not in execution time. E.g., on some machine, the instructions could actually be executed simultaneously for each iteration. My point is that you cannot know what is better or not until proper profiling experiments are done. – jxh Dec 02 '14 at 07:15
  • @jxh yes, on some strange architecture from outer space it could be slower to execute 4 additions one after another instead of executing them inside loop. But to end this I took an effort to look at gcc internals and loop unrolling is implemented in at least two different places. `funroll-loops` and `funroll-all-loops` use pass defined in `loop-unroll.c`. But loop unrolling is also performed in iv_canon pass, which is defined here https://github.com/gcc-mirror/gcc/blob/25dbe4c7d68440a4caf6ad5b37d62f4b746760ca/gcc/tree-ssa-loop-ivcanon.c and enabled by `ftree-loop-optimize` on levels `-O` & up. – mip Dec 02 '14 at 08:13
  • @jxh Excerpt from tree-ssa-loop-ivcanon.c `" We also perform - complete unrolling (or peeling) when the loops is rolling few enough times"`. So, I think it closes this discussion. – mip Dec 02 '14 at 08:17
  • @jxh TBH `ftree-loop-optimize` is not the only flag which controls loop optimizations. It looks like it is required to turn this option on first, to enable optimization on trees, but additional flags may be required to unroll loops. So it would be safer to say "`-O` enables simple loop optimizations" than "This option enables...". – mip Dec 02 '14 at 14:43
1

The short answer is yes. It will unroll as much as it can. In your case, it obviously depends on how you define `end` (I assume your example is generic). Not only will most modern compilers unroll, but they will also vectorize and do other optimizations that will often blow your own solutions out of the water.

So what I'm saying is don't prematurely optimize! Just kidding :)

David Titarenco
  • 32,662
  • 13
  • 66
  • 111
  • I define end as a constant iterator outside of the loop. – san Jul 17 '12 at 19:04
  • It won't unroll as much as it can; it'll unroll as much as it thinks is worth unrolling. And in many cases, that will be not at all. – abarnert Jul 17 '12 at 19:53
  • I don't think the compiler is able to unroll an iterator-based loop the way it can with counted loops. `v.end()` is not a constexpr, so how can the compiler know where the loop ends at compile time? Maybe with some mad tracing of pointers it could optimize a local object, but it's totally impossible when `v` is a function argument. – mip Nov 30 '14 at 03:06
1

Simple answer: generally NO! At least when it comes to complete loop unrolling.

Let's test loop unrolling on this simple, dirty-coded (for testing purposes) structure.

struct Test
{
    Test(): begin(arr), end(arr + 4) {}

    double * begin;
    double * end;
    double arr[4];
};

First let's take counted loop and compile it without any optimizations.

double counted(double param, Test & d)
{
    for (int i = 0; i < 4; i++)
        param += d.arr[i];
    return param;
}

Here's what gcc 4.9 produces.

counted(double, Test&):
    pushq   %rbp
    movq    %rsp, %rbp
    movsd   %xmm0, -24(%rbp)
    movq    %rdi, -32(%rbp)
    movl    $0, -4(%rbp)
    jmp .L2
.L3:
    movq    -32(%rbp), %rax
    movl    -4(%rbp), %edx
    movslq  %edx, %rdx
    addq    $2, %rdx
    movsd   (%rax,%rdx,8), %xmm0
    movsd   -24(%rbp), %xmm1
    addsd   %xmm0, %xmm1
    movq    %xmm1, %rax
    movq    %rax, -24(%rbp)
    addl    $1, -4(%rbp)
.L2:
    cmpl    $3, -4(%rbp)
    jle .L3
    movq    -24(%rbp), %rax
    movq    %rax, -40(%rbp)
    movsd   -40(%rbp), %xmm0
    popq    %rbp
    ret

As expected, the loop hasn't been unrolled and, since no optimizations were performed, the code is generally very verbose. Now let's turn on the -O3 flag. Produced disassembly:

counted(double, Test&):
    addsd   16(%rdi), %xmm0
    addsd   24(%rdi), %xmm0
    addsd   32(%rdi), %xmm0
    addsd   40(%rdi), %xmm0
    ret

Voilà, the loop has been unrolled this time.


Now let's take a look at iterated loop. Function containing the loop will look like this.

double iterated(double param, Test & d)
{
  for (double * it = d.begin; it != d.end; ++it)
    param += *it;
  return param;
}

Still using -O3 flag, let's take a look at disassembly.

iterated(double, Test&):
    movq    (%rdi), %rax
    movq    8(%rdi), %rdx
    cmpq    %rdx, %rax
    je  .L3
.L4:
    addsd   (%rax), %xmm0
    addq    $8, %rax
    cmpq    %rdx, %rax
    jne .L4
.L3:
    rep ret

The code looks better than in the very first case, because optimizations were performed, but the loop hasn't been unrolled this time!

What about the -funroll-loops and -funroll-all-loops flags? They will produce a result similar to this:

iterated(double, Test&):
    movq    (%rdi), %rsi
    movq    8(%rdi), %rcx
    cmpq    %rcx, %rsi
    je  .L3
    movq    %rcx, %rdx
    leaq    8(%rsi), %rax
    addsd   (%rsi), %xmm0
    subq    %rsi, %rdx
    subq    $8, %rdx
    shrq    $3, %rdx
    andl    $7, %edx
    cmpq    %rcx, %rax
    je  .L43
    testq   %rdx, %rdx
    je  .L4
    cmpq    $1, %rdx
    je  .L29
    cmpq    $2, %rdx
    je  .L30
    cmpq    $3, %rdx
    je  .L31
    cmpq    $4, %rdx
    je  .L32
    cmpq    $5, %rdx
    je  .L33
    cmpq    $6, %rdx
    je  .L34
    addsd   (%rax), %xmm0
    leaq    16(%rsi), %rax
.L34:
    addsd   (%rax), %xmm0
    addq    $8, %rax
.L33:
    addsd   (%rax), %xmm0
    addq    $8, %rax
.L32:
    addsd   (%rax), %xmm0
    addq    $8, %rax
.L31:
    addsd   (%rax), %xmm0
    addq    $8, %rax
.L30:
    addsd   (%rax), %xmm0
    addq    $8, %rax
.L29:
    addsd   (%rax), %xmm0
    addq    $8, %rax
    cmpq    %rcx, %rax
    je  .L44
.L4:
    addsd   (%rax), %xmm0
    addq    $64, %rax
    addsd   -56(%rax), %xmm0
    addsd   -48(%rax), %xmm0
    addsd   -40(%rax), %xmm0
    addsd   -32(%rax), %xmm0
    addsd   -24(%rax), %xmm0
    addsd   -16(%rax), %xmm0
    addsd   -8(%rax), %xmm0
    cmpq    %rcx, %rax
    jne .L4
.L3:
    rep ret
.L44:
    rep ret
.L43:
    rep ret

Compare this with the fully unrolled counted loop; it's clearly not the same. What we see here is that gcc divided the loop into 8-element chunks. This can increase performance in some cases, because the loop exit condition is checked once per 8 normal loop iterations. With additional flags vectorization could also be performed. But it isn't complete loop unrolling.

The iterated loop will be unrolled, however, if the Test object is not a function argument.

double iteratedLocal(double param)
{
  Test d;
  for (double * it = d.begin; it != d.end; ++it)
    param += *it;
  return param;
}

Disassembly produced with only -O3 flag:

iteratedLocal(double):
    addsd   -40(%rsp), %xmm0
    addsd   -32(%rsp), %xmm0
    addsd   -24(%rsp), %xmm0
    addsd   -16(%rsp), %xmm0
    ret

As you can see, the loop has been unrolled. This is because the compiler can now safely assume that `end` has a fixed value, while it couldn't predict that for a function argument.

The Test structure is statically allocated, however. Things are more complicated with dynamically allocated structures like std::vector. From my observations on a modified Test structure, changed so that it resembles a dynamically allocated container, it looks like gcc tries its best to unroll loops, but in most cases the generated code is not as simple as the one above.


Since you asked about other compilers, here's the output from clang 3.4.1 (-O3 flag):

counted(double, Test&):                      # @counted(double, Test&)
    addsd   16(%rdi), %xmm0
    addsd   24(%rdi), %xmm0
    addsd   32(%rdi), %xmm0
    addsd   40(%rdi), %xmm0
    ret

iterated(double, Test&):                     # @iterated(double, Test&)
    movq    (%rdi), %rax
    movq    8(%rdi), %rcx
    cmpq    %rcx, %rax
    je  .LBB1_2
.LBB1_1:                                # %.lr.ph
    addsd   (%rax), %xmm0
    addq    $8, %rax
    cmpq    %rax, %rcx
    jne .LBB1_1
.LBB1_2:                                # %._crit_edge
    ret

iteratedLocal(double):                     # @iteratedLocal(double)
    leaq    -32(%rsp), %rax
    movq    %rax, -48(%rsp)
    leaq    (%rsp), %rax
    movq    %rax, -40(%rsp)
    xorl    %eax, %eax
    jmp .LBB2_1
.LBB2_2:                                # %._crit_edge4
    movsd   -24(%rsp,%rax), %xmm1
    addq    $8, %rax
.LBB2_1:                                # =>This Inner Loop Header: Depth=1
    movaps  %xmm0, %xmm2
    cmpq    $24, %rax
    movaps  %xmm1, %xmm0
    addsd   %xmm2, %xmm0
    jne .LBB2_2
    ret

Intel's icc 13.01 (-O3 flag)

counted(double, Test&):
        addsd     16(%rdi), %xmm0                               #24.5
        addsd     24(%rdi), %xmm0                               #24.5
        addsd     32(%rdi), %xmm0                               #24.5
        addsd     40(%rdi), %xmm0                               #24.5
        ret                                                     #25.10
iterated(double, Test&):
        movq      (%rdi), %rdx                                  #30.26
        movq      8(%rdi), %rcx                                 #30.41
        cmpq      %rcx, %rdx                                    #30.41
        je        ..B3.25       # Prob 50%                      #30.41
        subq      %rdx, %rcx                                    #30.7
        movb      $0, %r8b                                      #30.7
        lea       7(%rcx), %rax                                 #30.7
        sarq      $2, %rax                                      #30.7
        shrq      $61, %rax                                     #30.7
        lea       7(%rax,%rcx), %rcx                            #30.7
        sarq      $3, %rcx                                      #30.7
        cmpq      $16, %rcx                                     #30.7
        jl        ..B3.26       # Prob 10%                      #30.7
        movq      %rdx, %rdi                                    #30.7
        andq      $15, %rdi                                     #30.7
        je        ..B3.6        # Prob 50%                      #30.7
        testq     $7, %rdi                                      #30.7
        jne       ..B3.26       # Prob 10%                      #30.7
        movl      $1, %edi                                      #30.7
..B3.6:                         # Preds ..B3.5 ..B3.3
        lea       16(%rdi), %rax                                #30.7
        cmpq      %rax, %rcx                                    #30.7
        jl        ..B3.26       # Prob 10%                      #30.7
        movq      %rcx, %rax                                    #30.7
        xorl      %esi, %esi                                    #30.7
        subq      %rdi, %rax                                    #30.7
        andq      $15, %rax                                     #30.7
        negq      %rax                                          #30.7
        addq      %rcx, %rax                                    #30.7
        testq     %rdi, %rdi                                    #30.7
        jbe       ..B3.11       # Prob 2%                       #30.7
..B3.9:                         # Preds ..B3.7 ..B3.9
        addsd     (%rdx,%rsi,8), %xmm0                          #31.9
        incq      %rsi                                          #30.7
        cmpq      %rdi, %rsi                                    #30.7
        jb        ..B3.9        # Prob 82%                      #30.7
..B3.11:                        # Preds ..B3.9 ..B3.7
        pxor      %xmm6, %xmm6                                  #28.12
        movaps    %xmm6, %xmm7                                  #28.12
        movaps    %xmm6, %xmm5                                  #28.12
        movsd     %xmm0, %xmm7                                  #28.12
        movaps    %xmm6, %xmm4                                  #28.12
        movaps    %xmm6, %xmm3                                  #28.12
        movaps    %xmm6, %xmm2                                  #28.12
        movaps    %xmm6, %xmm1                                  #28.12
        movaps    %xmm6, %xmm0                                  #28.12
..B3.12:                        # Preds ..B3.12 ..B3.11
        addpd     (%rdx,%rdi,8), %xmm7                          #31.9
        addpd     16(%rdx,%rdi,8), %xmm6                        #31.9
        addpd     32(%rdx,%rdi,8), %xmm5                        #31.9
        addpd     48(%rdx,%rdi,8), %xmm4                        #31.9
        addpd     64(%rdx,%rdi,8), %xmm3                        #31.9
        addpd     80(%rdx,%rdi,8), %xmm2                        #31.9
        addpd     96(%rdx,%rdi,8), %xmm1                        #31.9
        addpd     112(%rdx,%rdi,8), %xmm0                       #31.9
        addq      $16, %rdi                                     #30.7
        cmpq      %rax, %rdi                                    #30.7
        jb        ..B3.12       # Prob 82%                      #30.7
        addpd     %xmm6, %xmm7                                  #28.12
        addpd     %xmm4, %xmm5                                  #28.12
        addpd     %xmm2, %xmm3                                  #28.12
        addpd     %xmm0, %xmm1                                  #28.12
        addpd     %xmm5, %xmm7                                  #28.12
        addpd     %xmm1, %xmm3                                  #28.12
        addpd     %xmm3, %xmm7                                  #28.12
        movaps    %xmm7, %xmm0                                  #28.12
        unpckhpd  %xmm7, %xmm0                                  #28.12
        addsd     %xmm0, %xmm7                                  #28.12
        movaps    %xmm7, %xmm0                                  #28.12
..B3.14:                        # Preds ..B3.13 ..B3.26
        lea       1(%rax), %rsi                                 #30.7
        cmpq      %rsi, %rcx                                    #30.7
        jb        ..B3.25       # Prob 50%                      #30.7
        subq      %rax, %rcx                                    #30.7
        cmpb      $1, %r8b                                      #30.7
        jne       ..B3.17       # Prob 50%                      #30.7
..B3.16:                        # Preds ..B3.17 ..B3.15
        xorl      %r8d, %r8d                                    #30.7
        jmp       ..B3.21       # Prob 100%                     #30.7
..B3.17:                        # Preds ..B3.15
        cmpq      $2, %rcx                                      #30.7
        jl        ..B3.16       # Prob 10%                      #30.7
        movq      %rcx, %r8                                     #30.7
        xorl      %edi, %edi                                    #30.7
        pxor      %xmm1, %xmm1                                  #28.12
        lea       (%rdx,%rax,8), %rsi                           #31.19
        andq      $-2, %r8                                      #30.7
        movsd     %xmm0, %xmm1                                  #28.12
..B3.19:                        # Preds ..B3.19 ..B3.18
        addpd     (%rsi,%rdi,8), %xmm1                          #31.9
        addq      $2, %rdi                                      #30.7
        cmpq      %r8, %rdi                                     #30.7
        jb        ..B3.19       # Prob 82%                      #30.7
        movaps    %xmm1, %xmm0                                  #28.12
        unpckhpd  %xmm1, %xmm0                                  #28.12
        addsd     %xmm0, %xmm1                                  #28.12
        movaps    %xmm1, %xmm0                                  #28.12
..B3.21:                        # Preds ..B3.20 ..B3.16
        cmpq      %rcx, %r8                                     #30.7
        jae       ..B3.25       # Prob 2%                       #30.7
        lea       (%rdx,%rax,8), %rax                           #31.19
..B3.23:                        # Preds ..B3.23 ..B3.22
        addsd     (%rax,%r8,8), %xmm0                           #31.9
        incq      %r8                                           #30.7
        cmpq      %rcx, %r8                                     #30.7
        jb        ..B3.23       # Prob 82%                      #30.7
..B3.25:                        # Preds ..B3.23 ..B3.21 ..B3.14 ..B3.1
        ret                                                     #32.14
..B3.26:                        # Preds ..B3.2 ..B3.6 ..B3.4    # Infreq
        movb      $1, %r8b                                      #30.7
        xorl      %eax, %eax                                    #30.7
        jmp       ..B3.14       # Prob 100%                     #30.7
iteratedLocal(double):
        lea       -8(%rsp), %rax                                #8.13
        lea       -40(%rsp), %rdx                               #7.11
        cmpq      %rax, %rdx                                    #33.41
        je        ..B4.15       # Prob 50%                      #33.41
        movq      %rax, -48(%rsp)                               #32.12
        movq      %rdx, -56(%rsp)                               #32.12
        xorl      %eax, %eax                                    #33.7
..B4.13:                        # Preds ..B4.11 ..B4.13
        addsd     -40(%rsp,%rax,8), %xmm0                       #34.9
        incq      %rax                                          #33.7
        cmpq      $4, %rax                                      #33.7
        jb        ..B4.13       # Prob 82%                      #33.7
..B4.15:                        # Preds ..B4.13 ..B4.1
        ret                                                     #35.14

To avoid misunderstanding: if the counted loop's condition relies on an external parameter, as in this one:

double countedDep(double param, Test & d)
{
    for (int i = 0; i < d.size; i++)
        param += d.arr[i];
    return param;
}

Such a loop will also not be unrolled.
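By contrast, when the trip count is a compile-time constant, the compiler can unroll the loop completely into straight-line code. A minimal sketch (the function name and array size are just for illustration):

```cpp
#include <cassert>

// Trip count is a compile-time constant, so g++ -O3 can replace
// the loop with four straight-line additions (no branch, no counter).
double countedConst(double param, const double (&arr)[4])
{
    for (int i = 0; i < 4; i++)
        param += arr[i];
    return param;
}
```

Compiling this with `g++ -O3 -S` and inspecting the output should show the loop body replaced by a short run of `addsd` instructions with no backward jump.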

mip
  • 8,355
  • 6
  • 53
  • 72
  • "...it isn't loop unrolling as such..." is just a false statement unless you have a very strange definition of *loop unrolling*. – jxh Nov 30 '14 at 10:33
  • @jxh if it's still a loop, then how did it become unrolled for you? It's just chopped into 8-element chunks, at the expense of extra code and conditions. – mip Nov 30 '14 at 10:47
  • [Wikipedia](http://en.wikipedia.org/wiki/Loop_unrolling) explains what *loop unrolling* means, and it doesn't mean what you seem to think it means. – jxh Nov 30 '14 at 10:52
  • @jxh yes, but it's not complete loop unrolling. That's what I meant by "loop unrolling as such". – mip Nov 30 '14 at 11:04
  • 1
    I think it is quite obvious a compiler cannot "completely unroll a loop into something without any loops" unless it knows exactly how many iterations will be performed. It is also obvious to me that completely unrolling a loop is a pointless optimization unless the number of iterations is very small, so I am not sure your narrow point is any improvement over my correct answer. – jxh Nov 30 '14 at 11:12
  • @jxh then why have you written that "loop unrolling is not a part of O, O2, O3" when it is? The compiler chooses not to unroll iterated loops because it's pointless, while it chooses to unroll counted loops, so your answer is evasive at best. Generally it will not unroll a loop expressed with iterators. – mip Nov 30 '14 at 11:23
  • I stated a correct fact about how GCC works. The documentation lists what options are enabled for each optimization level, and `-funroll-loops` is not part of any of them. What you are showing is not *loop unrolling* as most people recognize what that term means (while "complete loop unrolling" is a term you have invented for this post). – jxh Nov 30 '14 at 11:30
  • @jxh if it's not loop unrolling in my examples compiled with `O3` flag then what is it in your opinion? – mip Nov 30 '14 at 12:09
  • It is just an optimization for iterating on four statements. Change the size of the array from 4 to 1024, and iterate 1024 times and see what happens. You will see your theory does not hold. – jxh Nov 30 '14 at 12:42
  • @jxh obviously this optimization is called *loop unrolling*. Of course the compiler may choose not to unroll a loop if it sees that the size of the generated code (for example for 1024 iterations) is not beneficial. Change it to 16 and it unrolls. Haha, my "theory" is not valid because it didn't unroll the loop for 1024 iterations? LOL. Why is 1024 the number that proves (or disproves) something? – mip Nov 30 '14 at 12:47
  • If the optimization is truly *loop unrolling*, it wouldn't matter what the loop count is. Instead, it is making a selective heuristic decision when to transform a small loop into the equivalent statements. The result is an unrolled loop, but it is not applying the "loop unrolling" optimization on the loop in the general sense. – jxh Nov 30 '14 at 12:51
  • @jxh technically it is loop unrolling. It's incredible how you can deny this fact. It does not perform loop unrolling for higher counts because, as the documentation states, loop unrolling may or may not make code run faster, so it's not worth performing that optimization for every single loop at the expense of a larger binary. But it's beneficial for small loops. – mip Nov 30 '14 at 12:57
  • If the compiler only knows how to do loop unrolling for counted loops (which your post claims), it shouldn't matter what the count is. You now admit that it does. However, the loop unrolling optimization doesn't care what the count is. The reason why the compiler does not enable loop unrolling by default is because it actually needs to profile the code to tell if the general loop unrolling optimization will actually be faster. There are ways to feed profiling information back to the compiler to make its optimization decisions better. – jxh Nov 30 '14 at 12:58
  • Anyway, I am too tired to explain it anymore. You will just have to believe what you want. – jxh Nov 30 '14 at 13:02
  • @jxh where did you get the claim that it doesn't matter what the count is? Who said that the decision whether to unroll a loop can't depend on the loop count? And when it does depend on the loop count, does that mean it's not loop unrolling? That's ridiculous. I'm comparing a counted loop with an iterated loop for the same number of iterations. Actually gcc will perform loop unrolling for an iterated loop as well, if the object is local. – mip Nov 30 '14 at 13:13