GCC optimizes fixed range-based for loop as if it had longer, variable length

Question

I have an array of POD structs and am trying to sum across one field. Here's a minimal example:

struct Item
{
    int x = 0;
    int y = 0;
};

typedef Item Items[2];

struct ItemArray
{
    Items items;

    int sum_x1() const;
    int sum_x2() const;
};

int ItemArray::sum_x1() const
{
    int total = 0;
    for (unsigned ii = 0; ii < 2; ++ii)
    {
        total += items[ii].x;
    }
    return total;
}

int ItemArray::sum_x2() const
{
    int total = 0;
    for (const Item& item : items)
    {
        total += item.x;
    }
    return total;
}

The two sum functions do the same thing. Clang compiles them identically. But GCC 6 with -O3 on x86_64 does not. Here's sum_x1(), looking good:

  mov eax, DWORD PTR [rdi+8]
  add eax, DWORD PTR [rdi]
  ret

Now look at sum_x2():

  lea rdx, [rdi+16]
  lea rcx, [rdi+8]
  xor eax, eax
  add eax, DWORD PTR [rdi]
  cmp rdx, rcx
  je .L12
  lea rcx, [rdi+16]
  add eax, DWORD PTR [rdi+8]
  cmp rdx, rcx
  je .L2
  lea rcx, [rdi+24]
  add eax, DWORD PTR [rdi+16]
  cmp rdx, rcx
  je .L2
  lea rcx, [rdi+32]
  add eax, DWORD PTR [rdi+24]
  cmp rdx, rcx
  je .L2
  lea rcx, [rdi+40]
  add eax, DWORD PTR [rdi+32]
  cmp rdx, rcx
  je .L2
  lea rcx, [rdi+48]
  add eax, DWORD PTR [rdi+40]
  cmp rdx, rcx
  je .L2
  lea rcx, [rdi+56]
  add eax, DWORD PTR [rdi+48]
  cmp rdx, rcx
  je .L2
  lea rcx, [rdi+64]
  add eax, DWORD PTR [rdi+56]
  cmp rdx, rcx
  je .L2
  lea rcx, [rdi+72]
  add eax, DWORD PTR [rdi+64]
  cmp rdx, rcx
  je .L2
  add eax, DWORD PTR [rdi+72]
  ret
.L2:
  rep ret
.L12:
  rep ret

Why does GCC emit an unrolled loop of variable length up to 10 when there are the loop length is fixed at 2? It only does this in a member function--making sum_x2 a free function fixes it.

ICC also optimizes sum_x2() very strangely, though the generated code is totally different. Unlike GCC, it doesn't matter whether sum_x2() is a member function or a free function--both are bad.

I'm using GCC 6, but all versions of GCC seem to have problems with this code. Adding -march=haswell makes it even worse, adding iterations for up to 15 elements in the array of size 2. GCC 5 and 7 generate even more complex code, adding SIMD instructions.

I would like to identify the exact cause of this problem, so that I can locate and fix similar occurrences in my code. Understanding what triggers this behavior in GCC 6 would be very helpful. I have lots of range-based for loops in my code, and I am not too excited at the prospect of removing them, but if GCC can't generate reasonable code I will have no choice.

Try it: https://godbolt.org/g/9GK4jy

More related insanity: https://godbolt.org/g/BGYggD (optimal code is 3 instructions; GCC 6 produces 8 instructions; GCC 7 produces 130 instructions)

I suppose this is the bug in `gcc`. You can repot it here https://gcc.gnu.org/bugzilla/ and mark as missed-optimization — fghj, Aug 04 '17 at 04:57
@fghj: I realize it is a bug in GCC and intend to report it. That's not the question though (see bold text). — John Zwinck, Aug 04 '17 at 06:06
According to my expiriense the probability that `gcc person` when analyze bug will tell you root cause and possible workaround is higher then you get answer about this on SO. By the way, can you please add link to bugreport to your question, after you will create bug report? — fghj, Aug 04 '17 at 06:14
@fghj: I filed GCC bug https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81719 . I'm still hoping for an answer (here or there) regarding what triggers this problem so I can diagnose whether it is likely to be happening elsewhere in my programs. — John Zwinck, Aug 04 '17 at 15:55
In your second example, the non-trivial destructor is the culprit. If you delete it, or replace it with `~FixedArray() noexcept = default;`, then both gcc versions generate 3 instructions. — Praetorian, Aug 04 '17 at 18:29
@Praetorian: I know what you mean, the second link does have a user-defined destructor without which the problem does not arise. However, the existence of such destructor (which is `noexcept`, defined inline, and does nothing) should not change the code required to implement construction. But it does. — John Zwinck, Aug 05 '17 at 01:45
@JohnZwinck Thanks for finding/filing this bug. It seems that they fixed it - anything in the bug report that lead you to be able to self-answer? — Barry, Sep 06 '17 at 16:00
@JohnZwinck Cheers! And thanks again. Time to pour over my entire code base... — Barry, Sep 07 '17 at 01:31

John Zwinck · Accepted Answer · 2017-09-07T03:47:34.457

As described by Richard Biener in my bug report, the problem seems to be that GCC prior to version 8 failed to understand that a field of a class or struct was subject to the same optimizations (e.g. constant loop count) as a regular variable. So it would emit all sorts of fancy code to optimally loop an unknown number of times, even when it was known at compile time, in the case where the container was a member variable.

The way I understand it, this bug probably affects quite a bit of code in the wild--e.g. anywhere a member small array is the subject of a C++11 range-based for loop.

Thanks to Richard Biener for the prompt resolution (targeted for GCC 8).

GCC optimizes fixed range-based for loop as if it had longer, variable length

1 Answers1

Linked