I have an array of POD structs and am trying to sum across one field. Here's a minimal example:
struct Item
{
int x = 0;
int y = 0;
};
typedef Item Items[2];
struct ItemArray
{
Items items;
int sum_x1() const;
int sum_x2() const;
};
int ItemArray::sum_x1() const
{
int total = 0;
for (unsigned ii = 0; ii < 2; ++ii)
{
total += items[ii].x;
}
return total;
}
int ItemArray::sum_x2() const
{
int total = 0;
for (const Item& item : items)
{
total += item.x;
}
return total;
}
The two sum functions do the same thing. Clang compiles them identically. But GCC 6 with -O3
on x86_64 does not. Here's sum_x1()
, looking good:
mov eax, DWORD PTR [rdi+8]
add eax, DWORD PTR [rdi]
ret
Now look at sum_x2()
:
lea rdx, [rdi+16]
lea rcx, [rdi+8]
xor eax, eax
add eax, DWORD PTR [rdi]
cmp rdx, rcx
je .L12
lea rcx, [rdi+16]
add eax, DWORD PTR [rdi+8]
cmp rdx, rcx
je .L2
lea rcx, [rdi+24]
add eax, DWORD PTR [rdi+16]
cmp rdx, rcx
je .L2
lea rcx, [rdi+32]
add eax, DWORD PTR [rdi+24]
cmp rdx, rcx
je .L2
lea rcx, [rdi+40]
add eax, DWORD PTR [rdi+32]
cmp rdx, rcx
je .L2
lea rcx, [rdi+48]
add eax, DWORD PTR [rdi+40]
cmp rdx, rcx
je .L2
lea rcx, [rdi+56]
add eax, DWORD PTR [rdi+48]
cmp rdx, rcx
je .L2
lea rcx, [rdi+64]
add eax, DWORD PTR [rdi+56]
cmp rdx, rcx
je .L2
lea rcx, [rdi+72]
add eax, DWORD PTR [rdi+64]
cmp rdx, rcx
je .L2
add eax, DWORD PTR [rdi+72]
ret
.L2:
rep ret
.L12:
rep ret
Why does GCC emit an unrolled loop of variable length up to 10 when there are the loop length is fixed at 2? It only does this in a member function--making sum_x2
a free function fixes it.
ICC also optimizes sum_x2()
very strangely, though the generated code is totally different. Unlike GCC, it doesn't matter whether sum_x2()
is a member function or a free function--both are bad.
I'm using GCC 6, but all versions of GCC seem to have problems with this code. Adding -march=haswell
makes it even worse, adding iterations for up to 15 elements in the array of size 2. GCC 5 and 7 generate even more complex code, adding SIMD instructions.
I would like to identify the exact cause of this problem, so that I can locate and fix similar occurrences in my code. Understanding what triggers this behavior in GCC 6 would be very helpful. I have lots of range-based for loops in my code, and I am not too excited at the prospect of removing them, but if GCC can't generate reasonable code I will have no choice.
Try it: https://godbolt.org/g/9GK4jy
More related insanity: https://godbolt.org/g/BGYggD (optimal code is 3 instructions; GCC 6 produces 8 instructions; GCC 7 produces 130 instructions)