failure to optimize seemingly obvious loop invariant (but volatile qualifier did magic)

Question

Godbolt Link: https://godbolt.org/g/Hv6MAL

typedef int cell;

cell y;
const cell *phys_addr = (const cell*)0x12340;
int main() {
    for (int i = 0; i < 20; i++) {
        for (int j = 0; j < 30; j++) {
            for (int k = 0; k < 50; k++) {
                const cell *subarray = (&phys_addr[i] + phys_addr[i]/sizeof(cell));
                const cell *subsubarray = (&subarray[j] + subarray[j]/sizeof(cell));
                y = subsubarray[k];
            }
        } 
    }
}

It feels natural to expect the compiler to optimize the above code to something similar to:

int main() {
    for (int i = 0; i < 20; i++) {
        const cell *subarray = (&phys_addr[i] + phys_addr[i]/sizeof(cell));
        for (int j = 0; j < 30; j++) {
            const cell *subsubarray = (&subarray[j] + subarray[j]/sizeof(cell));
            for (int k = 0; k < 50; k++) {
                y = subsubarray[k];
            }
        } 
    }
}

but the assembly generated by gcc 8.2 with -O3 -m32 as flags is:

  push ebp
  push edi
  push esi
  push ebx
  sub esp, 8
  mov eax, DWORD PTR phys_addr
  mov DWORD PTR [esp], 0
  mov DWORD PTR [esp+4], eax
  mov ebp, eax
.L4:
  xor esi, esi
.L3:
  lea edi, [0+esi*4]
  xor eax, eax
.L2:
  mov edx, DWORD PTR [ebp+0]
  mov ecx, DWORD PTR [esp+4]
  shr edx, 2
  add edx, DWORD PTR [esp]
  lea ebx, [ecx+edx*4]
  lea edx, [eax+esi]
  add eax, 1
  mov ecx, DWORD PTR [ebx+edi]
  shr ecx, 2
  add edx, ecx
  mov edx, DWORD PTR [ebx+edx*4]
  mov DWORD PTR y, edx
  cmp eax, 50
  jne .L2
  add esi, 1
  cmp esi, 30
  jne .L3
  add DWORD PTR [esp], 1
  mov eax, DWORD PTR [esp]
  add ebp, 4
  cmp eax, 20
  jne .L4
  add esp, 8
  xor eax, eax
  pop ebx
  pop esi
  pop edi
  pop ebp
  ret

Why isn't the compiler moving the subarray and subsubarray calculation outside the inner loops?

random `volatile` does magic

I randomly added volatile to prevent DCE from getting rid of all the code and then somehow the loop invariants got hoisted out of the inner loops.

int main() {
    for (int i = 0; i < 20; i++) {
        for (int j = 0; j < 30; j++) {
            for (int k = 0; k < 50; k++) {
                const cell *subarray = (&phys_addr[i] + phys_addr[i]/sizeof(cell));
                const cell *subsubarray = (&subarray[j] + subarray[j]/sizeof(cell));
                volatile cell y = subsubarray[k];
            }
        } 
    }
    return 0;
}

This mostly wasn't because of y being a local variable since using std::cout << subsubarray[k]; prevented the optimization.

The assembly generated by gcc 8.2 with -O3 -m32 as flags for the aforementioned code is:

main:
  push ebp
  push edi
  xor edi, edi
  push esi
  push ebx
  sub esp, 20
  mov ebp, DWORD PTR phys_addr
.L4:
  mov eax, DWORD PTR [ebp+0+edi*4]
  xor ecx, ecx
  shr eax, 2
  add eax, edi
  lea ebx, [ebp+0+eax*4]
  lea esi, [ebx+200]
.L3:
  mov edx, DWORD PTR [ebx+ecx*4]
  mov DWORD PTR [esp], ecx
  shr edx, 2
  add edx, ecx
  sal edx, 2
  lea eax, [ebx+edx]
  add edx, esi
.L2:
  mov ecx, DWORD PTR [eax]
  add eax, 4
  mov DWORD PTR [esp+16], ecx
  cmp edx, eax
  jne .L2
  mov ecx, DWORD PTR [esp]
  add ecx, 1
  cmp ecx, 30
  jne .L3
  add edi, 1
  cmp edi, 20
  jne .L4
  add esp, 20
  xor eax, eax
  pop ebx
  pop esi
  pop edi
  pop ebp
  ret

The loop invariants are pushed out of the inner loops. What did the random volatile do to allow the GCC to optimize the invariants? The optimization does not happen when clang 6.0.0.

Did you actually measure which version was faster? Did you measure against a hand-coded piece of assembly you think is fastest? We're talking about optimizations without timings. That's like talking about spaghetti without pasta. There's no meat in the discussion. — rubenvb, Aug 06 '18 at 11:04
@rubenvb http://quick-bench.com/-0DqvKTgQTIoD7J0QcI-wAMmH2U pushing the invariants out appears to make the code 2.9 times faster — Yashas, Aug 06 '18 at 11:20
Cool site! I'm just guessing here but the naive implementation using std containers applied to model whatever magic you're doing with the array will probably perform quite well (I'm thinking std::array> or something similar, and then looping over these with iterators). I can't figure out whatever it is you're doing in that code, so I'm not going to attempt a bad replication of your voodoo magic code to see if I'm right. — rubenvb, Aug 06 '18 at 11:27
The host program I am working with has an unusual way of storing multi-dimensional arrays. It maintains an indirection table (a 1D array) before the actual data. The indirection table tells where the subarrays are stored. To obtain the address of the `n`th subarray, the value of the `n`th item in the indirection table array must be added to its address. This calculated address could point to the indirection table of the subarray if not 1D or to the subarray data itself if the subarray is 1D. Attempts at processing an array obtained from the host in C++ is the source of all that weird code. — Yashas, Aug 06 '18 at 11:41

score 2 · Accepted Answer · answered Aug 07 '18 at 23:50

It is not about random volatile fixing your problem - problem is deeper.

As you already guessed problem indeed relates to "y"

Check this example:

typedef int cell;

const cell *phys_addr = (const cell*)0x12340;

int main() {
    cell y = 1;
    for (int i = 0; i < 20; i++) {
        for (int j = 0; j < 30; j++) {
            for (int k = 0; k < 50; k++) {
                const cell *subarray = (&phys_addr[i] + phys_addr[i]/sizeof(cell));
                const cell *subsubarray = (&subarray[j] + subarray[j]/sizeof(cell));
                y /= subsubarray[k];
            }
        } 
    }
    return y;
}

I've used trick with division to avoid hard optimization (gcc can evaluate all loops and provide y directly in plain assignment; when using add, sub or multiply it will unroll innermost loop too - please play in godbolt to see how it looks)

Now disassembly looks like that: https://godbolt.org/g/R1EGSb

main:
  push ebp
  push edi
  push esi
  push ebx
  sub esp, 12
  mov eax, DWORD PTR phys_addr
  mov DWORD PTR [esp], 0
  mov DWORD PTR [esp+4], eax
  mov eax, 1
.L4:
  mov esi, DWORD PTR [esp]
  mov edi, DWORD PTR [esp+4]
  mov edx, DWORD PTR [edi+esi*4]
  mov DWORD PTR [esp+8], edx
  shr edx, 2
  add edx, esi
  xor esi, esi
  lea edi, [edi+edx*4]
  lea ebp, [edi+200]
.L3:
  mov ebx, DWORD PTR [edi+esi*4]
  shr ebx, 2
  add ebx, esi
  sal ebx, 2
  lea ecx, [edi+ebx]
  add ebx, ebp
.L2:
  cdq
  idiv DWORD PTR [ecx]
  add ecx, 4
  cmp ebx, ecx
  jne .L2
  add esi, 1
  cmp esi, 30
  jne .L3
  add DWORD PTR [esp], 1
  mov edi, DWORD PTR [esp]
  cmp edi, 20
  jne .L4
  add esp, 12
  pop ebx
  pop esi
  pop edi
  pop ebp
  ret
phys_addr:
  .long 74560

.L2 is innermost loop so code looks like expected - subarray and subsubarray are precomputed earlier.

So you may wonder - why is that when "y" is local all is ok and when global is not.

To be clear - "y" does not have to be declared in main. It could be made static like this

static cell y;
const cell * __restrict__ phys_addr = (const cell*)0x12340;

or use namespace

namespace wtf{ cell y; }
const cell * __restrict__ phys_addr = (const cell*)0x12340;

and than refer to y as wtf::y;

Still good.

All condenses to aliasing. To see it let's change y to pointer first:

typedef int cell;

cell *  y;
const cell *  phys_addr = (const cell*)0x12340;

int main() {
    cell ylocal;
    y = &ylocal;
    for (int i = 0; i < 20; i++) {
        for (int j = 0; j < 30; j++) {
            for (int k = 0; k < 50; k++) {
                const cell *subarray = (&phys_addr[i] + phys_addr[i]/sizeof(cell));
                const cell *subsubarray = (&subarray[j] + subarray[j]/sizeof(cell));
                *y /= subsubarray[k];
            }
        } 
    }
    return *y;
}

No loop optimization again....

It may be assumed that y and phys_addr overlap - writing y may modify some memory cells so all dictionaries have to be calculated with most up to date data (const in phys_addr means only your pointer should not modify memory, not that it is globally readonly).

But if you "promise" that those addresses do not overlap optimization will come back.

typedef int cell;

cell * __restrict__ y;
const cell * __restrict__ phys_addr = (const cell*)0x12340;

int main() {
    cell ylocal;
    y = &ylocal;
    for (int i = 0; i < 20; i++) {
        for (int j = 0; j < 30; j++) {
            for (int k = 0; k < 50; k++) {
                const cell *subarray = (&phys_addr[i] + phys_addr[i]/sizeof(cell));
                const cell *subsubarray = (&subarray[j] + subarray[j]/sizeof(cell));
                *y /= subsubarray[k];
            }
        } 
    }
    return *y;
}

TL;DR;

If you are using pointers compilator may not be able to prove addresses do not alias and will use safe path. If you are 100% sure they don't use restrict to inform it about that fact.

failure to optimize seemingly obvious loop invariant (but volatile qualifier did magic)

random volatile does magic

1 Answers1

random `volatile` does magic