How llvm does O2 optimisation

Question

I am trying to see the IR of a very simple loop

for (int i = 0; i < 15; i++){
  a[b[i]]++;
}

while compile using -O0 and diving into the .ll file, I can see instructions written step by step in the define i32 @main() function. However, while compiling using -O2 and looking into the .ll file, there is only ret i32 0 in the define i32 @main() function. And some call instruction presented in the .ll file compiled by -O0 are changed to tail call in the .ll file compiled by -O2.

Can anyone give a rather detailed explanation on how llvm does the -O2 compilation? Thanks.

T

If you want to see the optimizations step-by-step, try: `clang -mllvm -print-after-all` — Joky, Feb 20 '17 at 05:00

Wander Nauta · Accepted Answer · 2017-02-03T17:19:57.590

We can use the Compiler Explorer at godbolt.org to look at your example. We'll use the following testbench code:

int test() {
  int a[15] = {0};
  int b[15] = {0};

  for (int i = 0; i < 15; i++){
    a[b[i]]++;
  }

  return 0;
}

Godbolt shows the x86 assembly, not the LLVM bytecode, but I've summarized it a bit to show what's going on. Here it is at -O0 -m32:

test():                              
        # set up stack
.LBB0_1:                                
        cmp     dword ptr [ebp - 128], 15           # i < 15?
        jge     .LBB0_4                             # no? then jump out of loop
        mov     eax, dword ptr [ebp - 128]          # load i
        mov     eax, dword ptr [ebp + 4*eax - 124]  # load b[i]
        mov     ecx, dword ptr [ebp + 4*eax - 64]   # load a[b[i]]
        add     ecx, 1                              # increment it
        mov     dword ptr [ebp + 4*eax - 64], ecx   # store it back
        mov     eax, dword ptr [ebp - 128]
        add     eax, 1                              # increment i
        mov     dword ptr [ebp - 128], eax
        jmp     .LBB0_1                             # repeat
.LBB0_4:
        # tear down stack
        ret

This looks like we'd expect: the loop is clearly visible and it does all the steps we listed. If we compile at -O1 -m32 -march=i386, we see the loop is still there but it's much simpler:

test():                               # @test()
        # set up stack
.LBB0_1:                               
        mov     ecx, dword ptr [esp + 4*eax]    # load b[i]
        inc     dword ptr [esp + 4*ecx + 60]    # increment a[b[i]]
        inc     eax                             # increment i
        cmp     eax, 15                         # compare == 15
        jne     .LBB0_1                         # no? then loop
        # tear down stack
        ret

Clang now uses the inc instruction (useful), noticed it could use the eax register for the loop counter i (neat), and moved the condition check to the bottom of the loop (probably better). We can still recognize our original code, though. Now let's try with -O2 -m32 -march=i386:

test():                               
        xor     eax, eax    # does nothing
        ret

That's it? Yes.

clang has detected that the a array can never be used outside of the function. This means doing the incrementing will never affect any other part of the program - and also that nobody will miss it when it's gone.

Removing the increment leaves a for loop with an empty body and no side effects, which can also be removed. In turn, removing the loop leaves an (for all intents and purposes) empty function.

This empty function is likely what you were seeing in the LLVM bytecode (ret i32 0).

This is not a very scientific description, and the steps clang takes might be different, but I hope the example clears it up a bit. If you want, you can read up on the as-if rule. I also recommend playing around on https://godbolt.org/ for a bit: see what happens when you move a and b outside the function, for example.

How llvm does O2 optimisation

1 Answers1