9

I wrote this snippet during a recent argument over the supposed speed difference between `array[i++]` and `array[i]; i++`.

int array[10];

int main(){
    int i=0;
    while(i < 10){
        array[i] = 0;
        i++;
    }
    return 0;
}
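For completeness, the two spellings from the argument can be checked side by side; a minimal sketch (the helper names `fill_postinc` and `fill_split` are mine, not from the original discussion):

```c
#include <string.h>

/* fill via the array[i++] spelling */
static void fill_postinc(int *array, int n) {
    int i = 0;
    while (i < n)
        array[i++] = 0;
}

/* fill via the array[i] = 0; i++; spelling */
static void fill_split(int *array, int n) {
    int i = 0;
    while (i < n) {
        array[i] = 0;
        i++;
    }
}
```

Both functions observably do the same thing, which is why a compiler is free to emit identical code for them.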

Snippet at the compiler explorer: https://godbolt.org/g/de7TY2

As expected, the compilers output identical asm for array[i++] and array[i]; i++ at -O1 and above. What surprised me, however, was the seemingly random placement of the xor eax, eax within the function at higher optimization levels.


GCC

At -O2, GCC places the xor just before the ret, as expected:

    mov     DWORD PTR [rax], 0
    add     rax, 4
    cmp     rax, OFFSET FLAT:array+40
    jne     .L2
    xor     eax, eax
    ret

At -O3, however, it places the xor after the second mov:

    mov     QWORD PTR array[rip], 0
    mov     QWORD PTR array[rip+8], 0
    xor     eax, eax
    mov     QWORD PTR array[rip+16], 0
    mov     QWORD PTR array[rip+24], 0
    mov     QWORD PTR array[rip+32], 0
    ret

ICC

ICC places it normally at -O1:

    push      rsi
    xor       esi, esi
    push      3
    pop       rdi
    call      __intel_new_feature_proc_init
    stmxcsr   DWORD PTR [rsp]
    xor       eax, eax
    or        DWORD PTR [rsp], 32832
    ldmxcsr   DWORD PTR [rsp]
..B1.2:
    mov       DWORD PTR [array+rax*4], 0
    inc       rax
    cmp       rax, 10
    jl        ..B1.2
    xor       eax, eax
    pop       rcx
    ret

but in a strange place at -O2:

    push      rbp
    mov       rbp, rsp
    and       rsp, -128
    sub       rsp, 128
    xor       esi, esi
    mov       edi, 3
    call      __intel_new_feature_proc_init
    stmxcsr   DWORD PTR [rsp]
    pxor      xmm0, xmm0
    xor       eax, eax
    or        DWORD PTR [rsp], 32832
    ldmxcsr   DWORD PTR [rsp]
    movdqu    XMMWORD PTR array[rip], xmm0
    movdqu    XMMWORD PTR 16+array[rip], xmm0
    mov       DWORD PTR 32+array[rip], eax
    mov       DWORD PTR 36+array[rip], eax
    mov       rsp, rbp
    pop       rbp
    ret       

and at -O3:

    and       rsp, -128
    sub       rsp, 128
    mov       edi, 3
    call      __intel_new_proc_init
    stmxcsr   DWORD PTR [rsp]
    xor       eax, eax
    or        DWORD PTR [rsp], 32832
    ldmxcsr   DWORD PTR [rsp]
    mov       rsp, rbp
    pop       rbp
    ret   

Clang

Only Clang places the xor directly in front of the ret at all optimization levels:

    xorps   xmm0, xmm0
    movaps  xmmword ptr [rip + array+16], xmm0
    movaps  xmmword ptr [rip + array], xmm0
    mov     qword ptr [rip + array+32], 0
    xor     eax, eax
    ret

Since GCC and ICC both do this at higher optimization levels, I presume there must be some kind of good reason.

Why do some compilers do this?

The code is of course semantically identical, and the compiler can reorder instructions as it wishes, but since the placement only changes at higher optimization levels, it must be the result of some optimization.

Azsgy
  • 3,139
  • 2
  • 29
  • 40
  • 5
    Like... why not? If the code is semantically correct, the compiler can produce code as weird as it wishes as long as it is optimal. Readability is not a requirement. – Eugene Sh. Aug 10 '17 at 20:06
  • Well, if it outputs weird code, somebody must have written the compiler to do that, and presumably for a good reason. Why make your life harder and change the order of instructions if there is no good reason? – Azsgy Aug 10 '17 at 20:09
  • 1
    There are algorithms which might have an output saying something like "place `xor ax, ax` *somewhere* between lines X and Y". Then some other algorithm will take this output and put it after `X` because it is traversing the lines from X to Y. Another algorithm will put it before Y, because it is traversing it in the opposite direction. (very abstract and simplified, of course..) – Eugene Sh. Aug 10 '17 at 20:14
  • 1
    It could be related to the processor's pipelining. – Barmar Aug 10 '17 at 20:25
  • 2
    Read up on _instruction scheduling_. – zwol Aug 11 '17 at 01:54
  • Different optimization levels have different goals. Some levels aim to minimize execution time, some aim to minimize code size, and some have a mixed goal. Also, modern processors often (usually) have a number of duplicate execution units, so the compiler has a choice of goals, of execution units, etc. Different compilers are written by different teams, and each team has its own objectives. – user3629249 Aug 12 '17 at 02:51

4 Answers

7

Since eax isn't used, compilers can zero the register whenever they want, and it works as expected.

An interesting thing that you didn't notice is in the icc -O2 version:

xor       eax, eax
or        DWORD PTR [rsp], 32832
ldmxcsr   DWORD PTR [rsp]
movdqu    XMMWORD PTR array[rip], xmm0
movdqu    XMMWORD PTR 16+array[rip], xmm0
mov       DWORD PTR 32+array[rip], eax   ; set to 0 using the value of eax
mov       DWORD PTR 36+array[rip], eax

Notice that eax is zeroed for the return value, but also used to zero two memory regions (the last two instructions), probably because the instruction using eax is shorter than the one with an immediate zero operand.

So two birds with one stone.

Jean-François Fabre
  • 137,073
  • 23
  • 153
  • 219
  • Thank you, I didn't see this and it certainly explains why ICC is setting the eax register early, so it can use it for loading 0 later. I'm curious why GCC does this. – Azsgy Aug 10 '17 at 20:18
  • 3
    Not sure and not an expert, but maybe doing that in the middle of some same instruction sequence could make a better use of the instruction pipeline (since they're not working on the same data, registers vs memory). – Jean-François Fabre Aug 10 '17 at 20:20
6

Different instructions have different latencies, and changing the order of instructions can sometimes speed up the code for several reasons. For example: if a certain instruction takes several cycles to complete and it sits at the end of the function, the program simply waits until it is done. If it comes earlier in the function, other things can happen while that instruction finishes. On second thought, though, that is unlikely to be the actual reason here, as xor of a register with itself is, I believe, a low-latency instruction. Latencies are processor dependent, though.

However, placing the XOR there may have to do with separating the mov instructions between which it is placed.

There are also optimizations that take advantage of the capabilities of modern processors, such as pipelining and branch prediction (not the case here as far as I can see). You need a pretty deep understanding of these capabilities to understand what an optimizer may do to exploit them.

You might find this informative. It pointed me to Agner Fog's site, a resource I had not seen before but which has a lot of the information you wanted (or didn't want :-) ) to know but were afraid to ask :-)

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Basya
  • 1,477
  • 1
  • 12
  • 22
  • 1
    The general idea actually seems precisely the point in this case: The `xor` instruction is mixed with memory accesses, which use different CPU resources, and which have high latency. As such, placing the `xor` right in the middle of these instructions guarantees that it's executed with zero overhead. – cmaster - reinstate monica Aug 10 '17 at 20:39
  • I like the way you explained that, maybe a bit more clearly than I did.... – Basya Aug 10 '17 at 20:40
  • 2
    Scheduling for latency is not super useful in the OoOE age though. And a self-xor doesn't even have a well-defined latency since it depends on nothing. – harold Aug 10 '17 at 20:43
  • @harold Even the most aggressive OoOE-processors have a limited execution window, and if they can't find an executable instruction within that window, they stall until some of the instructions become executable. If you have a large enough block of memory accesses followed by a large enough block of integer arithmetic instructions, the CPU won't be able to execute all the arithmetic instructions while it's waiting for the memory instructions to complete, simply because they are not visible yet. So, mixing different types of instructions is virtually always a good idea for a compiler. – cmaster - reinstate monica Aug 11 '17 at 09:10
  • Pipelining isn't really "modern", it's been out there forever. Even many tiny microcontrollers implement it. – Lundin Aug 11 '17 at 11:27
  • @cmaster limited is still big, extremely local scheduling like this does nothing – harold Aug 11 '17 at 12:42
  • 1
    @harold I'd agree that there is probably no performance difference in the given code. However, when it comes to handling code that does more, the compiler that produces well-mixed output is the compiler that produces the faster code. Or, in other words: It's just a reflex of a good compiler to mix its output well. Also, the compiler has no knowledge or control over how the function is called. If it places the `xor` right in the middle, it knows that it won't consume a cycle of its own. With the `xor` at the end, it may cause an extra cycle for some future instruction at the caller site. – cmaster - reinstate monica Aug 11 '17 at 12:54
  • 1
    @cmaster OK sure, it's not like scheduling is completely dead, but what's relevant (and what you seem to be describing) is scheduling for throughput. – harold Aug 11 '17 at 13:04
  • @Lundin: what a person calls "modern" is relative, I guess :-) – Basya Aug 12 '17 at 21:07
3

Those memory accesses are expected to burn at least several clock cycles each. You can move the xor without changing the functionality of the code. By pulling it back so that one or more memory accesses come after it, it becomes free: it costs no execution time because it runs in parallel with the external access (the processor finishes the xor and waits on the external activity, rather than just waiting on the external activity). If you put it in a clump of instructions without memory accesses, it costs at least a clock. And as you probably know, using the xor instead of a mov immediate reduces the size of the instruction, probably not costing clocks but saving space in the binary. A gee-whiz, kinda-cool optimization that dates back to the original 8086, and is still used today even if it doesn't save you much in the end.
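The size saving is real: xor eax, eax encodes in 2 bytes, while mov eax, 0 takes 5. The identity the idiom relies on, that any value xor-ed with itself is zero, can be sketched in C (the function name `xor_zero` is mine, for illustration only):

```c
#include <stdint.h>

/* x ^ x is 0 for any bit pattern, which is why
   "xor eax, eax" is a safe way to zero a register */
static uint32_t xor_zero(uint32_t x) {
    return x ^ x;
}
```

Modern x86 processors additionally recognize the xor-zeroing idiom as dependency-breaking, so it never waits on the previous value of the register.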

old_timer
  • 69,149
  • 8
  • 89
  • 168
-1

Where the compiler sets a particular value depends on the point where, walking the execution tree, it is sure that the register will not be needed anymore and will not be changed by the outside world.

Here is less trivial example: https://godbolt.org/g/6AowMJ

Here the compiler zeroes eax after the memset because memset can change its value. The exact moment depends on the traversal of a complex tree, and it may not look logical to a human.

0___________
  • 60,014
  • 4
  • 34
  • 74