
While looking at some questions on optimization, this accepted answer for the question on coding practices for most effective use of the optimizer piqued my curiosity. The assertion is that local variables should be used for computations in a function, not output arguments. It was suggested that this would allow the compiler to make additional optimizations that are otherwise not possible.

So, writing a simple bit of code for the example Foo class and compiling the fragments with g++ v4.4 and -O2 (use -S for assembler output) gave the listings below, trimmed to just the loop portions. On examination of the output, the loops appear nearly identical for both versions, differing in only one address: a pointer to the output argument in the first example, versus the local variable in the second.

There seems to be no change in the actual effect whether or not the local variable is used. So the question breaks down into three parts:

a) Is GCC not doing the additional optimization, even given the suggested hint?

b) Is GCC successfully optimizing in both cases, but should not be?

c) Is GCC successfully optimizing in both cases, and producing compliant output as defined by the C++ standard?

Here is the unoptimized function:

void DoSomething(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
    for (int i=0; i<numFoo; i++)
    {
         barOut.munge(foo1, foo2[i]);
    }
}

And corresponding assembly:

.L3:
    movl    (%esi), %eax
    addl    $1, %ebx
    addl    $4, %esi
    movl    %eax, 8(%esp)
    movl    (%edi), %eax
    movl    %eax, 4(%esp)
    movl    20(%ebp), %eax       ; Note address is that of the output argument
    movl    %eax, (%esp)
    call    _ZN3Foo5mungeES_S_
    cmpl    %ebx, 16(%ebp)
    jg      .L3

Here is the re-written function:

void DoSomethingFaster(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
    Foo barTemp = barOut;
    for (int i=0; i<numFoo; i++)
    {
         barTemp.munge(foo1, foo2[i]);
    }
    barOut = barTemp;
}

And here is the compiler output for the function using a local variable:

.L3:
    movl    (%esi), %eax          ; Load foo2[i] pointer into EAX
    addl    $1, %ebx              ; increment i
    addl    $4, %esi              ; advance the foo2 pointer by 4 bytes (32-bit; would be 8 on 64-bit)
    movl    %eax, 8(%esp)         ; PUSH foo2[i] onto stack (careful! from EAX, not ESI)
    movl    (%edi), %eax          ; Load foo1 pointer into EAX
    movl    %eax, 4(%esp)         ; PUSH foo1
    leal    -28(%ebp), %eax       ; Load barTemp pointer into EAX
    movl    %eax, (%esp)          ; PUSH the this pointer for barTemp
    call    _ZN3Foo5mungeES_S_    ; munge()!
    cmpl    %ebx, 16(%ebp)        ; i < numFoo
    jg      .L3                   ; recall incrementing i by one coming into the loop
                                  ; so test if greater
casualcoder
  • The linked to question was community wiki, so if anyone wants to make this one also, please do so. I cannot see a way to do it. – casualcoder Oct 14 '11 at 00:05
  • What is the source code we are talking about here? Could you post the code and maybe mark what part of the code corresponds to the assembler excerpts you posted? – Janick Bernet Oct 14 '11 at 00:11
  • I've moved the code from the linked question into this one, but _please_ replace it if you used different code. It _looks_ about the same, but specifics make all the difference. – sarnold Oct 14 '11 at 00:44
  • IMO, by using only one compiler and only one platform you're wandering off to a "dark side" of bad habits. You're mentioning only GCC, but there are other compilers out there, and there are different cpus. If they DON'T behave the same way as GCC does, your "optimization" will be a waste of time. I'd advise to stick with C++ standard and perform low-level "optimization" like this only if situation demands it (use profilers to check that), but even then you could use assembly to *enforce* optimization you want instead of trying to predict what optimizer is going to do with your code. – SigTerm Oct 14 '11 at 00:59
  • @sarnold, it is identical code. – casualcoder Oct 14 '11 at 02:03
  • @SigTerm, the question is not whether this optimization is made for GCC, but whether GCC made use of this optimization. Confusingly similar, but not the same. Any comparison with other compilers is welcome! – casualcoder Oct 14 '11 at 02:04

2 Answers


The example given in that answer was not a very good one, because it calls an unknown function that the compiler cannot reason much about. Here's a better example:

void FillOneA(int *array, int length, int& startIndex)
{
    for (int i = 0; i < length; i++) array[startIndex + i] = 1;
}

void FillOneB(int *array, int length, int& startIndex)
{
    int localIndex = startIndex;
    for (int i = 0; i < length; i++) array[localIndex + i] = 1;
}

The first version optimizes poorly because it needs to protect against the possibility that somebody called it as

int array[10] = { 0 };
FillOneA(array, 5, array[1]);

resulting in {1, 1, 0, 1, 1, 1, 0, 0, 0, 0 } since the iteration with i=1 modifies the startIndex parameter.

The second one doesn't need to worry about the possibility that the assignment array[localIndex + i] = 1 will modify localIndex, because localIndex is a local variable whose address has never been taken.

In assembly (Intel notation, because that's what I use):

FillOneA:
    mov     edx, [esp+8]
    xor     eax, eax
    test    edx, edx
    jle     $b
    push    esi
    mov     esi, [esp+16]
    push    edi
    mov     edi, [esp+12]
$a: mov     ecx, [esi]
    add     ecx, eax
    inc     eax
    mov     [edi+ecx*4], 1
    cmp     eax, edx
    jl      $a
    pop     edi
    pop     esi
$b: ret

FillOneB:
    mov     ecx, [esp+8]
    mov     eax, [esp+12]
    mov     edx, [eax]
    test    ecx, ecx
    jle     $a
    mov     eax, [esp+4]
    push    edi
    lea     edi, [eax+edx*4]
    mov     eax, 1
    rep stosd
    pop     edi
$a: ret

ADDED: Here's an example where the compiler's insight is into Bar, and not munge:

class Bar
{
public:
    float getValue() const
    {
        return valueBase * boost;
    }

private:
    float valueBase;
    float boost;
};

class Foo
{
public:
    void munge(float adjustment);
};

void Adjust10A(Foo& foo, const Bar& bar)
{
    for (int i = 0; i < 10; i++)
        foo.munge(bar.getValue());
}

void Adjust10B(Foo& foo, const Bar& bar)
{
    Bar localBar = bar;
    for (int i = 0; i < 10; i++)
        foo.munge(localBar.getValue());
}

The resulting code is

Adjust10A:
    push    ecx
    push    ebx
    mov     ebx, [esp+12] ;; foo
    push    esi
    mov     esi, [esp+20] ;; bar
    push    edi
    mov     edi, 10
$a: fld     [esi+4] ;; bar.valueBase
    push    ecx
    fmul    [esi] ;; valueBase * boost
    mov     ecx, ebx
    fstp    [esp+16]
    fld     [esp+16]
    fstp    [esp]
    call    Foo::munge
    dec     edi
    jne     $a
    pop     edi
    pop     esi
    pop     ebx
    pop     ecx
    ret     0

Adjust10B:
    sub     esp, 8
    mov     ecx, [esp+16] ;; bar
    mov     eax, [ecx] ;; bar.valueBase
    mov     [esp], eax ;; localBar.valueBase
    fld     [esp] ;; localBar.valueBase
    mov     eax, [ecx+4] ;; bar.boost
    mov     [esp+4], eax ;; localBar.boost
    fmul    [esp+4] ;; localBar.getValue()
    push    esi
    push    edi
    mov     edi, [esp+20] ;; foo
    fstp    [esp+24]
    fld     [esp+24] ;; cache localBar.getValue()
    mov     esi, 10 ;; loop counter
$a: push    ecx
    mov     ecx, edi ;; foo
    fstp    [esp] ;; use cached value
    call    Foo::munge
    fld     [esp]
    dec     esi
    jne     $a ;; loop
    pop     edi
    fstp    ST(0)
    pop     esi
    add     esp, 8
    ret     0

Observe that the inner loop in Adjust10A must recalculate the value since it must protect against the possibility that foo.munge changed bar.

That said, this style of optimization is not a slam dunk. (For example, we could've gotten the same effect by manually caching bar.getValue() into localValue.) It tends to be most helpful for vectorized operations, since those can be parallelized.
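For reference, that manual-caching variant might look like the following sketch. The inline Foo::munge here is hypothetical, added only so the fragment is self-contained; the original munge is defined elsewhere.

```cpp
#include <cassert>

class Bar
{
public:
    Bar(float v, float b) : valueBase(v), boost(b) {}
    float getValue() const { return valueBase * boost; }

private:
    float valueBase;
    float boost;
};

// Hypothetical Foo: munge() just accumulates, so the sketch is runnable.
class Foo
{
public:
    float total = 0.0f;
    void munge(float adjustment) { total += adjustment; }
};

void Adjust10C(Foo& foo, const Bar& bar)
{
    // Compute bar.getValue() once, outside the loop, instead of
    // copying the whole Bar object.
    const float localValue = bar.getValue();
    for (int i = 0; i < 10; i++)
        foo.munge(localValue);
}
```

This achieves the same effect as Adjust10B, but caches only the computed float rather than the whole object.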

Raymond Chen
  • I get similar results for your example. It seems that for this optimization trick to work, the compiler must have clear access to all the code being optimized. In conclusion, the answer to my question is a), GCC is not using the optimization, and probably is not expected to. – casualcoder Oct 14 '11 at 02:43
  • You are more likely to see the optimization have an effect if the object being cached into a local variable fits in a register. – Raymond Chen Oct 14 '11 at 13:18

First, I'm going to assume munge() cannot be inlined - that is, its definition is not in the same translation unit; you haven't provided complete source, so I can't be entirely sure, but it would explain these results.

Since foo1 is passed to munge as a reference, at the implementation level, the compiler just passes a pointer. If we just forward our argument, this is nice and fast - any aliasing issues are munge()'s problem - and have to be, since munge() can't assume anything about its arguments, and we can't assume anything about what munge() might do with them (since munge()'s definition is not available).

However, if we copy to a local variable, we must actually perform the copy and then pass a pointer to that local variable. This is because munge() can observe a difference in behavior - if it takes the pointer to its first argument, it can see that it's not equal to &foo1. Since munge()'s implementation is not in scope, the compiler can't assume it won't do this.

This local-variable-copy trick thus ends up pessimizing, rather than optimizing - the optimizations it is meant to enable are not possible here, because munge() cannot be inlined; for the same reason, the local variable actively hurts performance.

It would be instructive to try this again, making sure munge() is non-virtual and available as an inlinable function.
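A sketch of such a setup might look like this, with a hypothetical, trivially inlinable munge defined in the same translation unit standing in for the real one:

```cpp
#include <cassert>

// Hypothetical Foo whose munge() is visible to the compiler, so it can
// be inlined and the surrounding loop optimized across the call.
class Foo
{
public:
    int sum = 0;
    void munge(const Foo& a, const Foo& b) { sum += a.sum + b.sum; }
};

void DoSomethingFaster(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
    Foo barTemp = barOut;  // unaliased local; safe to keep in a register
    for (int i = 0; i < numFoo; i++)
        barTemp.munge(foo1, foo2[i]);
    barOut = barTemp;
}
```

With everything in one translation unit like this, the optimizer can inline munge and keep barTemp out of memory entirely, which is the situation the original advice assumes.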

bdonlan
  • I would also say there is a major difference between a function that can be inlined and one where the this pointer will have to be passed to, generally through the stack. – Janick Bernet Oct 14 '11 at 01:05
  • @inflagranti, even if it can't be inlined (eg, due to being too large), if the definition is available, the compiler may be able to optimize based on its behavior – bdonlan Oct 14 '11 at 01:19
  • 1
    My implementation of Foo is fairly trivial, so the optimizer has its way with it when inlined. However, it is clear that for the code I have, that the optimized code saves a load and store of the result variable in the loop when munge() is inlined. So, it seems that if the compiler has to call a function without being able to inline it, then the optimization given does nothing useful. – casualcoder Oct 14 '11 at 02:36