While looking at some questions on optimization, this accepted answer for the question on coding practices for most effective use of the optimizer piqued my curiosity. The assertion is that local variables should be used for computations in a function, not output arguments. It was suggested this would allow the compiler to make additional optimizations otherwise not possible.
So, writing a simple bit of code for the example Foo class and compiling the code fragments with g++ v4.4 and -O2 gave some assembler output (use -S). The parts of the assembler listing with just the loop portion shown below. On examination of the output, it seems the loop is nearly identical for both, with just a difference in one address. That address being a pointer to the output argument for the first example or the local variable for the second.
There seems to no change in the actual effect whether the local variable is used or not. So the question breaks down to 3 parts:
a) is GCC not doing additional optimization, even given the hint suggested;
b) is GCC successfully optimizing in both cases, but should not be;
c) is GCC successfully optimizing in both cases, and is producing compliant output as defined by the C++ standard?
Here is the unoptimized function:
void DoSomething(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
for (int i=0; i<numFoo, i++)
{
barOut.munge(foo1, foo2[i]);
}
}
And corresponding assembly:
.L3:
movl (%esi), %eax
addl $1, %ebx
addl $4, %esi
movl %eax, 8(%esp)
movl (%edi), %eax
movl %eax, 4(%esp)
movl 20(%ebp), %eax ; Note address is that of the output argument
movl %eax, (%esp)
call _ZN3Foo5mungeES_S_
cmpl %ebx, 16(%ebp)
jg .L3
Here is the re-written function:
void DoSomethingFaster(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
Foo barTemp = barOut;
for (int i=0; i<numFoo, i++)
{
barTemp.munge(foo1, foo2[i]);
}
barOut = barTemp;
}
And here is the compiler output for the function using a local variable:
.L3:
movl (%esi), %eax ; Load foo2[i] pointer into EAX
addl $1, %ebx ; increment i
addl $4, %esi ; increment foo2[i] (32-bit system, 8 on 64-bit systems)
movl %eax, 8(%esp) ; PUSH foo2[i] onto stack (careful! from EAX, not ESI)
movl (%edi), %eax ; Load foo1 pointer into EAX
movl %eax, 4(%esp) ; PUSH foo1
leal -28(%ebp), %eax ; Load barTemp pointer into EAX
movl %eax, (%esp) ; PUSH the this pointer for barTemp
call _ZN3Foo5mungeES_S_ ; munge()!
cmpl %ebx, 16(%ebp) ; i < numFoo
jg .L3 ; recall incrementing i by one coming into the loop
; so test if greater