Effectively passing a struct by reference even when the function declaration indicates pass-by-value is a common optimization: it's just that it usually happens indirectly via inlining, so it's not obvious from the generated code.
However, for this to happen, the compiler needs to know that the callee doesn't modify the passed object while it is compiling the caller. Otherwise, it is restricted by the platform/language ABI, which dictates exactly how values are passed to functions.
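For instance (a minimal sketch using a small hypothetical struct P, not taken from the question): if the callee writes to its parameter and the caller still reads its own copy afterwards, the as-if rule prevents the compiler from quietly reusing the caller's object:

struct P { int x, y; };      // hypothetical struct, just for illustration

int callee(P p) {
    p.x = 0;                 // writes only to callee's own copy
    return p.y;
}

int caller(P p) {
    int r = callee(p);       // must behave as if callee got its own copy...
    return r + p.x;          // ...because p.x here must still be the caller's value
}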
It can happen even without inlining!
Still, some compilers do implement this optimization even in the absence of inlining, although the circumstances are relatively limited, at least on platforms using the SysV ABI (Linux, OSX, etc) due to the constraints of stack layout. Consider the following simple example, based directly on your code:
__attribute__((noinline))
int foo(S s) {
    return s.i + s.j + s.k + s.l + s.m + s.n + s.o + s.p;
}

int bar(S s) {
    return foo(s);
}
Here, at the language level bar calls foo with pass-by-value semantics as required by C++. If we examine the assembly generated by gcc, however, it looks like this:
foo(S):
mov eax, DWORD PTR [rsp+12]
add eax, DWORD PTR [rsp+8]
add eax, DWORD PTR [rsp+16]
add eax, DWORD PTR [rsp+20]
add eax, DWORD PTR [rsp+24]
add eax, DWORD PTR [rsp+28]
add eax, DWORD PTR [rsp+32]
add eax, DWORD PTR [rsp+36]
ret
bar(S):
jmp foo(S)
Note that bar just directly calls foo, without making a copy: foo will use the same copy of s that was passed to bar (on the stack). In particular, bar doesn't make the copy implied by the language semantics (ignoring as-if). So gcc has performed exactly the optimization you requested. Clang doesn't do it, though: it makes a copy on the stack which it passes to foo().
Unfortunately, the cases where this can work are fairly limited: SysV requires that these large structures be passed on the stack in a specific position, so such re-use is only possible if the callee expects the object in exactly the same place.
That's possible in the foo/bar example since bar takes its S as the first parameter in the same way as foo, and bar does a tail call to foo, which avoids the need for the implicit return-address push that would otherwise ruin the ability to re-use the stack argument.
For example, if we simply add a + 1 to the call to foo:
int bar(S s) {
    return foo(s) + 1;
}
The trick is ruined, since now the position of bar::s is different from the location where foo will expect its s argument, and we need a copy:
bar(S):
push QWORD PTR [rsp+32]
push QWORD PTR [rsp+32]
push QWORD PTR [rsp+32]
push QWORD PTR [rsp+32]
call foo(S)
add rsp, 32
add eax, 1
ret
This doesn't mean that the caller bar() has to be totally trivial, though. For example, it could modify its copy of s prior to passing it along:
int bar(S s) {
    s.i += 1;
    return foo(s);
}
... and the optimization would be preserved:
bar(S):
add DWORD PTR [rsp+8], 1
jmp foo(S)
In principle, the possibility for this kind of optimization is much greater in the Win64 calling convention, which uses a hidden pointer to pass large structures. This gives a lot more flexibility in reusing existing structures on the stack or elsewhere in order to implement pass-by-reference under the covers.
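To sketch the idea (a hand-written illustration of the transformation, not actual compiler output; foo_lowered and bar_lowered are hypothetical names): when the callee receives a hidden pointer and the compiler can prove the callee never writes through it, the caller can pass the address of its existing object instead of the address of a fresh copy.

// Hypothetical lowering under a hidden-pointer convention such as Win64:
// foo(S) effectively receives a pointer to an S, so a caller that can prove
// foo never modifies the pointee may pass its own object's address directly.
int foo_lowered(const S* s);       // hypothetical "lowered" form of foo(S)

int bar_lowered(S s) {
    return foo_lowered(&s);        // no copy: reuse bar's existing s
}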
Inlining
All that aside, however, the main way this optimization happens is via inlining.
For example, at -O2 none of clang, gcc and MSVC makes any copy of the S object1. Both clang and gcc don't really create the object at all, but just calculate the result more or less directly without even referring to the unused fields. MSVC does allocate stack space for a copy but never uses it: it fills out only a single copy of S and reads from that, just like pass-by-reference (MSVC generates much worse code than the other two compilers for this case).
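A caller along the following lines (a sketch of the kind of example described in footnote 1, not necessarily the exact code, and assuming foo is not marked noinline for this test) shows the effect:

int main(int argc, char**) {
    S s{};
    s.i = argc;                // a couple of fields depend on argc, so the
    s.j = argc * 2;            // compiler can't fold the whole call to a constant
    return foo(s);             // at -O2 the call is inlined and no copy of s is made
}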
Note that even though foo is inlined into main, the compilers also generate a separate standalone copy of the foo() function, since it has external linkage and so could be called from another translation unit. In this, the compiler is restricted by the application binary interface: the SysV ABI (for Linux) or Win64 ABI (for Windows) defines exactly how values must be passed, depending on the type and size of the value. Large structures are passed on the stack (SysV) or by hidden pointer (Win64), and the compiler has to respect that when compiling foo. It also has to respect the ABI when compiling a caller of foo whose definition it cannot see, since it has no idea what foo will do.
So there is only a small window for the compiler to make an effective optimization which transforms pass-by-value to pass-by-reference, because:
1) If it can see both the caller and callee (main and foo in your example), it is likely that the callee will be inlined into the caller if it is small enough; and as the function becomes large and non-inlinable, the effect of fixed costs like calling-convention overhead becomes relatively smaller.
2) If the compiler cannot see both the caller and callee at the same time2, it generally has to compile each according to the platform ABI (see the sketch below). There is no scope for optimization of the call at the call site since the compiler doesn't know what the callee will do, and there is no scope for optimization within the callee because the compiler has to make conservative assumptions about what the caller did.
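For example (a sketch, assuming foo's definition lives in a different translation unit and link-time optimization is not used; call_foo is a hypothetical caller):

// other_file.cpp (sketch): foo's definition is not visible here, so the call
// below must follow the platform ABI exactly and a full copy of s is passed.
int foo(S s);             // declaration only; the definition is elsewhere

int call_foo(S s) {
    s.i += 1;
    return foo(s) + 1;    // not a tail call either, so the stack slot can't be reused
}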
1 My example is slightly more complicated than your original one, to avoid the compiler just optimizing everything away entirely (in particular, your code accesses uninitialized memory, so your program doesn't even have defined behavior): I populate a few of the fields of s with argc, which is a value the compiler can't predict.
2 That a compiler can see both "at the same time" generally means they are either in the same translation unit or that link-time optimization is being used.