passing rvalue to non-ref parameter, why can't the compiler elide the copy?

Question

struct Big {
    int a[8];
};
void foo(Big a);
Big getStuff();
void test1() {
    foo(getStuff());
}

compiles (using clang 6.0.0 for x86_64 on Linux so System V ABI, flags: -O3 -march=broadwell) to

test1():                              # @test1()
        sub     rsp, 72
        lea     rdi, [rsp + 40]
        call    getStuff()
        vmovups ymm0, ymmword ptr [rsp + 40]
        vmovups ymmword ptr [rsp], ymm0
        vzeroupper
        call    foo(Big)
        add     rsp, 72
        ret

If I am reading this correctly, this is what is happening:

getStuff is passed a pointer into test1's stack frame (rsp + 40) to use for its return value, so after getStuff returns, rsp + 40 through to rsp + 71 contains the result of getStuff.
This result is then immediately copied to a lower stack address rsp through to rsp + 31.
foo is then called, which will read its argument from rsp.

Why is the following code not totally equivalent (and why doesn't the compiler generate it instead)?

test1():                              # @test1()
        sub     rsp, 32
        mov     rdi, rsp
        call    getStuff()
        call    foo(Big)
        add     rsp, 32
        ret

The idea is: have getStuff write directly to the place in the stack that foo will read from.

Also: Here is the result for the same code (with 12 ints instead of 8) compiled by vc++ on windows for x64, which seems even worse because the windows x64 ABI passes and returns by reference, so the copy is completely unused!

_TEXT   SEGMENT
$T3 = 32
$T1 = 32
?bar@@YAHXZ PROC                    ; bar, COMDAT

$LN4:
    sub rsp, 88                 ; 00000058H

    lea rcx, QWORD PTR $T1[rsp]
    call    ?getStuff@@YA?AUBig@@XZ         ; getStuff
    lea rcx, QWORD PTR $T3[rsp]
    movups  xmm0, XMMWORD PTR [rax]
    movaps  XMMWORD PTR $T3[rsp], xmm0
    movups  xmm1, XMMWORD PTR [rax+16]
    movaps  XMMWORD PTR $T3[rsp+16], xmm1
    movups  xmm0, XMMWORD PTR [rax+32]
    movaps  XMMWORD PTR $T3[rsp+32], xmm0
    call    ?foo@@YAHUBig@@@Z           ; foo

    add rsp, 88                 ; 00000058H
    ret 0

can you provide which calling convention you are compiling for? — pqnet, Mar 25 '18 at 10:21
Not sure this helps, but the standard says a function parameter cannot be NRVO'd. So there's at least one elision that cannot happen (except under the as-if rule, of course.) — juanchopanza, Mar 25 '18 at 10:36
According to the System V ABI `struct Big` isn't really big. It is actually borderline size to be considered of `INTEGER` class. Can you try a bigger size? — pqnet, Mar 25 '18 at 10:46
I stand corrected, apparently the misalignment of the `int` members causes the struct to be classified as `MEMORY` — pqnet, Mar 25 '18 at 10:58
@pqnet: IIRC, [x86-64 SysV](https://stackoverflow.com/questions/18133812/where-is-the-x86-64-system-v-abi-documented) packs structs into up to 2 integer registers, e.g. `rdx:rax` return values, so only for structs up to 16 bytes (not including padding). In this case we can see from the compiler output that it's returning via hidden reference and passing on the stack. — Peter Cordes, Mar 25 '18 at 11:00
@PeterCordes for return values i think that's the case, but for parameters INTEGER class is passed on registers, and the size for which a struct becomes MEMORY by sheer size is when it's bigger than 4 eightbytes. See https://software.intel.com/sites/default/files/article/402129/mpx-linux64-abi.pdf page 25 — pqnet, Mar 25 '18 at 11:04
@pqnet: That rule allows passing a `__m256i` in a YMM reg, and isn't the only rule for integer. You missed the footnote (10 in that old version of the ABI doc): *if the size of an object is larger than two eightbytes and the first eightbyte is not SSE or any other eightbyte is not SSEUP, it still has class MEMORY*. You can check what compilers do by looking at a simplified version on Godbolt: I removed the call to `foo`, leaving *just* the `getstuff()`. If it returns in registers, it turns into a `jmp` tailcall. If it returns by hidden pointer, it sets RDI https://godbolt.org/g/uHQ3GK — Peter Cordes, Mar 25 '18 at 11:24
It's a safe bet that compilers correctly follow the calling convention, especially if you check gcc and clang, especially for x86-64 System V, which is probably the most heavily used / well-tested target for those compilers. So checking compiler output is a useful way to see how the ABI rules apply to a specific case. — Peter Cordes, Mar 25 '18 at 11:24
Your Windows version uses a different `rsp` offset than what I got on Godbolt. I'm seeing x86-64 CL19 reserve 104 bytes on the stack (for `a[8]`). Or with `a[16]`, 168 bytes. How did you get 88? That's only enough for one copy of the struct. Did you actually use `a[6]`? That would explain copying 48 bytes instead of 32 or 64. The interesting thing there is that `T3 = T1`, so it's loading / storing back to the *same* place. It *did* do the optimization of reusing the same memory location, but didn't optimize away the copy. — Peter Cordes, Mar 26 '18 at 18:22

Peter Cordes · Accepted Answer · 2018-08-24T16:15:38.747

You're right; this looks like a missed-optimization by the compiler. You can report this bug (https://bugs.llvm.org/) if there isn't already a duplicate.

Contrary to popular belief, compilers often don't make optimal code. It's often good enough, and modern CPUs are quite good at plowing through excess instructions when they don't lengthen dependency chains too much, especially the critical path dependency chain if there is one.

x86-64 SysV passes large structs by value on the stack if they don't fit packed into two 64-bit integer registers, and returns them via hidden pointer. The compiler can and should (but doesn't) plan ahead and reuse the return value temporary as the stack-args for the call to foo(Big).

gcc7.3, ICC18, and MSVC CL19 also miss this optimization. :/ I put your code up on the Godbolt compiler explorer with gcc/clang/ICC/MSVC. gcc uses 4x push qword [rsp+24] to copy, while ICC uses extra instructions to align the stack by 32.
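At the source level the language does not ask for a copy here at all, which is part of why the extra load/store looks like a pure code-generation miss. A minimal sketch, assuming C++17 (where a parameter is initialized directly from a prvalue argument): `Tracked`, `copies`, and `copiesDuringCall` are hypothetical names, not from the question. Note that giving the type a user-provided copy constructor makes it non-trivially-copyable, so on x86-64 SysV it is passed by hidden reference rather than on the stack like `Big`; the sketch only shows what the abstract machine requires, not what the asm for `Big` looks like.

```cpp
// Sketch: count copy-constructor calls when a prvalue is passed by value.
// Tracked, copies and copiesDuringCall are hypothetical names for illustration.
struct Tracked {
    int a[8] = {};
    static int copies;  // incremented by every copy construction

    Tracked() = default;
    Tracked(const Tracked& other) {
        ++copies;
        for (int i = 0; i < 8; ++i) a[i] = other.a[i];
    }
    // Declaring the copy constructor suppresses the implicit move constructor,
    // so any move would also be counted as a copy.
};
int Tracked::copies = 0;

Tracked getStuff() { return Tracked{}; }  // returns a prvalue

int foo(Tracked t) { return t.a[0]; }     // by-value parameter

// Returns how many copy constructions foo(getStuff()) performed.
int copiesDuringCall() {
    Tracked::copies = 0;
    foo(getStuff());
    return Tracked::copies;
}
```

With C++17 rules the parameter `t` is the object that `getStuff()` initializes, so the count stays at zero. Any copy that shows up in the generated code for a trivially-copyable type is something the calling convention and the optimizer introduced, not something the source demanded.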

Using 1x 32-byte load/store instead of 2x 16-byte probably doesn't justify the cost of the vzeroupper for MSVC / ICC / clang, for a function this small. vzeroupper is cheap on mainstream Intel CPUs (only 4 uops), and I did use -march=haswell to tune for that, not for AMD or KNL where it's more expensive.

Related: x86-64 Windows passes large structs by hidden pointer, as well as returning them that way. The callee owns the pointed-to memory. (What happens at assembly level when you have functions with large inputs)

This optimization would still be available by simply reserving space for the temporary + shadow-space before the first call to getStuff(), and allowing the callee to destroy the temporary because we don't need it later.

That's not actually what MSVC does here or in related cases, though, unfortunately.
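The hand-off described above can be sketched in plain C++ by writing out the hidden pointers by hand. This is only an emulation under stated assumptions: `getStuffLowered`, `fooLowered`, `callWithCopy`, and `callWithHandOff` are hypothetical names, the function bodies are made up so there is something to observe, and real compilers do this lowering internally rather than in source.

```cpp
// Sketch: hand-written emulation of passing/returning by hidden pointer.
// All *Lowered names are hypothetical; the bodies are invented for illustration.
struct Big {
    int a[12];
};

// Stands in for `Big getStuff()`: the caller supplies the return slot.
void getStuffLowered(Big* ret) {
    for (int i = 0; i < 12; ++i) ret->a[i] = i;
}

// Stands in for `int foo(Big)`: the callee owns *arg and may use it as scratch.
int fooLowered(Big* arg) {
    int sum = 0;
    for (int i = 0; i < 12; ++i) {
        sum += arg->a[i];
        arg->a[i] = 0;  // clobbering is allowed because the callee owns the memory
    }
    return sum;
}

// What the compiler output in the question does: copy the temporary, pass the copy.
int callWithCopy() {
    Big ret;
    getStuffLowered(&ret);
    Big argCopy = ret;  // the redundant copy
    return fooLowered(&argCopy);
}

// The missed optimization: hand the temporary itself to the callee.
int callWithHandOff() {
    Big ret;
    getStuffLowered(&ret);
    return fooLowered(&ret);  // safe only because ret is never used again
}
```

Both callers observe the same result; the second is only valid because nothing reads `ret` after the call and nothing else holds a pointer to it.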

See also @BeeOnRope's answer, and my comments on it, on Why isn't pass struct by reference a common optimization?. Making sure the copy-constructor can always run at a sane place for non-trivially-copyable objects is problematic if you're trying to design a calling convention that avoids copying by passing by hidden const-reference (caller owns the memory, callee can copy if needed).

But this is an example of a case where non-const reference (callee owns the memory) is best, because the caller wants to hand off the object to the callee.

There's a potential gotcha, though: if there are any pointers to this object, letting the callee use it directly could introduce bugs. Consider some other function that does global_pointer->a[4]=0;. If our callee calls that function, it will unexpectedly modify our callee's by-value arg.

So letting the callee destroy our copy of the object in the Windows x64 calling convention only works if escape analysis can prove that nothing else has a pointer to this object.
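To make the gotcha concrete, here is a small sketch of the situation where an escaped pointer makes the copy necessary. The names `other` and `demo` are hypothetical, and `foo` is given a body and an `int` return value purely so the effect can be observed.

```cpp
// Sketch: why a by-value argument must not share storage with an object
// that something else can still reach through a pointer.
struct Big {
    int a[8];
};

Big* global_pointer = nullptr;  // an escaped pointer to the caller's object

// Some unrelated function that writes through the escaped pointer.
void other() { global_pointer->a[4] = 0; }

// foo's parameter is its own object; writes to the caller's object
// must not be visible through it.
int foo(Big arg) {
    other();
    return arg.a[4];
}

int demo() {
    Big obj{};
    obj.a[4] = 42;
    global_pointer = &obj;  // obj has escaped
    int seen = foo(obj);    // must pass a real copy of obj
    global_pointer = nullptr;
    return seen;
}
```

`demo()` has to return 42: `other()` zeroes `obj.a[4]`, but `arg` is a separate object. If the caller passed the address of `obj` itself as the hidden pointer, `foo` would see 0 instead, so the copy can only be skipped when the compiler can prove no such pointer exists.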

Thanks for the answer. I'm still suspicious, I feel like this is something so simple that compilers would be pretty good here. I will report a bug, we will see... R.e. the windows point, this might explain some of why I end up talking cross purposes with Windows people about copies at function call boundaries. — John_C, Mar 25 '18 at 11:33
@John_C: This is one of those things that's simple for humans but much less simple for compilers, especially portable compilers that optimize in a generic internal representation before emitting code for the actual target architecture. On several of the missed-optimization bug reports I've filed, gcc devs have said it's hard to fix because a later stage of compilation would need to know something that an earlier stage didn't pass on, or vice versa. (e.g. the RTL optimizer doesn't know whether values are signed: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82267#c1, and followup bug 85038.) — Peter Cordes, Mar 25 '18 at 11:45
Out of interest I tried this on windows and it does a copy too! See original post for details — John_C, Mar 26 '18 at 18:05
@John_C: yeah, I'm not surprised. It's still basically the same optimization, although it is even easier on Windows where you just have to pass a pointer. I'm not sure on Windows whether the callee is allowed to clobber the pointed-to copy of the data, i.e. whether it "owns" the arg it has a pointer to like with by-value stack args (or register args of course), or whether it's by const reference. Const-ref would allow passing pointers to non-stack memory, at least if you were sure the callee didn't also have a pointer to the static or heap memory you were passing, because arg values can't alias. — Peter Cordes, Mar 26 '18 at 18:15
Update: In Windows x64, the callee "owns" the pointed-to memory for args passed by hidden reference. So the caller always has to make a copy, unless inter-procedure optimization lets it know that that specific callee doesn't clobber the arg by using it as scratch space. — Peter Cordes, Aug 24 '18 at 12:00
I don't quite follow how that affects things here as there wouldn't be a semantic difference if `foo` clobbered or not, because we don't do anything with the result of `getStuff` after passing it to `foo` — John_C, Aug 24 '18 at 15:47
@John_C: Oh right, not *always* has to make a copy. As an optimization, it can let the callee destroy the temporary. But MSVC in practice misses that optimization. [What happens at assembly level when you have functions with large inputs](https://stackoverflow.com/q/50141623) — Peter Cordes, Aug 24 '18 at 16:01
@John_C: see also my last update to this answer with more gotchas. — Peter Cordes, Aug 24 '18 at 16:17
