Overhead for moving std::shared_ptr?

Question

Here is a C++ snippet. Func1 generates a shared object, which is directly moved into Func2. We think that there should not be any overhead in Func3. Putting this snippet into Compiler Explorer, we see code that is 2-3 times shorter with MSVC than with clang or GCC. Why is that, and can one obtain the shorter code with clang/GCC?

It looks like Func3 generates exception handling code for cleaning up the temporary shared object.

#include <memory>

std::shared_ptr<double> Func1();
void Func2 (std::shared_ptr<double> s);

void Func3()
{
  Func2(Func1());
}

Code length is not a good metric of performance. Also, you didn't use any optimization flags. — Marek R, Aug 08 '23 at 12:27
The godbolt link you gave shows unoptimized code (missing any option such as `/O2`), did you compare unoptimized code? That's possible of course, but if you do that I hope it's as a conscious decision, otherwise you're probably looking at the wrong thing. — harold, Aug 08 '23 at 12:29
Most (bigger) CPUs are not dumb things that execute one assembly instruction at a time; they have pipelines that can execute instructions in parallel as long as they don't influence each other. They also use caches, branch prediction, etc. So no, don't look at assembly instructions. When you think you have a performance issue there is only one thing for you to do: measure on your (target) hardware, and measure a lot. — Pepijn Kramer, Aug 08 '23 at 12:31
Also you are NOT moving shared_pointers but copying them. On top of that, moving shared_pointers is strange because a move models "transfer of ownership" and a shared_ptr is just that one thing that doesn't model exclusive ownership. So a `std::unique_ptr` would be a better choice to test move semantics. — Pepijn Kramer, Aug 08 '23 at 12:31
@PepijnKramer measuring without looking is one of the things that get people into trouble: often they measure something different than what they thought they were measuring, invalidating any conclusions. Definitely look. Of course a basic length comparison is too simplistic. — harold, Aug 08 '23 at 12:33
@harold I indeed failed to specify what to measure. I assumed speed, but you are right it could have been memory use or number of instructions. So OP what overhead are you trying to measure? — Pepijn Kramer, Aug 08 '23 at 12:34
@PepijnKramer well, I only meant speed, but yes, that too. I mean, what often happens is that an attempt is made to measure the speed of something, but instead the speed of something else ends up being measured (the speed of print statements, the speed of allocating memory and touching it for the first time, the speed of code that was optimized away, that sort of thing). — harold, Aug 08 '23 at 12:40
@PepijnKramer no copy should be taking place here. `Func1()` yields a prvalue which is fed into the function parameter for `Func2`, and this is mandatory copy elision. The code size difference looks to be the result of GCC and clang failing to realize that the destructor is a no-op in this case, so they emit (possibly useless) extra code. — Jan Schultke, Aug 08 '23 at 12:46
Interestingly, adding an (ostensibly redundant) `std::move` makes code generated by MSVC much more similar to that generated by gcc and clang. — n. m. could be an AI, Aug 08 '23 at 12:54
@JanSchultke Then I learned something today :) I thought copy elision only applied to return values, not to passing values to functions like Func2. No idea how I could have missed that for so long. — Pepijn Kramer, Aug 08 '23 at 13:19

Jan Schultke · Accepted Answer · 2023-08-08T13:34:20.137

The problem boils down to platform ABI, and is better illustrated by a completely opaque type:

struct A {
    A(const A&);
    A(A&&);
    ~A();
};

A make() noexcept;
void take(A) noexcept;

void foo() {
    take(make());
}

See comparison at Compiler Explorer

MSVC Output (32-bit x86)

void foo(void) PROC
        push    ecx
        push    ecx
        push    esp
        call    A make(void)
        add     esp, 4
        call    void take(A)
        add     esp, 8
        ret     0
void foo(void) ENDP

GCC Output (x86-64; clang is very similar)

foo():
        sub     rsp, 24
        lea     rdi, [rsp+15]
        call    make()
        lea     rdi, [rsp+15]
        call    take(A)
        lea     rdi, [rsp+15]
        call    A::~A() [complete object destructor]
        add     rsp, 24
        ret

If the type has a non-trivial destructor, the caller calls that destructor after control returns to it (including when the caller throws an exception).

- Itanium C++ ABI §3.1.2.3 Non-Trivial Parameters

Explanation

What takes place here is:

1. make() yields a prvalue of type A
2. this is fed into the parameter of take(A)
   - mandatory copy elision takes place, so there is no call to copy/move constructors
3. only GCC and clang destroy A at the call site
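
The copy elision step can be checked with a small sketch. It assumes C++17 or later, and the names Pinned, make_pinned, take_pinned and run_elision_demo are hypothetical, chosen only for this illustration. Because the copy and move constructors are deleted, the call would not compile if the prvalue had to be copied or moved into the parameter.

```cpp
#include <cassert>

// Hypothetical type for illustration: copy and move are deleted, so any
// attempt to copy or move a Pinned object is a compile-time error.
struct Pinned {
    explicit Pinned(int v) : value(v) {}
    Pinned(const Pinned&) = delete;
    Pinned(Pinned&&) = delete;
    int value;
};

// Returns a prvalue; with C++17 guaranteed elision no copy/move is needed.
Pinned make_pinned() { return Pinned{42}; }

// The by-value parameter is initialized directly from the prvalue.
int take_pinned(Pinned p) { return p.value; }

int run_elision_demo() {
    // Compiles only because no copy or move constructor is required here.
    int v = take_pinned(make_pinned());
    assert(v == 42);
    return v;
}
```

Calling run_elision_demo() from main is enough to exercise it. Elision removes the constructor call only; the parameter object still has to be destroyed somewhere, which is where the ABIs differ.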

MSVC instead destroys the temporary A (or in your case, std::shared_ptr) inside the callee, not at the call site. The extra code you're seeing is an inlined version of the std::shared_ptr destructor.
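
The difference in who destroys the parameter can be made observable with a sketch like the following. Tracer, events, mark and run_order_demo are hypothetical names for this illustration. The C++ standard leaves it implementation-defined whether a parameter is destroyed when the function returns or at the end of the enclosing full-expression, so the sketch records the order instead of assuming one.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical event log used only for this illustration.
std::vector<std::string>& events() {
    static std::vector<std::string> log;
    return log;
}

// Records when its destructor runs.
struct Tracer {
    Tracer() {}
    Tracer(const Tracer&) = delete;
    ~Tracer() { events().push_back("dtor"); }
};

Tracer make_tracer() { return Tracer{}; }

// By-value parameter: someone has to destroy it after the body runs.
void take_tracer(Tracer) { events().push_back("body"); }

int mark(const char* label) {
    events().push_back(label);
    return 0;
}

std::vector<std::string> run_order_demo() {
    events().clear();
    // Both calls belong to one full-expression; the comma operator
    // sequences take_tracer(...) before mark(...).
    (take_tracer(make_tracer()), mark("after"));
    return events();
}
```

With caller-side destruction at the end of the full-expression, as the Itanium ABI describes, one would expect "dtor" to be logged after "after". With callee-side destruction it would be logged before. In both cases "body" comes first.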

In the end, you shouldn't see any major performance impact as a result. However, if Func2 resets/releases the shared pointer, then most of the destructor code at the call site is dead, unfortunately. This ABI problem is similar to an issue with std::unique_ptr:
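
As a sketch of the dead-destructor case: the callee below moves from its parameter, so by the time the parameter is destroyed it is already empty. The names stored, make_value, consume and run_sink_demo are hypothetical and only stand in for Func1 and Func2. The caller cannot know this when the callee is opaque, so under the Itanium ABI it still has to emit the full destructor path.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical sink that keeps the object alive after the call.
std::shared_ptr<double>& stored() {
    static std::shared_ptr<double> slot;
    return slot;
}

// Stand-in for Func1: produces the shared object as a prvalue.
std::shared_ptr<double> make_value() {
    return std::make_shared<double>(1.5);
}

// Stand-in for Func2: takes ownership by moving out of the parameter.
bool consume(std::shared_ptr<double> s) {
    stored() = std::move(s);
    // A moved-from shared_ptr is guaranteed to be empty, so destroying
    // the parameter afterwards has nothing left to release.
    return s == nullptr;
}

long run_sink_demo() {
    bool param_was_empty = consume(make_value());
    assert(param_was_empty);
    // Only the sink owns the object; no extra reference was created.
    return stored().use_count();
}
```

Here run_sink_demo() returns 1, since the only owner left is the sink.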

There is also a language issue surrounding the order of destruction of function parameters and the execution of unique_ptr's destructor. For simplicity that is being ignored in this paper, but a complete solution to "unique_ptr is as cheap to pass a T*" would have to address that as well.

- P2028: What is ABI, and What Should WG21 Do About It?
