Common wisdom is that std::unique_ptr does not introduce a performance penalty (and no memory penalty when no custom deleter is used), but I recently stumbled over a discussion showing that it actually introduces an additional indirection, because a unique_ptr cannot be passed in a register on platforms that use the Itanium ABI. The example posted was similar to this:

#include <memory>

int foo(std::unique_ptr<int> u) {
    return *u;
}

int boo(int* i) {
    return *i;
}

This generates an additional assembly instruction in foo compared to boo:

foo(std::unique_ptr<int, std::default_delete<int> >):
        mov     rax, QWORD PTR [rdi]
        mov     eax, DWORD PTR [rax]
        ret
boo(int*):
        mov     eax, DWORD PTR [rdi]
        ret

The explanation was that the Itanium ABI demands that the unique_ptr not be passed in a register because of the non-trivial constructor, so it is created on the stack and the address of this object is then passed in a register.

I know that this does not really impact performance on a modern PC platform, but I am wondering if somebody could provide more details on the reasons why it shall not be copied to a register. Since zero-cost abstractions are one of the major goals of C++, I am wondering if this has been discussed in the standardization process as an accepted deviation or if it is a quality of implementation issue. The performance penalty is certainly small enough when considering the benefits, especially on modern PC platforms.

Commenters have pointed out that the two functions are not fully equivalent and that the comparison is therefore flawed, since foo will also call the deleter on the unique_ptr parameter while boo does not release the memory. However, I was only interested in the difference that results from passing a unique_ptr by value compared to passing a plain pointer. I've modified the example code and included a call to delete to free the plain pointer; the call is in the caller, because the unique_ptr's deleter also gets called in the caller's context, which makes the generated code more comparable. In addition, the manual delete also checks ptr != nullptr because the destructor does this as well. Still, foo does not receive the parameter in a register and has to do an indirect access.
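
A minimal sketch of what such a modified comparison could look like. The caller names `call_foo`, `call_boo`, and `check_same_result` are assumptions made up for this illustration, not the exact code from the edited example:

```cpp
#include <cassert>
#include <memory>

int foo(std::unique_ptr<int> u) { return *u; }
int boo(int* i) { return *i; }

// Hypothetical caller for the unique_ptr version: the argument object is
// created here, so under the Itanium ABI its destructor also runs here.
int call_foo(int value) {
    return foo(std::make_unique<int>(value));
}

// Hypothetical caller for the raw-pointer version. The explicit null check
// mirrors what unique_ptr's destructor does before invoking the deleter.
int call_boo(int value) {
    int* p = new int(value);
    int result = boo(p);
    if (p != nullptr) {
        delete p;
    }
    return result;
}

// Both paths observe the same value; only the parameter passing differs.
void check_same_result() {
    assert(call_foo(42) == call_boo(42));
}
```

Comparing the assembly of `foo` and `boo` for such a pair should still show the extra load in `foo`, since the cleanup happens in the callers either way.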

I also wonder why the compiler does not elide the check for nullptr before calling operator delete, since deleting a null pointer is defined to be a no-op anyway. I guess that unique_ptr could be specialized for the default deleter to not perform the check in the destructor, but that would be a very small micro-optimization.

Jens
  • This is not specific to `unique_ptr`, but likely applies to passing any non-trivially-constructible class type **by value**, as the example is doing. Plus, it is not really a fair comparison, because `foo()` has to destroy the `u` object upon exit, where `boo()` does not destroy its `i` parameter, so there is more code generated to handle that. – Remy Lebeau Jan 16 '19 at 21:51
  • Oh, sorry. Right idea, wrong on the details. The `foo` function could destroy an object. The `boo` function could not. So they're really not comparable. – David Schwartz Jan 16 '19 at 21:53
  • @RemyLebeau If this is a common behavior I am sure it was considered when `unique_ptr` was introduced, so it was a deliberate decision to introduce an abstraction with mandatory overhead. Didn't that spawn a lot of discussion in the standard committee? – Jens Jan 16 '19 at 21:54
  • @FrançoisAndrieux To my surprise, foo does not call the deleter of the `unique_ptr` parameter. As you can see in the example, the deleter gets called in the calling function `g` instead. – Jens Jan 16 '19 at 22:00
  • @Jens Yes, the call site is taking care of it. This is not my field of expertise but this sounds like a calling convention thing. – François Andrieux Jan 16 '19 at 22:04
  • @RemyLebeau The generated code for `foo` does not call the deleter which I find very strange; But still, even if it would and if I modified `boo` to also delete the passed object it would still be an overhead. A very small negligible one, but the additional instruction is still generated. I don't consider this a performance issue, but find it strange. – Jens Jan 16 '19 at 22:04
  • @Jens As far as I know, `std::unique_ptr` isn't guaranteed to be free of performance overhead. I believe this "common wisdom" comes from the fact that only simple optimizations are required to eliminate the overhead it introduces. If a particular implementation can't achieve that, it sounds like it might be a problem with the implementation. It's the same with `std::array` ([link](https://stackoverflow.com/questions/30263303/stdarray-vs-array-performance)): it's usually considered zero overhead, but that depends on inlining `operator[]` and other member function calls. – François Andrieux Jan 16 '19 at 22:07
  • These examples aren't analogous: the first function also frees the pointer on exit. If the function is not intended to free the pointer it should be `int foo(std::unique_ptr<int> const& u)` – M.M Jan 16 '19 at 23:34
  • @M.M I've added a second example which also deletes the pointer for `boo` to be more equivalent. However, this does not change the indirection when passing a `unique_ptr` by value compared to a plain pointer, so the parameter access of `unique_ptr` always requires an additional instruction (once per parameter only). – Jens Jan 17 '19 at 08:54
  • @M.M semantically yes, but in practice no - the codegen **does not** free the pointer upon the exit. – SergeyA Jan 17 '19 at 14:20
  • @FrançoisAndrieux The behavior is consistent across the platforms and compilers available on godbolt that were able to compile the example. This includes POWER, ARM, X86 and ARM64 with different ABIs. Since the "overhead" is a result of parameter passing, inlining will eliminate it (or let the compiler eliminate it) because there is no parameter passing anymore. However, this depends on the user code and not on the ability to inline library functions such as `std::array::operator[]`. – Jens Jan 18 '19 at 14:51

1 Answer

The System V ABI uses the Itanium C++ ABI and refers to it. In particular, the Itanium C++ ABI specifies that:

If the parameter type is non-trivial for the purposes of calls, the caller must allocate space for a temporary and pass that temporary by reference.

Specifically:

...

If the type has a non-trivial destructor, the caller calls that destructor after control returns to it (including when the caller throws an exception), at the end of enclosing full-expression.

So a simple answer to the question "why is it not passed in a register" is "because it can't be".
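
As a rough illustration of why `std::unique_ptr<int>` falls under this rule while `int*` does not, the following sketch approximates the ABI's notion of "trivial for the purposes of calls" with standard type traits. The name `roughly_trivial_for_calls` is made up for this example, and the check is a conservative approximation, not the ABI's exact definition:

```cpp
#include <memory>
#include <type_traits>

// Assumption: approximate "trivial for the purposes of calls" as
// "trivially copy constructible, trivially move constructible, and
// trivially destructible". The real ABI wording is more nuanced.
template <class T>
constexpr bool roughly_trivial_for_calls =
    std::is_trivially_copy_constructible<T>::value &&
    std::is_trivially_move_constructible<T>::value &&
    std::is_trivially_destructible<T>::value;

// A raw pointer passes the check, so it can travel in a register.
static_assert(roughly_trivial_for_calls<int*>,
              "int* should be trivial for calls");

// unique_ptr has a user-provided destructor and move constructor, so the
// caller has to materialize it in memory and pass its address instead.
static_assert(!roughly_trivial_for_calls<std::unique_ptr<int>>,
              "unique_ptr<int> should not be trivial for calls");
```

The same check should fail for any class with a user-provided destructor, which matches the comment that this is not specific to `unique_ptr`.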

Now, an interesting question might be 'why did the Itanium C++ ABI decide to go with that'.

While I wouldn't claim to have intimate knowledge of the rationale, two things come to mind:

  • This allows for copy elision if the argument to the function is a temporary
  • This makes tail-call optimizations more powerful. If the callee had to call the destructors of its arguments, TCO wouldn't be possible for any function that accepts non-trivial arguments.
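
The copy elision point can be sketched with a small tracing type. `Tracer`, `events`, `take`, and `run_demo` are hypothetical names used only for this illustration. When the argument is a temporary, the caller can construct the parameter object directly in the slot it allocated, so no copy or move should be recorded:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical event log used only for this illustration.
inline std::vector<std::string>& events() {
    static std::vector<std::string> log;
    return log;
}

// A type with non-trivial special members that records what happens to it.
struct Tracer {
    Tracer() { events().push_back("ctor"); }
    Tracer(const Tracer&) { events().push_back("copy"); }
    Tracer(Tracer&&) noexcept { events().push_back("move"); }
    ~Tracer() { events().push_back("dtor"); }
};

// By-value parameter, like foo(std::unique_ptr<int>) in the question.
void take(Tracer) { events().push_back("body"); }

std::vector<std::string> run_demo() {
    events().clear();
    // The comma expression keeps both steps inside one full-expression.
    // Whether "dtor" is logged before or after "same-expr" depends on the
    // ABI: the standard leaves the end of a parameter's lifetime
    // implementation-defined.
    (take(Tracer{}), events().push_back("same-expr"));
    // The parameter is constructed exactly once, before the body runs.
    assert(events().front() == "ctor");
    return events();
}
```

On an Itanium ABI platform I would expect the destructor entry to appear after `same-expr`, since the caller destroys the argument at the end of the full-expression, but that ordering is an expectation about the ABI rather than something the language guarantees.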
SergeyA
  • Thanks for the reference and the further explanation, which also explains why the deleter is called in the caller. I always wish that authors of standards would add rationales or publish an extra document with their rationales for later use. I also tested other platforms that are available on Godbolt and it seems to be consistent on ARM, ARM64 and POWER. – Jens Jan 16 '19 at 22:21
  • Also, if the destructor is not trivial, the object must be in memory in order to be destructed during stack unwinding. – Oliv Jan 17 '19 at 07:34
  • @Oliv I am not sure this is a reason for deleting it in the caller's context. The object is a parameter of the called function and the lifetime is coupled to this function; stack unwinding as a result of an exception can only happen in that function and there, the object can be deleted instead of the caller's context. – Jens Jan 17 '19 at 09:09
  • @Jens Or the exception could be raised in a callee of the callee, in which case the object would again have to be in memory. The reason objects are constructed in the caller's context is that only the caller knows how to build the arguments and in which order they are built. – Oliv Jan 17 '19 at 12:36