Is returning a 2-tuple less efficient than std::pair?

Question

Consider this code:

#include <utility>
#include <tuple>

std::pair<int, int> f1()
{
    return std::make_pair(0x111, 0x222);
}

std::tuple<int, int> f2()
{
    return std::make_tuple(0x111, 0x222);
}

Clang 3 and 4 generate similar code for both on x86-64:

f1():
 movabs rax,0x22200000111
 ret    
f2():
 movabs rax,0x11100000222 ; opposite packing order, not important
 ret

But Clang 5 generates different code for f2():

f2():
 movabs rax,0x11100000222
 mov    QWORD PTR [rdi],rax
 mov    rax,rdi
 ret

As do GCC 4 through GCC 7:

f2():
 movabs rdx,0x11100000222
 mov    rax,rdi
 mov    QWORD PTR [rdi],rdx ; GCC 4-6 use 2 DWORD stores
 ret

Why is the generated code worse when returning a std::tuple that fits in a single register, vs std::pair? It seems especially strange since Clang 3 and 4 seemed to be optimal, yet 5 is not.

Try it here: https://godbolt.org/g/T2Yqrj

Related: [Why is std::pair faster than std::tuple](https://stackoverflow.com/q/26863852/364696) — ShadowRanger, Oct 24 '17 at 03:36
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80792 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78302 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71301 — Marc Glisse, Oct 24 '17 at 19:07
@JohnZwinck - you have any link for godbolt or any other evidence that `gcc` actually produced the code you show for `f2()` for any version from 4 up to (but not including) 7? Your link has clang and all gcc versions I tried showed different code with two `DWORD` stores to memory. — BeeOnRope, Oct 26 '17 at 01:25
@BeeOnRope: GCC 4 through 6 use two DWORD stores as you say. GCC 7 uses a single QWORD store as shown in my question. The distinction between these two cases wasn't important to me, so I didn't call it out. — John Zwinck, Oct 26 '17 at 02:02
Well it's important in as much as two stores sucks pretty much twice as much as one store, so since the focus is on optimization or lack thereof, it would be good to point out that most versions of gcc outside the bleeding edge are even worse (unlike the underlying ABI issue though, this one can be solved by more optimization). — BeeOnRope, Oct 26 '17 at 02:06
@BeeOnRope: That's fair. I've added a comment next to the QWORD asm. — John Zwinck, Oct 26 '17 at 05:43

BeeOnRope · Accepted Answer · 2019-06-10T03:43:33.203

The short answer is because the libstc++ standard library implementation used by gcc and clang on Linux implements std::tuple with a non-trivial move constructor (in particular, the _Tuple_impl base class has a non-trivial move constructor). On the other hand, the copy and move constructors for std::pair are all defaulted.

This in turn causes a C++-ABI related difference in the calling convention for returning these objects from functions, as well as passing them by value.

The Gory Details

You ran your tests on Linux, which adheres to the SysV x86-64 ABI. This ABI has specific rules for passing or returning classes or structures to functions, which you can read more about here. The specific case we are interested in with whether the two int fields in these structures will get the INTEGER class or the MEMORY class.

A recent version of the ABI specification has this to say:

The classification of aggregate (structures and arrays) and union types works as follows:

If the size of an object is larger than eight eightbytes, or it contains un- aligned fields, it has class MEMORY 12 .

If a C++ object has either a non-trivial copy constructor or a non-trivial destructor 13 , it is passed by invisible reference (the object is replaced in the parameter list by a pointer that has class INTEGER) 14 .

If the size of the aggregate exceeds a single eightbyte, each is classified separately. Each eightbyte gets initialized to class NO_CLASS.

Each field of an object is classified recursively so that always two fields are considered. The resulting class is calculated according to the classes of the fields in the eightbyte

It is condition (2) that applies here. Note that it mentions only copy constructors, and not move constructors - but it is fairly apparently that just is probably just a defect in the specification given the introduction of move constructors which generally need to be included in any classification algorithm where copy constructors were included before. In particular, IA-64 cxx-abi, which gcc is documented to follow does include move constructors:

If the parameter type is non-trivial for the purposes of calls, the caller must allocate space for a temporary and pass that temporary by reference. Specifically:

Space is allocated by the caller in the usual manner for a temporary, typically on the stack.

and then the definition of non-trivial:

A type is considered non-trivial for the purposes of calls if:

it has a non-trivial copy constructor, move constructor, or destructor, or

all of its copy and move constructors are deleted.

So because tuple is not considered to be trivially copyable from an ABI perspective, it gets MEMORY treatment, which means that your function must populate the stack allocated object passed in by the called in rdi. The std::pair function can just pass back the entire structure in rax since it fits in one EIGHTBYTE and has class INTEGER.

Does it matter? Yeah, strictly speaking, a standalone function like the one you have compiled will be less efficient for tuple since this ABI different is "baked in".

Often however, the compiler will be able to see the body of the function and inline it or perform inter-procedural analysis even if not inlined. In both cases, the ABI is no longer important and it is likely both approaches would be equally efficient, at least with a decent optimizer. For example let's call your f1() and f2() functions and do some math on the result:

int add_pair() {
  auto p = f1();
  return p.first + p.second;
}

int add_tuple() {
  auto t = f2();
  return std::get<0>(t) + std::get<1>(t);
}

In principle the add_tuple method starts from a disadvantage, since it has to call f2() which is less efficient and it also has to create a temporary tuple object on the stack so it can pass it to f2 as the hidden parameter. Well no matter, both functions are fully optimized to just return the right value directly:

add_pair():
  mov eax, 819
  ret
add_tuple():
  mov eax, 819
  ret

So overall you can say that the effect of this ABI issue with tuple will be relatively muted: it adds a small fixed overhead to functions that must comply with the ABI, but this will only really matter in a relative sense for very small functions - but such functions are likely to be declared in a place where they can be inlined (or if not, you are leaving performance on the table).

libcstc++ vs libc+++

As explained above, this is an ABI issue, not an optimization issue, per se. Both clang and gcc are already optimizing the library code to maximum extent possible under the constraints of the ABI - if they generated code like f1() for the std::tuple case they would break ABI compliant callers.

You can see this clearly if you switch to using libc++ rather than the Linux default of libstdc++ - this implementation doesn't have the explicit move constructor (as Marc Glisse mentions in the comments, they are stuck with this implementation for backwards compatibility). Now clang (and presumably gcc although I didn't try it), generates the same optimal code in both cases:

f1():                                 # @f1()
        movabs  rax, 2345052143889
        ret
f2():                                 # @f2()
        movabs  rax, 2345052143889
        ret

Earlier Versions of Clang

Why do versions of clang compile it differently? It was simply a bug in clang or a bug in the spec depending on how you look at it. The spec didn't explicitly include move construction in the cases where a hidden pointer to a temporary needed to be passed. wasn't conforming to the IA-64 C++ ABI. For example compiled the way clang used to do it was not compatible with gcc or newer versions of clang. The spec was eventually updated and the clang behavior changed in version 5.0.

Update: Marc Glisse mentions in the comments that there was initially confusion about the interaction of non-trivial move constructors and the C++ ABI, and clang changed their behavior at some point, which probably explains the switch:

The ABI specification for some argument passing cases involving move constructors were unclear, and when they were clarified, clang changed to follow the ABI. This is probably one of those cases.

Why is `std::pair` trivially copyable while `std::tuple` isn't? I'd expect them both to be. Either way, doesn't this just need a trivial move constructor and destructor, and not a trivial copy constructor? — Daniel H, Oct 24 '17 at 04:25
@DanielH it isn't (in libstdc++) because the original implementation wasn't, and then not breaking the ABI made it impossible to change :-( — Marc Glisse, Oct 24 '17 at 05:53
To be fair, @DanielH question is valid: on my implementation `std::is_trivially_copyable` returns false for _both_ `tuple` and `pair`, so the C++ standard definition of that term should not be what's causing the difference. A a reading of the IA-64 C++ ABI indicates that to be passable in registers (which also means without the hidden pointer when used as a return type), a C++ class must not have _either a non-trivial copy constructor or a non-trivial destructor_, and _A de/constructor is trivial if it is an implicitly-declared default_. Neither `tutpe` nor `pair` satisfy that. — BeeOnRope, Oct 24 '17 at 05:57
@DanielH - I updated my answer: it is because `std::tuple` has a non-trivial move constructor, but `std::pair` does not. Yes, this is a violation of the standard which [specifies](http://en.cppreference.com/w/cpp/utility/tuple/tuple) that `tuple` should have a `default` move constructor. — BeeOnRope, Oct 24 '17 at 07:11
@BeeOnRope: Do you mean to say that the GNU standard library has a defective implementation of `std::tuple`? If so, won't it be basically impossible to change now? I'm also still a bit lost as to why Clang 3 and 4 did a better job optimizing `std::tuple` (I guess they're using a different standard library on Godbolt where I tested them, but if they can do it, why can't Clang 5?). — John Zwinck, Oct 24 '17 at 13:21
@JohnZwinck - correct, the implementation is not standards compliant in this area, if that's what you mean by defective. Whether the explicit move constructor helps or hurts performance overall, I'm not sure. Note that it has nothing to do with "optimization" per se, it's all about the ABI. Both `clang` and `gcc` are already optimizing the library code to maximum extent possible under the constraints of the ABI. If they generated code like `f1()` for standard tuple, it would _fail_ at runtime because the caller wouldn't get the expected result. — BeeOnRope, Oct 24 '17 at 18:40
The ABI specification for some argument passing cases involving move constructors were unclear, and when they were clarified, clang changed to follow the ABI. This is probably one of those cases. — Marc Glisse, Oct 24 '17 at 19:04
@LightnessRacesinOrbit, No; the terminology in these ABIs is: "byte refers to an 8-bit object... eightbyte refers to a 64-bit object". — Paul Du Bois, Apr 21 '20 at 22:02
Is the non-standards-compliance of libstdc++ something, that they can change with the upcoming new C++20 standard? E.G., that the move constructor of `std::tuple` will be trivial, as specified in the standard, when compiling with `-std=c++20`, but will be as is currently with other compiler flags? Or is it required, that modules compiled with different standards can be linked together? — Kai Petzke, Oct 16 '20 at 13:55

Is returning a 2-tuple less efficient than std::pair?

1 Answers1

The Gory Details

libcstc++ vs libc+++

Earlier Versions of Clang

Linked

Related