73

Consider these two approaches that can represent an "optional int":

using std_optional_int = std::optional<int>;
using my_optional_int = std::pair<int, bool>;

Given these two functions...

auto get_std_optional_int() -> std_optional_int 
{
    return {42};
}

auto get_my_optional() -> my_optional_int 
{
    return {42, true};
}

...both g++ trunk and clang++ trunk (with -std=c++17 -Ofast -fno-exceptions -fno-rtti) produce the following assembly:

get_std_optional_int():
        mov     rax, rdi
        mov     DWORD PTR [rdi], 42
        mov     BYTE PTR [rdi+4], 1
        ret

get_my_optional():
        movabs  rax, 4294967338 // == 0x 0000 0001 0000 002a
        ret

live example on godbolt.org


Why does get_std_optional_int() require three mov instructions, while get_my_optional() only needs a single movabs? Is this a QoI issue, or is there something in std::optional's specification preventing this optimization?

Also please note that users of the functions might be completely optimized out regardless:

volatile int a = 0;
volatile int b = 0;

int main()
{
    a = get_std_optional_int().value();
    b = get_my_optional().first;
}

...results in:

main:
        mov     DWORD PTR a[rip], 42
        xor     eax, eax
        mov     DWORD PTR b[rip], 42
        ret
Martin Ba
  • 37,187
  • 33
  • 183
  • 337
Vittorio Romeo
  • 90,666
  • 33
  • 258
  • 416
  • 6
    `optional` is returned through a hidden pointer, meaning the type definition contains something prohibiting returning it through a register. – Jester Oct 03 '17 at 12:00
  • 1
    The obvious difference is that `std::pair` is an aggregate, while `std::optional` is not. Don't know if it should have an effect, but ya' know... – StoryTeller - Unslander Monica Oct 03 '17 at 12:08
  • Doesn't `optional` have to allocate properly sized and aligned storage and then use placement new to actually construct the value? – NathanOliver Oct 03 '17 at 12:11
  • 3
    Same problem with `boost::optional`, on any version of GCC, no fancy C++17 required for demo: https://godbolt.org/g/MV14mr – John Zwinck Oct 03 '17 at 12:12
  • 3
    The aggregate vs non-aggregate typing, the SYS V x64 ABI and the fact that 4294967338 is 0x10000002a should make this clear. – Margaret Bloom Oct 03 '17 at 12:21
  • 1
    Why does non-aggregates have to be returned on the stack? – Passer By Oct 03 '17 at 12:21
  • 1
    You probably know that, but *strictly* speaking your "optional" isn't optional at all. Using a pair works with types like `int` but will have non-optional semantics with for example non-default-constructable types. – Daniel Jour Oct 03 '17 at 17:04
  • 1
    @PasserBy: It's not on the stack, it's through a hidden pointer (which could point wherever the caller wants it to point, e.g. into an existing object in memory). I think the idea is that every instance of the object always has an address, because constructing the return-value object might embed pointers into it, so the callee needs to know that address. – Peter Cordes Oct 03 '17 at 17:20
  • I don't think it's the construction that's more expensive. I think it's returning it that's more expensive. If you try constructing it without returning it, you may see very different results. – David Schwartz Oct 03 '17 at 22:35
  • Can you try folly::Optional<>? https://github.com/facebook/folly/blob/master/folly/Optional.h – Wojciech Migda Oct 04 '17 at 11:10
  • 3
    @WojciechMigda `folly::Optional` doesn't have the necessary magic to make its special member functions conditionally trivial. (It also violates ODR by using the internal-linkage `None` in inline functions, and every single `constexpr` or `FOLLY_CPP14_CONSTEXPR` function is ill-formed NDR: you can't implement `optional`'s `constexpr` API with `aligned_storage`.) +1 for being `co_await`-able, but they'd be better off stealing the `optional` implementation from range-v3 and adding on the rest of their API. – Casey Oct 04 '17 at 15:03

4 Answers4

46

libstdc++ apparently does not implement P0602 "variant and optional should propagate copy/move triviality". You can verify this with:

static_assert(std::is_trivially_copyable_v<std::optional<int>>);

which fails for libstdc++, and passes for libc++ and the MSVC standard library (which really needs a proper name so we don't have to call it either "The MSVC implementation of the C++ standard library" or "The MSVC STL").

Of course MSVC still won't pass an optional<int> in a register because the MS ABI.

EDIT: This issue has been fixed in the GCC 8 release series.

Casey
  • 41,449
  • 7
  • 95
  • 125
  • `optional` still requires a destructor though, and that prevent from returning it in registers. – Maxim Egorushkin Oct 03 '17 at 14:27
  • 6
    @MaximEgorushkin it only needs a destructor when T has one. – ratchet freak Oct 03 '17 at 14:52
  • 1
    Was that paper adopted? It's not reflected in the [latest draft](http://eel.is/c++draft/optional.ctor#6). – Barry Oct 03 '17 at 14:58
  • 1
    @Barry Per [this](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/) it's sitting in Library Evolution right now. – NathanOliver Oct 03 '17 at 15:44
  • 1
    Correct, P0602 has not yet been adopted. The author made an attempt to communicate with the major implementors so that `optional` and `vector` could be "fixed" before people ship and lock in ABI. I believe libstdc++/libc++/MSFT `variant` all comply, as do libc++/MSFT `optional`, but apparently the libstdc++ `optional` maintainer did not get the memo. – Casey Oct 04 '17 at 14:25
  • P0602R4 was incorporated into the C++20 Working Draft in San Diego on 20181113. – Casey Nov 14 '18 at 03:04
  • Though the issue was fixed in GCC 8, it is present again in [trunk](https://godbolt.org/g/XWkbhn), GCC 10 and GCC 9 – ks1322 Oct 14 '20 at 10:27
19

Why does get_std_optional_int() require three mov instructions, while get_my_optional() only needs a single movabs?

The direct cause is that optional is returned through a hidden pointer while pair is returned in a register. Why is that, though? The SysV ABI specification, section 3.2.3 Parameter Passing says:

If a C++ object has either a non-trivial copy constructor or a non-trivial destructor, it is passed by invisible reference.

Sorting out the C++ mess that is optional is not easy, but there seem to be a non-trivial copy constructor at least in the optional_base class of the implementation I checked.

Toby Speight
  • 27,591
  • 48
  • 66
  • 103
Jester
  • 56,577
  • 4
  • 81
  • 125
  • 3
    Doesn't have to be, but could be, and that's what counts for the compiler. So you actually have to look at the implementation. – Jester Oct 03 '17 at 13:36
15

In Calling conventions for different C++ compilers and operating systems by Agner Fog it says that a copy constructor or destructor prevents from returning a structure in registers. This explains why optional is not returned in registers.

There has to be something else preventing the compiler from doing store merging (merges contiguous stores of immediate values narrower than a word into fewer wider stores to reduce the number of instructions)... Update: gcc bug 82434 - -fstore-merging does not work reliably.

Maxim Egorushkin
  • 131,725
  • 17
  • 180
  • 271
  • Any idea what this "something else" might be? – Daniel H Oct 03 '17 at 12:40
  • x86-64 does not have a `mov mem, imm64` so you can't merge the stores. – Jester Oct 03 '17 at 12:43
  • @Jester It could load `rax` with the merged value and store that into `[rdi]`. – Maxim Egorushkin Oct 03 '17 at 12:46
  • It certainly could, but I don't think that has any inherent speed advantage and this is shorter machine code. It certainly doesn't reduce the number of instructions. – Jester Oct 03 '17 at 13:07
  • @MaximEgorushkin the `get_std_optional_int():` returns pointer to the instance in `rax`, so if you would clobber it with immediate value, you would have to do another `mov rax,rdi` before `ret`. – Ped7g Oct 03 '17 at 13:09
  • It doesn't have to be `rax`, and it doesn't have to be "another". You could just do: `movabs rax, 4294967338; mov [rdi], rax; mov rax, rdi; ret`. – Jester Oct 03 '17 at 13:23
  • 1
    @Maxim: gcc7 implemented store coalescing (which gcc3 or gcc4 broke, so gcc for *years* has sucked at adjacent narrow assignments), but there's still a missed optimization in objects with padding. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82142. (clang seems to have the same missed optimization in this case.) gcc and clang will coalesce the stores if your write a function which stores through a `pair` pointer: https://godbolt.org/g/44zodQ. – Peter Cordes Oct 03 '17 at 17:14
  • @Jester: gcc/clang actually *do* `movabs rax, 4294967338` / `mov QWORD PTR [rdi], rax` if you store `{42,1}` into a `pair ` pointer. See previous comment. I agree this is questionable if you aren't bottlenecked on store throughput. – Peter Cordes Oct 03 '17 at 17:16
  • @PeterCordes It is not clear what exactly prevents store merging. [In this example just adding a destructor prevents it](https://godbolt.org/g/qzyufd). – Maxim Egorushkin Oct 03 '17 at 18:56
  • @MaximEgorushkin: That's disappointing. Probably because it's not trivially copyable. – Peter Cordes Oct 05 '17 at 05:48
3

The optimization is technically permitted, even with std::is_trivially_copyable_v<std::optional<int>> being false. However, it may require an unreasonable degree of "cleverness" for the compiler to find. Also, for the specific case of using std::optional as the return type of a function, the optimization may need to be done at link-time rather than compile-time.

Performing this optimization would have no effect on any (well-defined) program's observable behavior,* and is therefore implicitly allowed under the as-if rule. However, for reasons which are explained in other answers, the compiler has not been explicitly made aware of that fact and would need to infer it from scratch. Behavioral static analysis is inherently difficult, so the compiler may not be able to prove that this optimization is safe under all circumstances.

Assuming the compiler can find this optimization, it would then need to alter this function's calling convention (i.e. change how the function returns a given value), which normally needs to be done at link time because the calling convention affects all of the call sites. Alternatively, the compiler could inline the function entirely, which may or may not be possible to do at compile time. These steps would not be necessary with a trivially-copyable object, so in this sense the standard does inhibit and complicate the optimization.

std::is_trivially_copyable_v<std::optional<int>> ought to be true. If it were true, it would be much easier for compilers to discover and perform this optimization. So, to answer your question:

Is this a QoI issue, or is there something in std::optional's specification preventing this optimization?

It's both. The spec makes the optimization substantially harder to find, and the implementation is not "smart" enough to find it under those constraints.


* Assuming you haven't done something really weird, like #define int something_else.

Kevin
  • 28,963
  • 9
  • 62
  • 81
  • 1
    You're including the calling convention as part of "the implementation". That's technically true, but unless you enable whole-program / link-time optimization, it's not something that compilers on a given platform will even try to change other than by fully inlining, or making constant-propagation cloned versions of functions. – Peter Cordes Oct 03 '17 at 17:25
  • I guess gcc could emit something like `.gcc_reg_return_get_std_optional_int.clone123` as well as a normal ABI-compliant definition. Callers from other translation units couldn't assume this was present, though, so they'd have to call the regular version (unless you use LTO, in which case it would just inline because it's small). But if the function was actually large, then sure cloning an alternate-calling-convention version of the function would be useful. Probably most useful to return the parts in 2 separate regs, instead of packing into RAX. – Peter Cordes Oct 03 '17 at 17:29
  • @PeterCordes: Are you sure? The implementation of the *function* seems completely irrelevant here, only the implementation of `std::optional` should be necessary and since it is a template its implementation is always available. – Matthieu M. Oct 03 '17 at 17:58
  • @MatthieuM.: Yes, I'm confident that gcc and clang are following the C++ ABI they've decided to use, where **the criterion for packing into registers if small enough when returning or passing by value is `std::is_trivially_copyable_v`**. You could maybe change the C++ ABI to implement more complicated rules for the case of template classes where the destructor exists but is known to optimize away, or something. But this might require optimizations to be enabled always for compilers to agree with each other on how to pass certain objects. (e.g. inline and optimize away.) – Peter Cordes Oct 03 '17 at 18:16
  • The C++ ABI is not intended to be stable the way the C ABI / calling convention is ([x86-64 System V psABI](https://github.com/hjl-tools/x86-psABI/wiki/X86-psABI)), but it does have to be possible for different compilers to reliably agree with each other. – Peter Cordes Oct 03 '17 at 18:19
  • @PeterCordes: I added a note about link-time, but I consider this a very minor detail, frankly. Link-time optimization has been generally available for quite a while now, and I see little reason for most developers to turn it off. – Kevin Oct 04 '17 at 05:52
  • The build scripts in most open source projects don't enable it (unfortunately), because it's not on by default and there's no portable command-line option to enable it the way `-O2` is portable across different compilers. (Of course they should detect it along with other build-time detection, because it helps a lot). But anyway, the code in the question is showing the ABI-compliant definition. This is what you'd see if calling into a shared library or something, because LTO doesn't work across non-static library boundaries. – Peter Cordes Oct 04 '17 at 06:10
  • You might want to put more emphasis on this being the non-LTO version. (Hmm, I'm curious now what gcc will do with LTO enabled, or with `static` inside a single TU. I think `__attribute__((noinline))` is separate from `__attribute__((noclone))`, so it could still make a custom version. I suspect it won't invent a custom more-convention calling convention, though) – Peter Cordes Oct 04 '17 at 06:14
  • I tried it (by adding some busy-work to the function so it won't inline on its own, and making it `static`). gcc and clang stick to the normal calling convention, and don't even do constant-propagation from the constant return value. :/ https://godbolt.org/g/7sHCCt. So yeah, that's a missed optimization. (In clang's case, constant propagation for the return value fails even with `-stdlib=libc++`, which does implement `std::optional<>` that's trivially copyable and packs into a register. Leading to a 10-byte `movabs` to create a mask to `test` the return value...) – Peter Cordes Oct 04 '17 at 07:15
  • Submitted https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82432 re: the missed optimization, although that bug is more specifically about lack of constant propagation of constant return values from non-inline functions. – Peter Cordes Oct 04 '17 at 23:56
  • Turns out clang has constant propagation for `int` return values even without inlining. But not for aggregates like `optional`. Submitted https://bugs.llvm.org/show_bug.cgi?id=34839 and https://bugs.llvm.org/show_bug.cgi?id=34840. – Peter Cordes Oct 05 '17 at 05:46