
Given the following struct...

#include <type_traits>

struct C {
    long a[16]{};
    long b[16]{};

    C() = default;
};

// For godbolt
C construct() {
    static_assert(not std::is_trivial_v<C>);
    static_assert(std::is_standard_layout_v<C>);

    C c;
    return c;
}

...gcc (version 10.2 on x86-64 Linux) with optimization enabled (at all 3 levels) produces the following assembly[1] for construct:

construct():
        mov     r8, rdi
        xor     eax, eax
        mov     ecx, 32
        rep stosq
        mov     rax, r8
        ret

Once I provide an empty default constructor...

#include <type_traits>

struct C {
    long a[16]{};
    long b[16]{};

    C() {}  // <-- The only change
};

// For godbolt
C construct() {
    static_assert(not std::is_trivial_v<C>);
    static_assert(std::is_standard_layout_v<C>);

    C c;
    return c;
}

...the generated assembly changes to initialize each array separately, instead of the single memset of the original:

construct():
        mov     rdx, rdi
        mov     eax, 0
        mov     ecx, 16
        rep stosq
        lea     rdi, [rdx+128]
        mov     ecx, 16
        rep stosq
        mov     rax, rdx
        ret

Both structs are equivalent in that neither is trivial and both are standard layout. Is gcc just missing an optimization opportunity, or is there more to it from the C++-the-language perspective?


The example is a stripped-down version of production code where this made a material difference in performance.


[1] Godbolt: https://godbolt.org/z/8n1Mae

Ilya Kurnosov
  • g++ is probably coded with an optimised behaviour when using the `= default`, but doesn't check if a hand-rolled constructor can be similarly optimised. It's quite common for compilers to not consider every optimisation opportunity. In this case, g++ developers probably didn't consider that a hand-rolled constructor with an empty body and no explicit initialisers was worth worrying about. If you think it's important, make a suggestion (to the g++ team, not here). Or simply update your coding standards to prefer the `= default`. – Peter Jan 24 '21 at 14:01
  • At least, the question is a nice recommendation to prefer `= default` over `{ }` for default constructors. (Something I have somehow always felt since I learnt about the `= default` option.) ;-) – Scheff's Cat Jan 24 '21 at 15:24
  • @Peter: Yeah, looks like the `= default` constructor is looking at the whole object as a single object as a candidate for memset init, while the `C(){}` explicit constructor is only looking at the individual members separately and fails to merge the memsets. That's clearly a missed optimization which you can report to https://gcc.gnu.org/bugzilla/. There are probably real-world cases with multiple zeroed members but still some init code to run. It only really becomes most visible when GCC chooses to expand memset as `rep stos` or loops, instead of fully unrolled `vmovdqu` stores. – Peter Cordes Jan 24 '21 at 17:03
  • With `-march=skylake`, instead of rep stos we see two separate loops for the split version. https://godbolt.org/z/Y1KeYW. (Using scalar stores, which is a separate memset-expansion missed optimization. GCC only uses wide SIMD stores when fully unrolling. [Trying to understand clang/gcc \_\_builtin\_memset on constant size / aligned pointers](https://stackoverflow.com/q/65534658). GCC8 and earlier still used rep stos which has significant startup overhead but can be better than 16-byte stores.) Anyway, certainly no advantage to 2 separate smaller memsets whatever strategy. – Peter Cordes Jan 24 '21 at 17:09

1 Answer


While I agree that this looks like a missed optimization opportunity, I noticed one difference at the language level. The defaulted constructor is implicitly constexpr, while the empty user-provided default constructor in your example is not. From cppreference.com:

That is, [the implicitly-defined constructor] calls the default constructors of the bases and of the non-static members of this class. If this satisfies the requirements of a constexpr constructor, the generated constructor is constexpr (since C++11).

Since the initialization of the arrays of long is allowed in a constant expression, the implicitly-defined constructor is constexpr as well. The user-provided one is not, because it is not marked constexpr. We can confirm this by trying to make the construct function from the example constexpr. With the defaulted constructor this compiles without any problems, but with the empty user-provided version it fails to compile because

<source>:3:8: note: 'C' is not an aggregate, does not have a trivial default constructor, and has no 'constexpr' constructor that is not a copy or move constructor

as we can see here: https://godbolt.org/z/MnsbzKv1v

So to fix this difference we can make the empty user-defined constructor constexpr:

struct C {
    long a[16]{};
    long b[16]{};

    constexpr C() {}
};

Somewhat surprisingly, gcc now generates the optimized version, i.e. the exact same code as for the defaulted default constructor: https://godbolt.org/z/cchTnEhKW

I do not know why, but this difference in constexprness actually seems to help the compiler in this case. So while it seems like gcc should be able to generate the same code without specifying constexpr, I guess it is good to know that it can be beneficial.
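The constexpr-ness of the two working variants can also be checked directly at compile time, without looking at the assembly. The following is only a minimal sketch, assuming C++17; the names Defaulted, ConstexprEmpty, and probe are hypothetical and chosen for illustration, they do not appear in the original code.

```cpp
// Hypothetical struct names; both mirror the layout of C from the question.
struct Defaulted {
    long a[16]{};
    long b[16]{};

    Defaulted() = default;          // implicitly constexpr
};

struct ConstexprEmpty {
    long a[16]{};
    long b[16]{};

    constexpr ConstexprEmpty() {}   // explicitly marked constexpr
};

// Hypothetical helper: default-constructs a T and reads two of its elements.
// It is only usable in a constant expression if T's default constructor is
// constexpr.
template <class T>
constexpr long probe() {
    T value;  // runs T's default constructor
    return value.a[0] + value.b[15];
}

// Both static_asserts compile, so both constructors are usable at compile time.
static_assert(probe<Defaulted>() == 0);
static_assert(probe<ConstexprEmpty>() == 0);

// A variant with a plain "C() {}" constructor would be rejected in the same
// static_assert, matching the compiler note quoted above.
```

This only shows that both constructors are constexpr. It says nothing by itself about which code gcc generates, which still has to be checked on godbolt.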


As an additional test for this observation, we can try to make the implicitly-defined constructor non-constexpr and see whether gcc then fails to do the optimization. One simple way to test this is to have C inherit from an empty class with a non-constexpr default constructor:

struct D {
    D() {}
};

struct C : D {
    long a[16]{};
    long b[16]{};

    C() = default;
};

And indeed, this generates the assembly that initializes the fields individually again. Once we make D() constexpr, we get the optimized code back. See https://godbolt.org/z/esYhc1cfW.
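For reference, the variant with a constexpr D() might look like the sketch below. It assumes C++17, and the helper probe_c is a hypothetical name added only to show that the defaulted constructor of C becomes usable in a constant expression again.

```cpp
struct D {
    constexpr D() {}   // now constexpr; the only change from the version above
};

struct C : D {
    long a[16]{};
    long b[16]{};

    C() = default;     // constexpr again, because the base constructor is
};

// Hypothetical helper: compiles as a constant expression only if C's
// defaulted constructor is constexpr.
constexpr long probe_c() {
    C c;
    return c.a[0] + c.b[15];
}

static_assert(probe_c() == 0);
```

With the non-constexpr D() {} from above, the same static_assert would not compile.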

mjacobse