
While benchmarking code involving std::optional<double>, I noticed that the code generated by MSVC runs at roughly half the speed of the code produced by clang or gcc. After reducing the code, I found that MSVC apparently has trouble generating code for std::optional::operator=. Using std::optional::emplace() does not exhibit the slowdown.

The following function

void test_assign(std::optional<double> & f){
    f = std::optional{42.0};
}

produces

sub     rsp, 24
vmovsd  xmm0, QWORD PTR __real@4045000000000000
mov     BYTE PTR $T1[rsp+8], 1
vmovups xmm1, XMMWORD PTR $T1[rsp]
vmovsd  xmm1, xmm1, xmm0
vmovups XMMWORD PTR [rcx], xmm1
add     rsp, 24
ret     0

Notice the unaligned moves (vmovups). In contrast, the function

void test_emplace(std::optional<double> & f){
    f.emplace(42.0);
}

compiles to

mov     rax, 4631107791820423168      ; 4045000000000000H
mov     BYTE PTR [rcx+8], 1
mov     QWORD PTR [rcx], rax
ret     0

This version is much simpler and faster. Both listings were generated with MSVC 19.32 using /O2 /std:c++17 /DNDEBUG /arch:AVX.

clang 14 with -O3 -std=c++17 -DNDEBUG -mavx produces

movabs  rax, 4631107791820423168
mov     qword ptr [rdi], rax
mov     byte ptr [rdi + 8], 1
ret

in both cases.
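Since both functions should leave the optional in the same observable state, the extra instructions in the assignment case are not required by the semantics. A minimal sketch to check this, where `check_same_result` is a hypothetical helper that is not part of my original benchmark:

```cpp
#include <cassert>
#include <optional>

// Same two functions as above, repeated so the snippet compiles on its own.
void test_assign(std::optional<double>& f) {
    f = std::optional{42.0};  // builds a temporary optional, then assigns it
}

void test_emplace(std::optional<double>& f) {
    f.emplace(42.0);          // constructs the value directly in place
}

// Hypothetical helper: both variants must end up with an engaged optional
// holding 42.0, regardless of the state the optional started in.
void check_same_result() {
    std::optional<double> a;        // starts empty
    std::optional<double> b = 1.0;  // starts engaged with a different value
    test_assign(a);
    test_emplace(b);
    assert(a.has_value() && b.has_value());
    assert(*a == 42.0 && a == b);
}
```

This only confirms that the two forms are interchangeable here; it says nothing about which instructions a given compiler emits.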

Replacing std::optional<double> with

struct MyOptional {
    double d;
    bool hasValue; // Required to reproduce the problem
    
    MyOptional(double v) {
        d = v;
    }

    void emplace(double v){
        d = v;
    }
};

exhibits the same issue. Apparently MSVC has trouble with the additional bool member.
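For reference, a sketch of how the two test functions might look with this struct. The names `my_assign`, `my_emplace` and `check_my_optional` are hypothetical, and this variant assumes that `hasValue` is set to true in both the constructor and `emplace`, which the listing above does not do:

```cpp
#include <cassert>

// Stand-in for std::optional<double>, modeled on the struct above.
// Assumption: hasValue is set to true in both places; the listing above
// leaves it untouched.
struct MyOptional {
    double d;
    bool hasValue;

    MyOptional(double v) : d(v), hasValue(true) {}

    void emplace(double v) {
        d = v;
        hasValue = true;
    }
};

// Hypothetical counterparts of test_assign / test_emplace.
void my_assign(MyOptional& f) {
    f = MyOptional{42.0};  // temporary object, then copy assignment
}

void my_emplace(MyOptional& f) {
    f.emplace(42.0);       // writes both members directly
}

// Hypothetical sanity check: both paths must leave the same state behind.
void check_my_optional() {
    MyOptional a{0.0}, b{0.0};
    my_assign(a);
    my_emplace(b);
    assert(a.d == b.d && a.hasValue == b.hasValue);
}
```

Setting the flag explicitly avoids copying an uninitialized bool in the assignment path, so the two functions are directly comparable.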

See godbolt for a live example.

Why is MSVC producing these unaligned moves? To be clear, the question is not why they are unaligned rather than aligned (which wouldn't improve things according to this post), but why MSVC produces a considerably more expensive instruction sequence in the assignment case. Is this simply a bug (or a missed optimization) in MSVC, or am I missing something?

Sedenion
  • Reading the godbolt code, this is due to the use of that struct. Still odd though. – pm100 Jun 25 '22 at 16:24
  • Looks to me like it is trying to write-combine the bool and double into a vector op. One of those *compiler is trying to be clever* mis-optimizations. – Goswin von Brederlow Jun 25 '22 at 16:34
  • @user17732522 oops sorry – Richard Critten Jun 25 '22 at 16:41
  • https://stackoverflow.com/questions/42697118/visual-studio-2017-mm-load-ps-often-compiled-to-movups/45466585#45466585 – Hans Passant Jun 25 '22 at 18:16
  • @user17732522 Yes, sorry, I just fixed it. Besides this, I don't think that the [answer from the other question](https://stackoverflow.com/a/45466585/3740047) addresses the problem here. The other answer basically says "unaligned loads/stores do not cost anything compared to aligned loads/stores". But in the case here the compiler generates a bunch of additional instructions (regardless of whether they are unaligned) that are unnecessary in the first place (as shown by clang). And the additional instructions do cost performance. – Sedenion Jun 25 '22 at 18:24
  • If it had put the entire 16-byte `std::optional` in static memory, and copied it with `vmovups xmm1, [static_constant] / vmovups [rcx], xmm1` it might not have been so bad. But instead it puts only the `double` constant in static memory, constructs the `std::optional` on the stack, and *then* copies it to its destination. – Nate Eldredge Jun 25 '22 at 18:45
  • @NateEldredge: Actually it doesn't ever copy the `double` to the stack, it stores a `1` with a byte store, then does a 16-byte load, then replaces the low 8 bytes of the XMM register with a `movsd` register blend. So storing to the stack and reloading was just a way to get a `1` into byte 8 of an XMM register, with the rest being don't-care garbage. Taking 2 instructions and creating a store-forwarding stall, horrible vs. `vpcmpeqd xmm1,xmm1,xmm1` / `vpabsd xmm1, xmm1`. And then obviously if you want to merge a new low half from memory, `movlps` not `movsd`-load + `movsd xmm,xmm`. – Peter Cordes Jun 25 '22 at 20:20
  • Anyway, yeah, MSVC is fairly widely considered / known [not to be as good an optimizing compiler as clang or GCC](https://www.agner.org/optimize/blog/read.php?i=1015). The problem isn't that the `vmovups` instructions aren't `vmovaps`; the address likely is 16-byte aligned, since it knows the incoming stack alignment. The problem is the store-forwarding stall from narrow store, wide reload! (GCC isn't immune from shooting itself in the foot, too, though: [Bubble sort slower with -O3 than -O2 with GCC](https://stackoverflow.com/q/69503317)) – Peter Cordes Jun 25 '22 at 20:24
