Why does MSVC generate nop instructions for atomic loads on x64?

Question

If you compile code such as

#include <atomic>

int load(std::atomic<int> *p) {
    return p->load(std::memory_order_acquire) + p->load(std::memory_order_acquire);
}

you see that MSVC generates NOP padding after each memory load:

int load(std::atomic<int> *) PROC
        mov     edx, DWORD PTR [rcx]
        npad    1
        mov     eax, DWORD PTR [rcx]
        npad    1
        add     eax, edx
        ret     0

Why is this? Is there any way to avoid it without relaxing the memory order (which would affect the correctness of the code)?

Related, maybe answers this question too: https://stackoverflow.com/questions/44854497/why-does-64-bit-vc-compiler-add-nop-instruction-after-function-calls — 273K, Dec 30 '22 at 01:30

score 9 · Accepted Answer · answered Dec 30 '22 at 03:40


p->load() may eventually use the _ReadWriteBarrier compiler intrinsic.

According to this Developer Community thread, https://developercommunity.visualstudio.com/t/-readwritebarrier-intrinsic-emits-unnecessary-code/1538997, the nops get inserted because of the flag /volatileMetadata, which is now on by default. You can return to the old behavior by adding /volatileMetadata-, but doing so will result in worse performance if your code is ever run emulated. It’ll still be emulated correctly, but the emulator will have to pessimistically assume every load/store needs a barrier.

And compiling with /volatileMetadata- does indeed remove the npad.
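If you want to try this yourself, here is a minimal sketch of a test file. The first function has the same shape as the one in the question. The second is a hypothetical comparison case (not from the original post) that spells the barrier out with the standard `std::atomic_signal_fence`, which only constrains the compiler. I am assuming this is roughly analogous to what the `_ReadWriteBarrier` path does; it is not taken from the MSVC STL source, and in portable C++ it is not a substitute for an acquire load.

```cpp
#include <atomic>

// Same shape as the function in the question: two acquire loads, summed.
int load_twice_acquire(const std::atomic<int>* p) {
    return p->load(std::memory_order_acquire) + p->load(std::memory_order_acquire);
}

// Hypothetical comparison (an assumption, not the STL's actual code):
// relaxed loads, each followed by a compiler-only fence. The fence stops
// the compiler from reordering later accesses above the load but emits no
// hardware barrier. The C++ standard only guarantees ordering here with
// respect to a signal handler on the same thread, so treat this as an
// experiment for inspecting codegen, not as a replacement for acquire.
int load_twice_compiler_fence(const std::atomic<int>* p) {
    int a = p->load(std::memory_order_relaxed);
    std::atomic_signal_fence(std::memory_order_acquire);
    int b = p->load(std::memory_order_relaxed);
    std::atomic_signal_fence(std::memory_order_acquire);
    return a + b;
}
```

Saving this as, say, demo.cpp (the file name is just a placeholder) and compiling once with `cl /O2 /c /FA demo.cpp` and once with `cl /O2 /c /FA /volatileMetadata- demo.cpp` should let you diff the two .asm listings and see where the npad lines appear or disappear.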


Artyer



So there can be metadata that means a 1-byte `nop` after a memory access should be treated as some kind of memory barrier when binary-translating to a weakly-ordered ISA? `_ReadWriteBarrier` only blocks compile-time reordering, but on x86(-64), that after a load is sufficient for an acquire operation, so I guess a translator could recognize that as an acquire load (aarch64 `ldar`)? It would need a way to signal a full `std::atomic_thread_fence(std::memory_order_acquire)` 2-way barrier (not a 1-way ordered *operation*), so maybe *every* 1-byte NOP is treated as a fence? – Peter Cordes Dec 30 '22 at 04:40
Anyway, yeah that explains that the NOP has some kind of meaning. Maybe they went with in-band NOPs to be able to support x86-64 JIT engines being aware of this ARM64ec thing? Otherwise pure metadata with the addresses of atomic operations and barriers could have avoided wasting front-end bandwidth and uop-cache footprint when running on actual x86-64, at a cost in binary size. But would also give room for more specific info about what kind of memory-order is required. – Peter Cordes Dec 30 '22 at 04:43

@PeterCordes I asked this in MSVC STL Discord, someone (not from MS) [assumed](https://discord.com/channels/737189251069771789/737734473751330856/1060227069625254008) that there's no out-of-band metadata, to save space, possibly ending up with spurious barriers for true NOPs inserted for alignment. This assumption aligns with my experiments, which show no signs of extra metadata. – Alex Guteniev Jan 04 '23 at 16:15
