
If you compile code such as

#include <atomic>

int load(std::atomic<int> *p) {
    return p->load(std::memory_order_acquire) + p->load(std::memory_order_acquire);
}

you see that MSVC generates NOP padding after each memory load:

int load(std::atomic<int> *) PROC
        mov     edx, DWORD PTR [rcx]
        npad    1
        mov     eax, DWORD PTR [rcx]
        npad    1
        add     eax, edx
        ret     0

Why is this? Is there any way to avoid it without relaxing the memory order (which would affect the correctness of the code)?

Peter Cordes
user541686
  • Related, maybe answers this question too: https://stackoverflow.com/questions/44854497/why-does-64-bit-vc-compiler-add-nop-instruction-after-function-calls – 273K Dec 30 '22 at 01:30

1 Answer


p->load() may eventually use the _ReadWriteBarrier compiler intrinsic.
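To make that concrete, here is a rough sketch of what an acquire load amounts to on x86-64: an ordinary load followed by a barrier that only constrains the compiler. This is not the MSVC STL source. The name `load_acquire_sketch` is hypothetical, and `std::atomic_signal_fence` is used as a portable stand-in for `_ReadWriteBarrier`, which is an assumption made for illustration.

```cpp
#include <atomic>

// Hypothetical helper, not the MSVC STL implementation.
// On x86-64 the hardware already orders a plain load before later loads and
// stores, so acquire semantics only require that the compiler not move later
// memory accesses above the load.
int load_acquire_sketch(const volatile int *p) {
    int v = *p;  // plain mov from memory
    // Compiler-only barrier: emits no fence instruction by itself.
    // Assumed stand-in for the _ReadWriteBarrier intrinsic.
    std::atomic_signal_fence(std::memory_order_seq_cst);
    return v;
}
```

Real code should keep using `std::atomic<int>::load`; the sketch only shows where a compiler barrier sits relative to the load, which is the spot where the padding appears.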

According to this: https://developercommunity.visualstudio.com/t/-readwritebarrier-intrinsic-emits-unnecessary-code/1538997

the nops get inserted because of the flag /volatileMetadata which is now on by default. You can return to the old behavior by adding /volatileMetadata-, but doing so will result in worse performance if your code is ever run emulated. It’ll still be emulated correctly, but the emulator will have to pessimistically assume every load/store needs a barrier.

And compiling with /volatileMetadata- does indeed remove the npad.
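A minimal way to check this yourself is to put the function from the question in its own file and compare the assembly listings with and without the switch. The file name `load.cpp` and the exact command line are assumptions; `/O2`, `/c` and `/FA` are ordinary `cl` options, and `/volatileMetadata-` is the switch quoted above.

```cpp
// load.cpp (hypothetical file name): the function from the question, unchanged.
// Assumed command lines; /FA writes an assembly listing to load.asm:
//   cl /O2 /c /FA load.cpp                      (default: padding expected)
//   cl /O2 /c /FA /volatileMetadata- load.cpp   (padding expected to disappear)
#include <atomic>

int load(std::atomic<int> *p) {
    // The source still requests acquire ordering for both loads;
    // only the compiler switch changes between the two builds.
    return p->load(std::memory_order_acquire) + p->load(std::memory_order_acquire);
}
```

The source is identical in both builds, so the memory order requested in the code is not relaxed. The trade-off is the one described in the quote: potentially slower execution under emulation.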

Artyer
  • So there can be metadata that means a 1-byte `nop` after a memory access should be treated as some kind of memory barrier when binary-translating to a weakly-ordered ISA? `_ReadWriteBarrier` only blocks compile-time reordering, but on x86(-64), that after a load is sufficient for an acquire operation, so I guess a translator could recognize that as an acquire load (aarch64 `ldar`)? It would need a way to signal a full `std::atomic_thread_fence(std::memory_order_acquire)` 2-way barrier (not a 1-way ordered *operation*), so maybe *every* 1-byte NOP is treated as a fence? – Peter Cordes Dec 30 '22 at 04:40
  • Anyway, yeah that explains that the NOP has some kind of meaning. Maybe they went with in-band NOPs to be able to support x86-64 JIT engines being aware of this ARM64ec thing? Otherwise pure metadata with the addresses of atomic operations and barriers could have avoided wasting front-end bandwidth and uop-cache footprint when running on actual x86-64, at a cost in binary size. But would also give room for more specific info about what kind of memory-order is required. – Peter Cordes Dec 30 '22 at 04:43
  • @PeterCordes I asked this in MSVC STL Discord, someone (not from MS) [assumed](https://discord.com/channels/737189251069771789/737734473751330856/1060227069625254008) that there's no out-of-band metadata, to save space, possibly ending up with spurious barriers for true NOPs inserted for alignment. This assumption aligns with my experiments, which show no signs of extra metadata. – Alex Guteniev Jan 04 '23 at 16:15