When you run an atomic instruction on x86 (say, an interlocked compare-exchange or add, i.e. a `lock`-prefixed read-modify-write) on a memory location that is attached to a CPU on another NUMA node, but is not currently cached by any CPU, how is it handled?
Do all CPUs in the path stall until the operation is finished? Or does the remote CPU (the one that owns the memory) keep executing instructions as far as it can? Or does the operation become non-atomic in that case (!)?
What happens behind the scenes here?
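
For concreteness, here is a minimal sketch of the kind of operations I mean. The names `shared_counter`, `atomic_add`, and `atomic_cas` are made up for illustration, and the code does nothing to place the variable on a remote NUMA node; it only shows the instructions in question. On x86-64, compilers typically lower these to `lock xadd` and `lock cmpxchg`, though that is my understanding of the usual code generation rather than a guarantee.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical shared variable. Imagine that the page holding it is
// physically attached to a different NUMA node than the calling thread.
std::atomic<std::uint64_t> shared_counter{0};

// Atomic add: returns the value held before the addition.
// Typically compiled to `lock xadd` on x86-64.
std::uint64_t atomic_add(std::uint64_t delta) {
    return shared_counter.fetch_add(delta, std::memory_order_seq_cst);
}

// Atomic compare-exchange: stores `desired` only if the current value
// equals `expected`, and returns whether the store happened.
// Typically compiled to `lock cmpxchg` on x86-64.
bool atomic_cas(std::uint64_t expected, std::uint64_t desired) {
    return shared_counter.compare_exchange_strong(
        expected, desired, std::memory_order_seq_cst);
}
```

The question is what the hardware does between the cores, caches, and the inter-socket link while one of these calls is in flight.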