
When you run an atomic instruction (say, an interlocked compare-exchange/add/etc.) on x86 on a memory location that's homed on another NUMA node (i.e. attached to another socket's memory controller), but not currently cached by any CPU, how does it get handled?

Do all CPUs in the path then stall until the operation finishes? Or does the second CPU keep executing instructions as much as it can? Or does the operation become non-atomic in that case (!)?

What happens behind the scenes here?
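
For concreteness, here's a minimal sketch of the case I mean. The libnuma calls and node numbers are just my way of forcing the line to be homed on another socket and not yet cached anywhere (run it pinned to node 0, e.g. under `numactl --cpunodebind=0`):

```cpp
// Build with: g++ -O2 remote_rmw.cpp -lnuma    (requires libnuma)
#include <atomic>
#include <cstdio>
#include <new>
#include <numa.h>

int main() {
    if (numa_available() < 0 || numa_max_node() < 1) {
        std::puts("need a NUMA machine with at least two nodes");
        return 1;
    }
    // Home the memory on node 1; this thread is assumed to run on node 0.
    void* mem = numa_alloc_onnode(sizeof(std::atomic<int>), 1);
    if (!mem) return 1;
    auto* counter = new (mem) std::atomic<int>(0);

    int expected = 0;
    counter->compare_exchange_strong(expected, 42);  // remote atomic RMW

    std::printf("counter = %d\n", counter->load());
    numa_free(mem, sizeof(std::atomic<int>));
}
```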

user541686
  • *asserts the LOCK# signal until it's done with the operation.* - That hasn't been the case for years / decades. Before modifying a cache line, a core needs to get exclusive ownership of it (MESI). It can simply delay responding to share requests until after the atomic RMW, i.e. it takes a "cache lock" for the duration of the operation. See [Can num++ be atomic for 'int num'?](https://stackoverflow.com/q/39393850), and the first sketch below this comment thread. – Peter Cordes Oct 27 '21 at 05:14
  • For a `lock`ed operation that's split between two cache lines, yes, possibly there's some kind of slow global mechanism equivalent to the LOCK# shared-bus signal; such operations are extremely expensive, so software carefully avoids doing that (a contrived split-lock example is sketched below the comments). – Peter Cordes Oct 27 '21 at 05:16
  • @PeterCordes: Thanks for the correction. I'm confused how my question is a duplicate though. Could you please explain which answer in that question explains how this is addressed in multi-socket systems? I see multi-core, but not multi-socket, and my question was *specifically* about how the hardware deals with this on a multi-socket system. – user541686 Oct 27 '21 at 07:29
  • The 2nd paragraph of the accepted answer explains that pre-PPro, they used a LOCK# signal on a shared bus. Since PPro, they don't. x86 NUMA systems (with on-chip memory controllers instead of a Northbridge and shared FSB) post-date PPro. The mechanism that answer explains for how atomic RMW is handled on modern x86 NUMA systems fairly obviously only depends on MESI, and making MESI work across sockets is a problem that already has to be handled for normal loads/stores (with snoop filters if necessary). So only cache-line-split `lock`s are a real problem, and that's not the normal case. – Peter Cordes Oct 27 '21 at 08:21
  • If you want to edit this question to remove false premises, and ask something that isn't already covered by that duplicate, e.g. how cache-line-split (and maybe uncacheable) `lock`ed instructions (atomic RMWs) are handled on multi-socket NUMA systems, that could make it worth having this as a separate question we could reopen. – Peter Cordes Oct 27 '21 at 08:25
  • @PeterCordes: I think this is the first time in my life I've seen someone close a question as a duplicate merely because the OP had a misunderstanding of what happens in a *different* case than what they were asking about, but sure, I edited the question... can you reopen now? I feel like it should be pretty clear what I'm asking now, which is not the cached case. – user541686 Oct 27 '21 at 17:31
  • So you're specifically asking about UC memory mappings now, normally only used for MMIO? Not normal write-back (WB) cacheable memory, which is what OSes always use for all memory they let user-space allocate. Or do you mean normal (WB) memory that happens not to currently be hot in cache? If the latter, then the linked duplicate still explains what happens: an RFO to get ownership of the line, just like for a store, then everything that happens to the cache line between then and responding to the next share request is atomic from the POV of any outside observer. "cacheable" doesn't mean "cached". (See the clflush sketch below the comments.) – Peter Cordes Oct 27 '21 at 17:37
  • And BTW, no, I didn't close it because of a misunderstanding about a *different* case; I closed it because I assume you're asking about normal atomic RMW operations in normal programs and OSes, i.e. aligned and operating on WB-cacheable memory. The linked duplicate's description of taking advantage of the existing MESI coherence mechanism by just doing a local "cache lock" still seems like an answer to me. Maybe I can find a better duplicate that explains in more detail, though. – Peter Cordes Oct 27 '21 at 17:45
  • Re: NUMA, [Is mov + mfence safe on NUMA?](https://stackoverflow.com/q/54652663) discusses NUMA for atomic stores (and sequential consistency), with some mention of RFOs and MESI. Maybe there's enough to say about this that it should be collected into one answer here instead of scattered pieces in a few other existing answers. If it's going to also cover cache-line-split locks, it would have to be a 2-part answer I guess. – Peter Cordes Oct 27 '21 at 18:01
  • @PeterCordes: Oh I meant normal WB memory that's not cached anywhere. I didn't mean uncacheable memory. I think your new link answers about half my question about multi-socket now, thanks. The parts that I think are still unanswered are: (1) Is TSO still guaranteed when you have RAM connected to separate physical CPUs (not just cores)? Is the K10 weirdness you mention the only one? Or are there other such weirdnesses that occur when accessing memory connected to a separate CPU? (2) What are the effects on the second CPU? I assume if it accesses that cache line then it stalls, but otherwise..? – user541686 Oct 28 '21 at 02:50
  • Ok, now we're getting closer to an answerable non-duplicate. Did you see the edits to the duplicate list as well? I didn't add [Is mov + mfence safe on NUMA?](https://stackoverflow.com/q/54652663) to the dup list because you were asking about atomic RMWs, not casting doubt on the memory model still being accurate for multi-socket machines. As my answer there says, yes, NUMA machines still follow x86-TSO. When you say "weirdness", note that 16-byte atomicity has never been guaranteed on paper, so the K10 HyperTransport tearing effect is not violating any documented guarantee (see the 16-byte atomicity sketch below the comments). – Peter Cordes Oct 28 '21 at 02:58
  • There are many interesting *performance* effects in terms of bandwidth and latency for multi-socket systems, for both local and remote memory. e.g. if no cores on the other package are active, its L3 and memory controllers might clock down, too, hurting memory bandwidth on the other socket if they can't keep up with snoop requests. (Haswell and later allow the uncore to clock high even if all cores are asleep; see links [in this Q&A](https://stackoverflow.com/q/54796419), especially to Dr. Bandwidth's Intel forum post.) – Peter Cordes Oct 28 '21 at 03:15
  • Somewhat related: https://silo.tips/download/main-memory-and-cache-performance-of-intel-sandy-bridge-and-amd-bulldozer / [Could multi-cpu access memory simultaneously in common home computer?](https://stackoverflow.com/q/51577287) – Peter Cordes Oct 28 '21 at 03:15
  • (2) What do you mean "second CPU"? Each socket has multiple cores, unless you're talking about old systems like P4 era which I think pre-date x86 NUMA. (AMD had on-chip memory controllers since K8 in 2004, before Intel, but IDK if they made any single-core-per-socket Opterons like that). Remote memory requests don't directly affect the CPU cores on the package servicing them. They contend for cycles in the memory controller, against L3 cache misses on this socket's cores that need to access the local DRAM, but that happens after cache-coherence is complete and is basically separate. – Peter Cordes Oct 28 '21 at 03:20
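
To illustrate the "cache lock" from the first comment: a plain aligned atomic RMW compiles to a single `lock`-prefixed instruction, and its atomicity comes from the core holding exclusive ownership of the line for the duration, not from locking a shared bus. A minimal sketch (the comment shows what GCC/Clang typically emit on x86-64):

```cpp
#include <atomic>

std::atomic<int> num{0};

void increment() {
    // Typically compiles to:  lock add dword ptr [rip + num], 1
    // The core issues an RFO to own the line in Modified/Exclusive state,
    // then delays responding to share/invalidate requests until the
    // read-modify-write is done (a "cache lock"); no LOCK# bus signal.
    num.fetch_add(1, std::memory_order_seq_cst);
}
```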
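To make the cache-line-split case from the second comment concrete: the only time a `lock`ed operation needs a global mechanism is when its operand straddles two cache lines, which you have to go out of your way to construct. A deliberately broken sketch (GCC/Clang `__atomic` builtins; misaligned atomics are undefined behavior in ISO C++, and on recent hardware Linux's split-lock detection can warn about or even kill a process that does this):

```cpp
alignas(64) char buf[128];

int main() {
    // An int placed at offset 62 straddles the boundary between the two
    // 64-byte cache lines of buf. Deliberately misaligned; don't do this.
    int* split = reinterpret_cast<int*>(buf + 62);
    __atomic_fetch_add(split, 1, __ATOMIC_SEQ_CST);  // cache-line-split lock
    return *split;
}
```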
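On "cacheable doesn't mean cached": evicting the line first doesn't change the mechanism, it only means the RFO has to be satisfied from DRAM on the line's home node instead of from some cache. A sketch that uses `_mm_clflush` to force the not-currently-cached case the question asks about:

```cpp
#include <atomic>
#include <immintrin.h>

std::atomic<int> counter{0};

void rmw_on_uncached_line() {
    // Evict the line from all cache levels, making it "cacheable but not
    // currently cached" anywhere.
    _mm_clflush(&counter);
    _mm_mfence();  // order the flush before the RMW

    // Exactly the same instruction as the hot-in-cache case (lock add):
    // the core's RFO fetches the line from the home node's DRAM, and the
    // locked RMW then happens inside this core's cache as usual.
    counter.fetch_add(1, std::memory_order_seq_cst);
}
```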
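And on the K10 "weirdness": 16-byte loads/stores were never documented as atomic, so SSE tearing across HyperTransport broke no written guarantee. If you want 16-byte atomicity you have to ask for it explicitly (e.g. via `cmpxchg16b`), and you can query at runtime whether the implementation is lock-free. A sketch (may need `-mcx16` and/or `-latomic` with GCC/Clang; some toolchains route 16-byte atomics through libatomic and report them as not lock-free even when `cmpxchg16b` is used):

```cpp
#include <atomic>
#include <cstdint>
#include <cstdio>

struct Pair { std::uint64_t lo, hi; };

int main() {
    std::atomic<Pair> p{};
    // True only if the implementation inlines cmpxchg16b for 16-byte
    // objects; plain 16-byte SSE loads/stores are not guaranteed atomic.
    std::printf("16-byte atomics lock-free: %d\n", (int)p.is_lock_free());
}
```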

0 Answers