6

Over years of development, CPUs have gained many cores, distributed blocks of cores on multiple chiplets, NUMA systems, etc., but a piece of data still has to pass not only through L1 cache (if the threads are SMT siblings on the same core) but also through some atomic/mutex synchronization primitive that is not accelerated by hardware.

I wonder why Intel or IBM didn't come up with something like:

movcor 1 MX 5   <--- sends 5 to the messaging register of core 1
pipe 1 1 1      <--- pushes data=1 to pipe=1 of core=1; core 1 needs to pop it
bcast 1         <--- broadcasts 1 to all cores' pipe-0

to make it much faster than some other methods? GPUs support block-wise fast synchronization points, like barrier() or __syncthreads(). GPUs also support parallel atomic update acceleration for local arrays.

When CPUs gain 256 cores, won't this feature enable serious scaling for various algorithms that are bottlenecked on core-to-core bandwidth (and/or latency)?

Sep Roland
huseyin tugrul buyukisik
  • CPU architecture is insanely complex. Asking simple, *"why didn't they"* questions like this requires 1) looking into the minds of the Intel engineers in the past, and 2) enough reference material to fill a textbook. I don't think this is effectively answerable here. I suspect the answers will involve some combination of: takes up too much die area (or too many layers), problems relating to process isolation and security, context switching (who owns the message registers?), problems relating to synchronization/deadlocks/interrupts when each core can poke the other at any time, etc. – J... May 11 '22 at 19:50
  • @J...: Yeah, context switching where the OS needs to virtualize this state seems like a major hurdle for this to be usable by software. Fraught with race conditions, at best, and the HW might have to provide features to make it usable under a normal OS, like thread-ID tags if someone really really wanted this instead of just using a GPU, since GPUs exist for problems like this. I think the question is answerable as far as pointing out a few possible showstoppers to show that it's not that simple, and also to question the assumptions about lack of scalability. – Peter Cordes May 11 '22 at 20:07
  • @PeterCordes For example, when mutexes are close enough together they cause false sharing and diminish scaling under lock/unlock contention. But a dedicated pipeline would not be false-shared, I guess. Another example would be having single-channel RAM and 128 cores, with all cores doing atomic updates of variables in parallel; maybe this makes single-channel RAM not enough / creates a RAM dependency? – huseyin tugrul buyukisik May 11 '22 at 20:10
  • @huseyintugrulbuyukisik: Huh? What does DRAM have to do with atomic updates? Once a core gets MESI exclusive ownership of a cache line, `lock cmpxchg [rdi], ecx` or whatever just hangs on to ownership of the cache line during the operation; purely local to the core. And yeah, if you write inefficient software, it's going to suck. It's not trivial to write software for huge numbers of cores. – Peter Cordes May 11 '22 at 20:19
  • If you want low-cost scalable atomic RMWs, the sensible thing is to put a simple ALU into the shared cache, e.g. so a core can send an atomic-increment request to shared L3 where it happens on the data there, instead of having to get exclusive ownership of the line temporarily. (With the old value being returned read-only to allow a fetch_add return value). I've seen mention of this idea (I didn't think of it), maybe even implemented in some CPU. – Peter Cordes May 11 '22 at 20:22
  • How would CPUs handle flow control for any of the instructions (e.g. if many senders are each sending faster than receiving software can receive)? Does data get overwritten/lost before receiver can do anything with it? Do sending CPUs wait potentially forever for space to become available? Do queues continually grow until you run out of space (before data is lost and/or senders start waiting)? If sender stores the data in memory then sends a notification with no data (an inter-processor interrupt) all the problems can be solved in software however software wants, but that's what we have now. – Brendan May 16 '22 at 02:29
  • I spent too many years fighting this fight inside IBM and AMD to summarize briefly, but there are both engineering and cultural issues at play.... One processor family that can directly support core-to-core (or core-to-device or device-to-core) interfaces is the Cadence Tensilica Xtensa line. The capabilities are discussed in a white paper: https://ip.cadence.com/uploads/814/TIP_WP_High_Speed_Alternatives_V5_FINAL-pdf – John D McCalpin Jun 03 '22 at 15:12

1 Answer

7

CPUs evolved for a very different programming model than GPUs, to run multiple separate threads, potentially of different processes, so you'd also need software and OS infrastructure to let threads know which other core (if any) some other thread was running on. Or they'd have to pin each thread to a specific core. But even then it would need some way to virtualize the architectural message-passing register, the same way context switches virtualize the standard registers for multi-tasking on each core.

So there's an extra hurdle before anything like this could even be usable at all under a normal OS, where a single process doesn't take full ownership of the physical cores. The OS is still potentially scheduling other threads of other processes onto cores, and running interrupt handlers, unlike a GPU where cores don't have anything else to do and are all built to work together on a massively parallel problem.

Intel did introduce user-space interrupts in Sapphire Rapids, including user IPIs (inter-processor interrupts), but those don't involve a receive queue that would have to get saved/restored on context switch. The OS still has to manage some state (like the User Interrupt Target Table), but it's not as problematic for context switches, I think. It's solving a different problem than what you're suggesting, since it's an interrupt, not a message queue.

Notifying another thread when to look in shared memory for data is the hard part of the problem that needed solving, more so than getting data between cores. Shared memory is still OK for that (especially with new instructions like cldemote to let the writer ask for a recently-stored cache line to be written back to shared L3, where other cores can read it more efficiently). See the section below about UIPIs.
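For illustration, here's a minimal sketch of how a producer might combine a release-store with the `_mm_cldemote` intrinsic (available in recent GCC/Clang when compiling with `-mcldemote`; on CPUs without the feature the instruction is just a hint that behaves like a NOP). The `Slot`, `produce` and `consume` names are made up for this example, not an established API:

```
// Sketch: producer writes a payload, publishes it with a release store, then
// uses cldemote to push the freshly written line toward shared L3 so a reader
// on another core can fetch it without pulling it out of the writer's dirty L1.
#include <atomic>
#include <cstdint>
#include <immintrin.h>

struct alignas(64) Slot {             // one cache line, to avoid false sharing
    uint64_t payload;
    std::atomic<uint64_t> seq{0};     // consumer polls this sequence number
};

void produce(Slot& slot, uint64_t value, uint64_t seq) {
    slot.payload = value;
    slot.seq.store(seq, std::memory_order_release);  // publish
    _mm_cldemote(&slot);              // hint: demote the line toward shared L3
}

uint64_t consume(Slot& slot, uint64_t expected_seq) {
    while (slot.seq.load(std::memory_order_acquire) != expected_seq)
        _mm_pause();                  // spin politely until the producer publishes
    return slot.payload;
}
```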


A task that wants something like this is usually best done on a GPU anyway, not on a few separate deeply pipelined out-of-order exec CPU cores that are trying to do speculative execution, unlike GPUs whose cores are simple in-order pipelines.

You couldn't actually push a result to another core until it retires on the core executing it, because you don't want to have to roll back the other core as well if you discover a mis-speculation such as a branch mispredict earlier in the path of execution leading to this. That could conceivably still allow for something lower-latency than bouncing a cache line between cores through shared memory, but it's a pretty narrow class of application that could use it.

However, high-performance computing is a known use-case for modern CPUs, so if it was really a game-changer it would be worth considering as a design choice, perhaps.

Some things aren't easy or possible to do efficiently given the architecture of existing CPUs. Small units of work with fine-grained cooperation between threads are a problem. Your idea might help if it was practical to implement, but there are major challenges.


Inter-processor Interrupts (IPI), including user-IPI

For OS use, there is the IPI mechanism. But that triggers an interrupt so it doesn't line up data to be read when the receive side's out-of-order exec pipeline gets to it, so it's very different from the mechanism you're suggesting, for different use-cases.

It's quite low-performance, except for avoiding the need for the other side to poll, and for being able to wake a core from a power-saving sleep state so that, if more threads are now ready to run, it can call schedule() to figure out which one to run.

Any core can send an IPI to any other, if it's running in kernel mode.

New in Sapphire Rapids, there is hardware support for the OS letting a user-space process handle some interrupts fully in user-space.

https://lwn.net/Articles/869140/ is an LKML post explaining it and how Linux could support it. Apparently it's about 10x faster than "eventfd" for ping-ponging a tiny message between two user-space threads a million times. Or 15x faster than doing the same with a POSIX signal handler.
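For context, here's a rough sketch of the kind of `eventfd` ping-pong used as the baseline in that comparison (Linux-only; error handling omitted, and the two-eventfd structure is just my illustration). The UIPI version replaces the blocking `read()`/`write()` system calls with `senduipi` plus a user-level handler:

```
// Sketch: two threads ping-pong a tiny notification through a pair of eventfds.
#include <sys/eventfd.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <thread>

int main() {
    int a_to_b = eventfd(0, 0);              // main thread signals thread B
    int b_to_a = eventfd(0, 0);              // thread B signals main thread
    constexpr int kIters = 1000000;

    std::thread b([&] {
        uint64_t v, one = 1;
        for (int i = 0; i < kIters; ++i) {
            read(a_to_b, &v, sizeof v);      // block until pinged
            write(b_to_a, &one, sizeof one); // pong back
        }
    });

    uint64_t v, one = 1;
    for (int i = 0; i < kIters; ++i) {
        write(a_to_b, &one, sizeof one);     // ping
        read(b_to_a, &v, sizeof v);          // wait for pong
    }
    b.join();
    std::puts("done");                       // time the loop to get round trips/sec
}
```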

Kernel managed architectural data structures

  • UPID: User Posted Interrupt Descriptor - Holds receiver interrupt vector information and notification state (like an ongoing notification, suppressed notifications).
  • UITT: User Interrupt Target Table - Stores UPID pointer and vector information for interrupt routing on the sender side. Referenced by the senduipi instruction.

The interrupt state of each task is referenced via MSRs which are saved and restored by the kernel during context switch.

Instructions

  • senduipi <index> - Send a user IPI to a target task based on the UITT index.
  • clui - Mask user interrupts by clearing UIF (User Interrupt Flag).
  • stui - Unmask user interrupts by setting UIF.
  • testui - Test current value of UIF.
  • uiret - Return from a user interrupt handler.

So it does have new state to be saved/restored on context switch. I suspect it's smaller than the queues you might have been picturing, though. And critically, it doesn't need a receive queue anywhere for threads that aren't running, because the state involves tables in memory and there's no data, just a pending-or-not flag, I guess. So in the not-running-receiver case, it can just set a bit in a table that senders need to be able to see anyway, for the HW to know where to direct the UIPI. Unlike your idea, which would need to find the kernel's saved register state or some other space and append to a variable-size(?) buffer.

If the receiver is running (CPL=3), then the user interrupt is delivered directly without a kernel transition. If the receiver isn't running the interrupt is delivered when the receiver gets context switched back. If the receiver is blocked in the kernel, the user interrupt is delivered to the kernel which then unblocks the intended receiver to deliver the interrupt.

So data still has to go through memory; this is just about telling another thread when to look so it doesn't have to poll / spin.
I think UIPIs are useful for different use-cases than your proposed message queue.

So you still wouldn't generally use this when the receiving thread knows that specific data is coming soon, except maybe to let a thread work on something independent instead of spin-waiting or sleeping.

It's also usable if the thread doesn't specifically expect data soon, unlike your queue idea. So you can have it working on something low priority, but then start work that's part of the critical path as soon as more of that is ready.

It's still an interrupt, so significant overhead is still expected, just a lot less than going through the kernel for a signal handler or similar. I think that, like any interrupt, it would need to drain the out-of-order back-end. Or maybe it's not even that bad, and it could be treated like a branch mispredict since it doesn't have to change privilege level. Discarding instructions in the ROB would give lower interrupt latency, but worse throughput than just re-steering the front-end to the interrupt-handler address.


Incorrect scalability assumptions in your question

> scaling for various algorithms that are bottlenecked on core-to-core bandwidth (and/or latency)?

Mesh interconnects (like Intel since Skylake Xeon) allow pretty large aggregate bandwidth between cores. There isn't a single shared bus they all have to compete for. Even the ring bus Intel used before Skylake-Xeon, and still uses in client chips, is pipelined and has pretty decent aggregate bandwidth.

Data can be moving between every pair of cores at the same time. (I mean, 128 pairs of cores can each have data in flight in both directions. With some memory-level parallelism, a pipelined interconnect can have multiple cache lines in flight requested by each core.)

This involves shared L3 cache, but typically not DRAM, even across sockets. (Or on AMD, where clusters of cores are tightly connected in a CCX core complex, between cores within the same die.)

See also some Anandtech articles with good benchmarks of inter-core latency (cache-line ping-pong).
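Those benchmarks essentially measure the round-trip time for a cache line bouncing between two cores. Here's a minimal sketch of such a ping-pong microbenchmark (names made up; thread pinning and timing are left out, but you'd pin the two threads to specific cores and time the loop):

```
// Sketch: cache-line ping-pong between two threads using one atomic flag.
#include <atomic>
#include <cstdio>
#include <thread>

alignas(64) std::atomic<int> flag{0};   // the single line that bounces between cores
constexpr int kIters = 1000000;

int main() {
    std::thread t([] {
        for (int i = 0; i < kIters; ++i) {
            while (flag.load(std::memory_order_acquire) != 2 * i + 1) {}  // wait for ping
            flag.store(2 * i + 2, std::memory_order_release);             // pong
        }
    });

    for (int i = 0; i < kIters; ++i) {
        flag.store(2 * i + 1, std::memory_order_release);                 // ping
        while (flag.load(std::memory_order_acquire) != 2 * i + 2) {}      // wait for pong
    }
    t.join();
    std::puts("done");  // elapsed time / (2 * kIters) approximates one-way latency
}
```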


> GPUs also support parallel atomic update acceleration for local arrays.

I think I've heard of some CPUs (at least in theory, maybe in practice) allowing fast memory_order_relaxed atomics by putting a simple ALU into the shared cache, so a core can send an atomic-increment request to shared L3, where it happens on the data there, instead of having to get exclusive ownership of the line temporarily. (With the old value being returned read-only to allow a fetch_add or exchange return value.)

This wouldn't be easily ordered wrt. other loads or stores on other locations done by the core that sent the atomic operation request out to be done by the cache.
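As a concrete example of the access pattern such hardware could offload, here's a sketch of threads doing relaxed increments on a shared histogram (names and sizes made up for illustration). On current x86 every increment still pulls the line into the updating core in exclusive state; a cache-side ALU would let the add happen at L3:

```
// Sketch: relaxed atomic increments on shared bins - no return value needed
// and no ordering required, so a prime candidate for remote execution at the
// shared cache instead of migrating the line to each core.
#include <array>
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

std::array<std::atomic<unsigned>, 256> histogram{};   // shared bins

void worker(unsigned seed) {
    for (int i = 0; i < 100000; ++i) {
        seed = seed * 1664525u + 1013904223u;          // cheap LCG to generate demo data
        histogram[seed >> 24].fetch_add(1, std::memory_order_relaxed);
    }
}

int main() {
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < 8; ++t) threads.emplace_back(worker, t + 1);
    for (auto& th : threads) th.join();
    unsigned long total = 0;
    for (auto& bin : histogram) total += bin.load(std::memory_order_relaxed);
    std::printf("total counts: %lu\n", total);         // expect 8 * 100000
}
```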

Anandtech's Graviton2 review shows a slide that mentions "Dynamic near vs. far atomic execution". That might be a form of this idea! Perhaps it allows an atomic to execute remotely (perhaps at the core owning the cache line?) if the memory-ordering requirements for that instruction are weak enough to allow it. That's just a guess; it's a separate question which I won't dig into further here.

(ARMv8.1 with "large system extensions" provides x86-style single-instruction atomic RMWs, as well as the traditional LL/SC, which requires a retry loop in case of spurious failure but can synthesize any atomic RMW operation, like you can with a CAS retry loop.)
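For example, an RMW operation the ISA has no single instruction for, such as atomic-max, can be synthesized with a compare-exchange retry loop. A sketch (the function name is made up), which compilers turn into an LDXR/STXR loop or, with LSE, a CAS loop on ARM:

```
// Sketch: synthesizing atomic fetch-max out of a CAS retry loop.
#include <atomic>

void atomic_fetch_max(std::atomic<long>& target, long value) {
    long observed = target.load(std::memory_order_relaxed);
    // Retry until we either install `value` or observe something already >= it.
    while (observed < value &&
           !target.compare_exchange_weak(observed, value,
                                         std::memory_order_relaxed)) {
        // compare_exchange_weak reloaded `observed`; the loop re-checks the condition.
    }
}
```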

Peter Cordes
  • @huseyintugrulbuyukisik: Yes, that's what I was talking about. I guess I forgot to specifically say that, only mentioned what they used to use, so updated again. – Peter Cordes May 11 '22 at 20:02
  • @huseyintugrulbuyukisik: AMD's CCX architecture makes groups of cores fairly tightly coupled (sharing an L3 cache separate from other CCXs), so it could be good for that. See https://www.anandtech.com/show/16529/amd-epyc-milan-review/4 and https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-dive-review-5950x-5900x-5800x-and-5700x-tested/5 for example, which have inter-core latency benchmarks for Epyc and for desktop chips respectively. Inter-core latencies are best case 19 to 31 ns (within a CCX), worst case around 100 to 114 ns in a remote CCX on the Zen3 Epyc, 200 ns on the other socket. – Peter Cordes May 11 '22 at 20:10
  • @huseyintugrulbuyukisik: I don't know CUDA/OpenCL well at all, unfortunately, so I don't know what that does. GPU cores don't do speculative execution so not directly equivalent. – Peter Cordes May 11 '22 at 20:24
  • @huseyintugrulbuyukisik: Re: inter-core latencies again, https://www.anandtech.com/show/16594/intel-3rd-gen-xeon-scalable-review/4 has some numbers on inter-core latency on Intel. In theory a mesh is great (and at least high throughput), but the actual latency on Intel is often not great, like in the 45 to 52 ns range on Ice Lake Xeon within a socket. – Peter Cordes May 11 '22 at 20:25
  • @PeterCordes might be worth mentioning that coming with sapphire rapids for Intel at least [ULI will be supported](https://lwn.net/Articles/869140/) with [a few new instructions](https://www.phoronix.com/scan.php?page=news_item&px=Intel-New-Instructions-EHFI) – Noah May 12 '22 at 14:38
  • @Noah: Thanks, that's a really good point, I'd forgotten about UIPIs. That does have some similarities to the idea in the question, in terms of pure user-space IPC, but it's more about notification than moving data. Perhaps that's exactly the point, that low-overhead notification is the hard part, knowing *when* to look for data in a certain buffer. Updated. – Peter Cordes May 12 '22 at 15:56
  • @huseyintugrulbuyukisik: Updated with a section about user-space inter-processor interrupts (thanks to Noah for reminding me), as well as incorporating some of the stuff from comments on other topics. – Peter Cordes May 12 '22 at 15:57
  • @Noah: do you know if `cldemote` is useful for reducing inter-thread latency, helping a writer push its recent stores closer to where other cores can see them? I added a mention of that to the answer. I've suggested it for that purpose before in answers like [x86 MESI invalidate cache line latency issue](https://stackoverflow.com/q/54209039) / [How to force cpu core to flush store buffer in c?](https://stackoverflow.com/q/54067605) / [Is there any way to write for Intel CPU direct core-to-core communication code?](https://stackoverflow.com/q/58741806) – Peter Cordes May 12 '22 at 16:03
  • @PeterCordes can't say for sure regarding `cldemote`. Don't have an Alder Lake CPU but the description seems to suggest as much and [Intel appears to be using it like that in the kernel](https://patches.dpdk.org/project/dpdk/patch/1599700614-22809-2-git-send-email-omkar.maslekar@intel.com/#118410) – Noah May 12 '22 at 17:35
  • @huseyintugrulbuyukisik: Where did you get that idea? That sounds extremely unlikely, even for a core to be able to run `senduipi` at one per 4 or 5 cycle throughput to spam UIPI messages to different destinations. And I'd be extremely surprised if any given receiver could get interrupted and run `uiret` with a throughput of 5 clocks. – Peter Cordes May 12 '22 at 19:15
  • @huseyintugrulbuyukisik: Any given interrupt always interrupts only one logical CPU core, of course. (Including UIPIs, but also including device interrupts like a network card or disk.) IDK if you're picturing that CPUs normally work by having every core get interrupted by a disk or network device, but that would be totally useless since they'd need a mutex to figure out which core should actually talk to the device. Even for timer interrupts, each core has its own. Cores don't run in lock-step. (And each core can temporarily disable interrupts.) – Peter Cordes May 12 '22 at 19:26
  • @huseyintugrulbuyukisik: Anyway, with 100 cores or something, yeah it could be plausible that each core is sending or receiving one every 500 cycles or so at 5GHz, if you're running a microbenchmark or have a really inefficient program that spends most of its time interrupting itself instead of getting work done. – Peter Cordes May 12 '22 at 19:28
  • @huseyintugrulbuyukisik: And also aggregating across multiple cores for some reason? You didn't say that. The usual thing to measure is how fast one core can do something, then you can just scale by number of cores in whatever CPU you're using. IDK if throughput is really very relevant, though, since real use-cases wouldn't be spamming this. It's more about *latency* of the notification, and how much throughput it costs surrounding code, since you'd normally only get interrupted or send an interrupt a few times. – Peter Cordes May 12 '22 at 19:37
  • @huseyintugrulbuyukisik: you can only send a UIPI to a thread of your own process. Thus you can't disturb cores running privileged code with this, or cores running other user's unprivileged code. – Peter Cordes May 12 '22 at 19:42
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/244700/discussion-between-peter-cordes-and-huseyin-tugrul-buyukisik). – Peter Cordes May 12 '22 at 20:02
  • @huseyintugrulbuyukisik: Deleted my earlier off-topic comments after moving to chat; please do the same. – Peter Cordes May 12 '22 at 20:07