
Multiprocessor systems perform "real" memory operations (those that influence definitive executions, not just speculative execution) out of order and asynchronously, because waiting for global synchronization of global state would needlessly stall all executions nearly all the time. On the other hand, immediately outside each individual core, it seems that the memory system, starting with the L1 cache, is purely synchronous, consistent and flat from the allowed-behavior point of view (allowed semantics); obviously timing depends on the cache size and behavior.

So on a CPU, at one extreme there are named "registers", which are private by definition, and at the other extreme there is memory, which is shared. It seems a shame that outside the minuscule space of registers, which have a peculiar naming or addressing mode, memory is always global, shared and globally synchronous, and effectively subject to all fences, even when it is used as unnamed registers, for the purpose of storing more data than would fit in the few real registers, without any possibility of being examined by other threads (except by debugging with ptrace, which obviously stalls, halts, serializes and stores the complete observable state of an execution).

Is that always the case on modern computers (modern = those that can reasonably support C++ and Java)?

Why doesn't the dedicated L1 cache provide register-like semantics for those memory units that are only used by a particular core? The cache must track which memory is shared, no matter what. Memory operations on such local data don't have to be stalled when strict global ordering of memory operations is needed, as no other core is observing them, and the cache has the power to stall such external accesses if needed. The cache would just have to know which memory units are private (not globally readable) until a stall of out-of-order operations, which makes them consistent (the cache would probably need a way to ask the core to serialize operations and publish a consistent state to memory).

Do all CPUs stall and synchronize all memory accesses on a fence or synchronizing operation?

Can the memory be used as an almost infinite register resource not subject to fencing?

curiousguy
  • About the L1 cache with register-like semantics... I suspect that is problematic if a thread is suspended by the OS and resumes on another core. – LWimsey May 23 '19 at 08:40
  • @LWimsey It's an issue for sure, but such a register could be demoted (promoted?) to normal modified data after a full stall (caused by a mode switch or something) and migrated like normal data. – curiousguy May 23 '19 at 10:45
  • The *unit that executes a sequence of CPU instructions* is a "core". An "execution unit" is a component of a core, like a shifter, integer-multiplier, or load-store unit, that does the actual work for one kind of instruction, but not any decoding or tracking of register contents; the rest of the core exists to keep the execution units fed with work and keep track of the results. e.g. see a block diagram of Haswell's execution ports and the units on each port, and the scheduler that feeds them: https://www.realworldtech.com/haswell-cpu/4/. (And a later page for a full diagram of the core) – Peter Cordes May 23 '19 at 18:27
  • @PeterCordes Not an architecture expert, I really meant: that thing with one (user visible) PC register. – curiousguy May 23 '19 at 21:10
  • Ok, then yes you mean "core". I'll edit your question if I have time later, and you haven't done so yourself. – Peter Cordes May 23 '19 at 22:30
  • @PeterCordes Fixed! – curiousguy May 23 '19 at 23:27

2 Answers


In practice, a single core operating on memory that no other threads are accessing doesn't slow down much in order to maintain global memory semantics, vs. how a uniprocessor system could be designed.

But on a big multi-socket system, especially x86, cache coherency (snooping the other socket) is part of what makes memory latency worse for cache misses than on a single-socket system (for accesses that miss in private caches).


Yes, all multi-core systems that you can run a single multi-threaded program on have coherent shared memory between all cores, using some variant of the MESI cache-coherency protocol. (Any exceptions to this rule are considered exotic and have to be programmed specially.)

Huge systems with multiple separate coherency domains that require explicit flushing are more like a tightly-coupled cluster for efficient message passing, not an SMP system. (Normal NUMA multi-socket systems are cache-coherent: Is mov + mfence safe on NUMA? goes into detail for x86 specifically.)


While a core has a cache line in MESI Modified or Exclusive state, it can modify it without notifying other cores about changes. M and E states in one cache mean that no other caches in the system have any valid copy of the line. But loads and stores still have to respect the memory model, e.g. an x86 core still has to commit stores to L1d cache in program order.
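
For example, here's a minimal C++ sketch (the variable names are mine) of why that last point matters to software: a release store can compile to a plain `mov` on x86 precisely because the hardware already commits stores to L1d in program order, while a weakly-ordered ISA needs a barrier or a special store instruction for the same source.

```cpp
#include <atomic>

int data;                     // hypothetical payload another thread will read
std::atomic<int> flag{0};     // hypothetical flag another thread polls

void publish() {
    data = 42;                                  // plain store
    // On x86 this is just a `mov`: in-order commit to L1d already guarantees
    // the store to `data` becomes visible before the flag does.
    // On AArch64 the same release store compiles to `stlr`.
    flag.store(1, std::memory_order_release);
}
```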


L1d and L2 are part of a modern CPU core, but you're right that L1d is not actually modified speculatively. It can be read speculatively.

Most of what you're asking about is handled by a store buffer with store forwarding, allowing store/reload to execute without waiting for the store to become globally visible.

(See also: what is a store buffer? and Size of store buffers on Intel hardware? What exactly is a store buffer?)

A store buffer is essential for decoupling speculative out-of-order execution (writing data+address into the store buffer) from in-order commit to globally-visible L1d cache.
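
As a rough illustration (the function and array names are mine, and whether the compiler really spills depends on the target and optimization level): when more values are live than there are architectural registers, the compiler spills them to the stack, and a reload of a just-spilled value can be served by store-forwarding from the store buffer long before that store commits to L1d.

```cpp
#include <cstddef>

// Sketch: `live` is indexed in loops, so it typically stays in stack memory
// rather than registers. The stores in the first loop and the reloads in the
// second are the store/reload pattern that store-forwarding accelerates.
long spill_and_reload(const long *in, std::size_t n) {
    long live[32];
    for (std::size_t i = 0; i < 32; ++i)
        live[i] = in[i % n] * static_cast<long>(i + 1);  // stores to the stack

    long total = 0;
    for (std::size_t i = 0; i < 32; ++i)
        total += live[i];   // reloads; may be forwarded from the store buffer
    return total;
}
```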

It's very important even for an in-order core, otherwise cache-miss stores would stall execution. And generally you want a store buffer to coalesce consecutive narrow stores into a single wider cache write, especially for weakly-ordered uarches that can do so aggressively; many non-x86 microarchitectures only have fully efficient commit to cache for aligned 4-byte or wider chunks.

On a strongly-ordered memory model, speculative out-of-order loads and checking later to see if any other core invalidated the line before we're "allowed" to read it is also essential for high performance, allowing hit-under-miss for out-of-order exec to continue instead of one cache miss load stalling all other loads.


There are some limitations to this model:

  • limited store-buffer size means we don't have much private store/reload space
  • a strongly-ordered memory model stops private stores from committing to L1d out of order, so a store to a shared variable that has to wait for the line from another core could result in the store buffer filling up with private stores.
  • memory barrier instructions like x86 mfence or lock add, or ARM dsb ish, have to drain the store buffer, so stores to (and reloads from) thread-private memory that's not in practice shared still have to wait for the stores you care about to become globally visible (see the sketch after this list).
  • conversely, waiting for a shared store you care about to become visible (with a barrier or a release-store) also has to wait for private memory operations, even if they're independent.
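
Here is the sketch referenced in the third bullet (a minimal C++ example; the names are mine): on x86 the sequentially-consistent store compiles to an `xchg` (or `mov` + `mfence`), and later loads can't execute until the whole store buffer has drained, including the stores to the purely thread-private scratch array.

```cpp
#include <atomic>

std::atomic<int> shared_flag{0};     // the only variable another thread reads

int fenced_private_work() {
    int scratch[64];                 // thread-private, never shared
    for (int i = 0; i < 64; ++i)
        scratch[i] = i * i;          // private stores sit in the store buffer

    // Full barrier on x86: the private stores above must drain to L1d before
    // any later load can execute, even though nobody else ever reads scratch.
    shared_flag.store(1, std::memory_order_seq_cst);

    return scratch[63];              // later private load, delayed by the drain
}
```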
Peter Cordes
  • I understand that the x86 store buffer unlike a cache has to keep all non repeated memory operations like (a=1;b=2;a=3;b=4) and you can't remove anything as there is no way to correctly make both a and b globally visible w/o following that exact sequence. But a cache only stores the latest value of a and b because a cache has no time dimension. (Obv that's different if you reorder stores.) – curiousguy May 23 '19 at 22:00
  • @curiousguy: that's true for x86, with strong store ordering. A weakly ordered ISA like AArch64 could maybe coalesce non-adjacent stores to the same line (or same 8-byte chunk), at least after the stores "graduate" (store instruction retires from the ROB), because that would mean any loads of the older value have also been executed. – Peter Cordes May 23 '19 at 22:09
  • @curiousguy: but even on x86, if those stores are to the *same* line, it's always allowed to make 2 sequential things simultaneous, just not happen in the other order. So they could coalesce into one `ab=0x0000000400000003` entry in the store buffer, again after graduating. There's some reason to believe that modern x86 CPUs actually do some store coalescing for stores into the same cache line. – Peter Cordes May 23 '19 at 22:12
  • Well, if you suspend shared memory semantics (as in a single-core system) you can remove all older writes in the buffer w/o restrictions. That would make the buffer more efficient IMO for typical access patterns where the same location is updated several times (like an iterator object, that cannot be put in a register by the compiler), alternating with other locations. – curiousguy May 24 '19 at 01:50
  • @curiousguy: x86 memory ordering semantics always apply; DMA can observe memory in a single-core system. Historical single-core x86 CPUs (like P6) did respect the memory model for this reason, even on Write-Back memory regions I think. (Did you mean that comma? Most iterator objects *can* be optimized into a register. With a comma, you're saying that iterator objects in general can't be put in a register.) But anyway yes, to your real point, I guess you could make a store buffer that allowed stores to scratchpad memory physical addresses to commit out of order. – Peter Cordes May 24 '19 at 02:02
  • The CPU would have to know from the physical address that it was non-shared, though. P6 used a northbridge, not integrated memory controllers, and MMIO device memory can be mapped as write-back, so (without knowing that the target address is just DRAM) a store buffer *can't* know that it's safe to reorder. Plus you'd need to build extra functionality for coalescing into your store buffer which would be used more rarely than on a weakly-ordered system. You can't necessarily use scratchpad memory for stack space because you can pass a pointer to a stack object to other threads... – Peter Cordes May 24 '19 at 02:09
  • "_you're saying that iterator objects_ ..." An Java an iterator is an Object, dynamically allocated and with an identity; it can be shared. Usually only a C/C++ object with a identity but allocated on the stack (automatic object) can be optimized correctly. – curiousguy May 24 '19 at 02:10
  • @curiousguy: Oh, I was assuming C++. But with escape analysis, most Java / C# objects that are private to a scope inside a function and in practice *not* shared can be identified. This allows optimizations by the JIT compiler into locals, like C++ automatic storage (on the stack or registers). https://www.beyondjava.net/escape-analysis-java and https://shipilev.net/jvm/anatomy-quarks/18-scalar-replacement/ which explains how the (HotSpot?) JVM replaces some fields with synthetic scalar equivalents, not exactly allocating the exact Object. Anyway, avoiding actual heap `new` is *important*. – Peter Cordes May 24 '19 at 02:19
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/193904/discussion-between-curiousguy-and-peter-cordes). – curiousguy May 24 '19 at 19:16

> the memory is always global, shared and globally synchronous, and effectively entirely subject to all fences, even if it's memory used as unnamed registers,

I'm not sure what you mean here. If a thread is accessing private data (i.e., not shared with any other thread), then there is almost no need for memory fence instructions (1). Fences are used to control the order in which memory accesses from one core are seen by other cores.
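
A small C++ sketch of that distinction (all names are mine): the fence pair below exists only to order the shared payload and flag between cores; the purely local work in the producer needs no fence at all.

```cpp
#include <atomic>

int payload;                      // shared between the two threads
std::atomic<bool> ready{false};   // shared flag

void producer() {
    int local[16];                // thread-private: no fences needed for this
    for (int i = 0; i < 16; ++i)
        local[i] = i * i;

    payload = local[7];                                    // shared store
    std::atomic_thread_fence(std::memory_order_release);   // order it before...
    ready.store(true, std::memory_order_relaxed);          // ...the flag
}

void consumer() {
    while (!ready.load(std::memory_order_relaxed)) { /* spin */ }
    std::atomic_thread_fence(std::memory_order_acquire);   // pairs with release
    int v = payload;              // guaranteed to see the producer's value
    (void)v;
}
```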

> Why doesn't the dedicated L1 cache provide register-like semantics for those memory units that are only used by a particular core?

I think (if I understand you correctly) what you're describing is called a scratchpad memory (SPM), which is a hardware memory structure that is mapped into the architectural physical address space or has its own physical address space. Software can directly access any location in an SPM, similar to main memory. However, an SPM has higher bandwidth and/or lower latency than main memory, but is typically much smaller in size.

SPM is much simpler than a cache because it doesn't need tags, MSHRs, a replacement policy, or hardware prefetchers. In addition, coherence for an SPM works like it does for main memory, i.e., it comes into play only when there are multiple processors.

SPM has been used in many commercial hardware accelerators such as GPUs, DSPs, and manycore processors. One example I am familiar with is the MCDRAM of the Knights Landing (KNL) manycore processor, which can be configured to work as near memory (i.e., an SPM), a last-level cache for main memory, or as a hybrid. The portion of the MCDRAM that is configured to work as an SPM is mapped to the same physical address space as DRAM, and the L2 cache (which is private to each tile) becomes the last-level cache for that portion of MCDRAM. If there is a portion of MCDRAM that is configured as a cache for DRAM, then it would be the last-level cache of DRAM only and not the SPM portion. MCDRAM has a much higher bandwidth than DRAM, but the latency is about the same.
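
As a hedged sketch of how software sees the SPM-style (flat-mode) portion of MCDRAM, here is what an allocation looks like through the memkind library's hbwmalloc interface; the build/link details are assumptions, but the point is that the buffer simply lands in the MCDRAM physical range instead of ordinary DRAM.

```cpp
#include <hbwmalloc.h>   // from libmemkind; link with -lmemkind
#include <cstdio>

int main() {
    if (hbw_check_available() != 0) {   // non-zero: no high-bandwidth memory exposed
        std::fprintf(stderr, "no high-bandwidth memory on this system\n");
        return 1;
    }

    // Request 64 MiB backed by MCDRAM rather than ordinary DRAM.
    double *buf = static_cast<double *>(hbw_malloc(64u << 20));
    if (!buf) return 1;

    // ... bandwidth-bound work on buf ...

    hbw_free(buf);
    return 0;
}
```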

In general, SPM can be placed anywhere in the memory hierarchy. For example, it could be placed at the same level as the L1 cache. SPM improves performance and reduces energy consumption when there is little or no need to move data between the SPM and DRAM.

SPM is very suitable for systems with real-time requirements because it provides guarantees regarding the maximum latency and/or minimum bandwidth, which is necessary to determine with certainty whether real-time constraints can be met.

SPM is not very suitable for general-purpose desktop or server systems, where there can be multiple applications running concurrently. Such systems don't have real-time requirements and, currently, the average bandwidth demand doesn't justify the cost of including something like MCDRAM. Moreover, using an SPM at the L1 or L2 level imposes size constraints on both the SPM and the caches, and makes it difficult for the OS and applications to exploit such a memory hierarchy.

Intel Optane DC memory can be mapped into the physical address space, but it sits at the same level as main memory, so it's not considered an SPM.


Footnotes:

(1) Memory fences may still be needed in single-thread (or uniprocessor) scenarios. For example, if you want to measure the execution time of a specific region of code on an out-of-order processor, it may be necessary to wrap the region between two suitable fence instructions. Fences are also required when communicating with an I/O device through write-combining memory-mapped I/O pages to ensure that all earlier stores have reached the device.
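
As a concrete (hedged) example of the first case, using x86 intrinsics: `do_work` is a placeholder for the region being measured, and `lfence` serves as the execution-serializing fence here (`cpuid` or `rdtscp` are common alternatives).

```cpp
#include <x86intrin.h>   // __rdtsc, _mm_lfence on GCC/Clang (MSVC: <intrin.h>)
#include <cstdint>

void do_work();          // hypothetical: the region whose time we want

uint64_t time_region() {
    _mm_lfence();                 // don't read the TSC until older work is done
    uint64_t start = __rdtsc();
    _mm_lfence();                 // don't start the region before the first read

    do_work();

    _mm_lfence();                 // wait for the region before the second read
    uint64_t end = __rdtsc();
    return end - start;
}
```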

Hadi Brais
  • "_Such systems don't have real-time requirements_" Actually they do when they perform cryptographic operations that don't need to be done fast, but need to be done in value independent time. Also for password checking. (They can be hashed than compared, then the comparison doesn't have that time constraint.) – curiousguy May 24 '19 at 20:05
  • @curiousguy Not really. Constant time doesn't mean real-time, these are different things. A real-time task must be completed according to time constraints. As long as the time constraints are met, the task succeeds irrespective of whether it is completed in constant time or not. Moreover, doing something in constant time doesn't imply that it meets specific time constraints. A task could be both constant time and real time or neither. SPM is useful for real-time systems, but not so much for constant-time implementations. – Hadi Brais May 25 '19 at 15:07
  • SPM doesn't make memory access more predictable? It doesn't remove information leaks via the memory cache? – curiousguy May 25 '19 at 16:12
  • @curiousguy (1) Yes, that's one of the reasons why it's useful for real-time systems, as my answer already describes. (2) SPM does have security-related uses, but it's not that simple and there is no space in the comment section to provide a detailed answer. In general, though, the answer is no, it doesn't remove information leaks by itself. – Hadi Brais May 25 '19 at 16:24