Multi processor systems perform "real" memory operations (those that influence definitive executions, not just speculative execution) out of order and asynchronously as waiting for global synchronization of global state would needlessly stall all executions nearly all the time. On the other hand, immediately outside each individual core, it seems that the memory system, starting with L1 cache, is purely synchronous, consistent, flat from the allowed behavior point of view (allowed semantics); obviously timing depends on the cache size and behavior.
So on a CPU there on one extreme are named "registers" which are private by definition, and on the other extreme there is memory which is shared; it seems a shame that outside the minuscule space of registers, which have peculiar naming or addressing mode, the memory is always global, shared and globally synchronous, and effectively entirely subject to all fences, even if it's memory used as unnamed registers, for the purpose of storing more data than would fit in the few registers, without a possibility of being examined by other threads (except by debugging with ptrace which obviously stalls, halts, serializes and stores the complete observable state of an execution).
Is that always the case on modern computers (modern = those that can reasonably support C++ and Java)?
Why doesn't the dedicated L1 cache provide register-like semantics for those memory units that are only used by a particular core? The cache must track which memory is shared, no matter what. Memory operations on such local data doesn't have to be stalled when strict global ordering of memory operations are needed, as no other core is observing it, and the cache has the power to stall such external accesses if needed. The cache would just have to know which memory units are private (non globally readable) until a stall of out of order operations, which makes then consistent (the cache would probably need a way to ask the core to serialize operations and publish a consistent state in memory).
Do all CPU stall and synchronize all memory accesses on a fence or synchronizing operation?
Can the memory be used as an almost infinite register resource not subject to fencing?