Does processor stall during cache coherence operation

Question

Let's assume that variable a = 0

Processor1: a = 1
Processor2: print(a)

Processor1 executes its instruction first, then in the next cycle processor2 reads the variable to print it. So which of these happens?

Does processor2 stall until the cache coherence operation completes, and then print 1?

```
P1:   |--a=1--|---cache--coherence---|----------------
P2:   ------|stalls due to coherence-|--print(a=1)---|
time: ----------------------------------------------->
```

Or does processor2 proceed before the cache coherence operation completes, keeping a stale view of memory until then, so that it prints 0?
```
P1:   |--a=1--|---cache--coherence---|
P2:   ----------|---print(a=0)---|----
time: ------------------------------->
```
In other words, can a processor have a stale view of memory until cache coherence operations complete?

It depends on the relative timing (it's a race). As soon as processor 1 updates a, the corresponding cache line in processor 2 is invalidated. So if processor 2's read comes slightly before, its cache line is still valid and the old value is fetched. If it comes slightly after, the cache line has been invalidated and the cache has to request the new value. There is no way to know in advance. — Alain Merigot, Apr 01 '19 at 22:05

Peter Cordes · Accepted Answer · 2019-04-08T23:17:02.190

Practically all modern multi-core CPUs use (a variant of) MESI for cache coherency. This keeps the view of memory that all cores share (through cache) coherent at all times.

See, for example, "Can I force cache coherency on a multicore x86 CPU?". It's a common misconception that stores go into cache while other cores still have old copies of the cache line, and then "cache coherence" has to happen.

But that's not the case: to modify a cache line, a CPU needs to have exclusive ownership of the line (Modified or Exclusive state of MESI). This is only possible after receiving responses to a Read For Ownership that invalidates all other copies of the cache line, if it was in Shared or Invalid state before. See "Will two atomic writes to different locations in different threads always be seen in the same order by other threads?" for example.
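The ownership rule can be illustrated with a toy software model of a single cache line. This is only a sketch: `ToyLine`, `kCores`, and the `load`/`store` methods are hypothetical names invented for illustration, and real hardware implements these transitions with coherence messages rather than loops.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Toy model of ONE cache line cached by kCores private caches.
// All names here are hypothetical; this is not how any real CPU is coded.
enum class Mesi { Modified, Exclusive, Shared, Invalid };

constexpr std::size_t kCores = 2;

struct ToyLine {
    std::array<Mesi, kCores> state;
    std::array<int, kCores> copy;  // each core's private copy of the data
    int backing = 0;               // memory, or a shared cache level

    ToyLine() {
        state.fill(Mesi::Invalid);
        copy.fill(0);
    }

    int load(std::size_t core) {
        assert(core < kCores);
        if (state[core] == Mesi::Invalid) {
            bool others_have_copy = false;
            for (std::size_t other = 0; other < kCores; ++other) {
                if (other == core || state[other] == Mesi::Invalid) continue;
                // A dirty copy is written back before it can be shared.
                if (state[other] == Mesi::Modified) backing = copy[other];
                state[other] = Mesi::Shared;
                others_have_copy = true;
            }
            copy[core] = backing;
            state[core] = others_have_copy ? Mesi::Shared : Mesi::Exclusive;
        }
        return copy[core];
    }

    void store(std::size_t core, int value) {
        assert(core < kCores);
        if (state[core] != Mesi::Modified && state[core] != Mesi::Exclusive) {
            // Read For Ownership: invalidate every other copy BEFORE writing.
            for (std::size_t other = 0; other < kCores; ++other) {
                if (other == core || state[other] == Mesi::Invalid) continue;
                if (state[other] == Mesi::Modified) backing = copy[other];
                state[other] = Mesi::Invalid;
            }
        }
        copy[core] = value;
        state[core] = Mesi::Modified;
    }
};
```

In this model, after `line.store(0, 1)` no other core holds a valid copy, so a later `line.load(1)` cannot return the old value: it has to fetch the line, which forces the write-back first.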

However, memory models allow local reordering of stores and loads. Sequential consistency would be too slow, so CPUs always allow at least StoreLoad reordering. See also "Is mov + mfence safe on NUMA?" for lots of details about the TSO (total store order) memory model used on x86. Many other ISAs use an even weaker model.

For an unsynchronized reader in this case, there are three possibilities if the two threads are running on separate cores:

1. load(a) happens on core#2 before the cache line is invalidated, so it reads the old value and thus effectively happens before the a=1 store in the global order. The load can hit in L1d cache.
2. load(a) happens after core#1 has committed the store to its L1d cache, but before the line has been written back. Core#2's read request triggers core#1 to write back to a shared level of cache (e.g. L3), and puts the line into Shared state. The load will definitely miss in L1d.
3. load(a) happens after write-back to memory or at least L3 has already happened, so it doesn't have to wait for core#1 to write back. The load will miss in L1d unless hardware prefetch has brought the line back in for some reason. But usually that only happens as part of sequential accesses (e.g. to an array).

So yes, the load will stall (on a cache miss) if the other core has already committed its store to cache before this core tries to load the line.
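A minimal C++ sketch of the race in the question, assuming `a` is made a relaxed `std::atomic<int>` so that the race is well-defined in the language (the function name `race_once` is hypothetical). Either outcome is allowed; which one you get depends on which of the cases above occurs.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Returns what the reader observed: 0 if its load came before the store
// in a's modification order, 1 if it came after.
int race_once() {
    std::atomic<int> a{0};
    int seen = -1;

    std::thread writer([&] { a.store(1, std::memory_order_relaxed); });
    std::thread reader([&] { seen = a.load(std::memory_order_relaxed); });

    writer.join();
    reader.join();

    // Once both threads have finished, the store is certainly visible.
    assert(a.load() == 1);
    return seen;
}
```

Because each call spawns fresh threads, the writer usually gets a head start, so this harness is not a good way to measure how often each outcome occurs.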

See also "Size of store buffers on Intel hardware? What exactly is a store buffer?" for more about the effect of the store buffer on everything, including memory reordering.

The store buffer doesn't matter here because you have a write-only producer and a read-only consumer. The producer core doesn't wait for its store to become globally visible before continuing, and it can see its own store right away, before it becomes globally visible. It does matter when you have each thread looking at stores done by the other thread; then you need barriers, or sequentially-consistent atomic operations (which compilers implement with barriers). See https://preshing.com/20120515/memory-reordering-caught-in-the-act
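The case where each thread looks at the other's store is the classic store-buffer litmus test. Here is a sketch in C++ (the names `Outcome` and `store_buffer_litmus` are hypothetical). With sequentially-consistent atomics, the C++ memory model forbids the outcome `r1 == 0 && r2 == 0`; with relaxed atomics it is permitted, and StoreLoad reordering can produce it on real hardware.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct Outcome {
    int r1;
    int r2;
};

// Each thread stores to one variable, then loads the other one.
Outcome store_buffer_litmus(bool sequentially_consistent) {
    const std::memory_order order = sequentially_consistent
                                        ? std::memory_order_seq_cst
                                        : std::memory_order_relaxed;
    std::atomic<int> x{0};
    std::atomic<int> y{0};
    int r1 = -1;
    int r2 = -1;

    std::thread t1([&] {
        x.store(1, order);
        r1 = y.load(order);  // may run ahead of the store if order is relaxed
    });
    std::thread t2([&] {
        y.store(1, order);
        r2 = x.load(order);
    });
    t1.join();
    t2.join();

    assert((r1 == 0 || r1 == 1) && (r2 == 0 || r2 == 1));
    return {r1, r2};
}
```

Spawning threads per trial makes the reordering hard to observe in the relaxed variant; the linked article uses long-lived threads and semaphores to line the two sides up more tightly.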

See also "Can num++ be atomic for 'int num'?" for how atomic RMW works with MESI, which is instructive for understanding the concept (e.g. an atomic RMW can work by having a core hang on to a cache line in Modified state, and delay responding to RFOs or requests to share it until the write part of the RMW has committed).
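From the software side, the effect of an atomic RMW can be sketched like this (the function name `count_with_atomic_rmw` is hypothetical). No increment is lost, because each `fetch_add` reads and writes the variable as one indivisible step.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Several threads increment one shared counter with an atomic RMW.
int count_with_atomic_rmw(int threads, int increments_per_thread) {
    assert(threads >= 0 && increments_per_thread >= 0);
    std::atomic<int> num{0};
    std::vector<std::thread> workers;

    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < increments_per_thread; ++i) {
                // Atomic read-modify-write; relaxed ordering is enough
                // because only the final total is inspected.
                num.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return num.load();
}
```

With a plain `int` and `num++` the same program would have a data race and could lose updates.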

It's worth noting that it doesn't have to be that way. In a no-allocate write policy, on a miss, all copies of the line in private caches need only be invalidated before the write is made globally observable. In another hypothetical design, the cache could be used like a store buffer where a cache line can be in non-globally-observable state and tagged with the core ID that has performed it so that it is the only core that can see it. In this design, the write can be performed in the cache until coherence is somehow "triggered." Only then is the write made globally observable. — Hadi Brais, Apr 01 '19 at 23:25
Such design may make sense for ISAs with weak memory models. — Hadi Brais, Apr 01 '19 at 23:26
`to modify a cache line, a CPU needs to have it in Modified state` - or Exclusive, that's the point in E state, otherwise you may as well just use MSI. — Leeor, Apr 05 '19 at 21:47
@Leeor: I was trying to simplify the sentence; actually modifying does always leave it in M state. So I meant to say it has to be able to flip it to M state, or something. Anyway, I think this is a useful simplification for this case, and is arguably not really wrong. — Peter Cordes, Apr 05 '19 at 22:11
@PeterCordes, ok, i'd just say it has to *own* the line in the first place, which means no one else can have a copy. — Leeor, Apr 08 '19 at 23:10

Hadi Brais · Answer 2 · 2020-02-25T03:10:32.543

The read and write accesses to a in this example are concurrent and they may complete in any order. It depends on which processor gets to access the line first. Cache coherence only guarantees that all processors in the same coherence domain agree on the values stored in all cache lines. So the end result cannot be that there are two copies of a, one with a value of 0 and the other with a value of 1.

If you want to make sure that processor2 sees the value written by processor1, then you have to use a synchronization mechanism. A simple, but inefficient way of achieving this is:

```
Processor1:
a = 1
rel = 1

Processor2:
while(rel != 1){ }
print(a)
```

This works if the following properties are satisfied:

Stores are completed in order both at the compiler level and at the ISA level.
Loads are completed in order both at the compiler level and at the ISA level.

An example of an ISA that satisfies these properties is x86-64, assuming that rel is no larger than 8 bytes and naturally aligned, and that none of the variables is allocated from a memory region of the WC memory type.
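In portable C++, the two ordering properties above can be requested explicitly with a release store and an acquire load on `rel`, rather than relying on the ISA. This is a sketch of that approach, not the only way to do it; the function name `handoff` is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Processor1 publishes a through the flag rel; Processor2 waits for the flag.
int handoff() {
    int a = 0;                // ordinary data, published via rel
    std::atomic<int> rel{0};  // the flag
    int seen = -1;

    std::thread processor1([&] {
        a = 1;
        // Release: the write to a cannot be reordered after this store.
        rel.store(1, std::memory_order_release);
    });
    std::thread processor2([&] {
        // Acquire: the read of a cannot be reordered before this load.
        while (rel.load(std::memory_order_acquire) != 1) {
        }
        seen = a;  // guaranteed to observe a == 1
    });

    processor1.join();
    processor2.join();
    assert(rel.load() == 1);
    return seen;
}
```

Making `rel` a `std::atomic` also stops the compiler from hoisting the load out of the spin loop, which covers the compiler-level half of the requirement.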

Regarding your update to the question: if processor1 has obtained ownership of the line before processor2 reads it, then processor2 may stall until processor1 completes its write and processor2 receives the updated line. Processor1 may decide to relinquish ownership of the line before writing to it if it detects a read request for the line from another processor, but this has to be done in such a way that a livelock cannot occur. This is actually a standard example of how livelocks can occur in coherence. Based on Intel's spec update documents, in the Intel Pentium 4 processors, a line allocated due to an RFO request will not be evicted until it's accessed at least once, precisely to prevent livelocks from occurring. This also explains why it's not easy to support speculative RFOs for WB stores that have not retired yet.

I don't think the design of a core giving up a line before writing to it is impractical - in fact, I think modern Intel cores work exactly like that. A core will try to get more than one line by looking ahead into the store buffer and fetching lines that are about to be written. However, if a read request comes in before the line is actually written, the core may relinquish it. It would be dangerous not to, in terms of performance and deadlock. Intel patents describe this method in detail and also mechanisms to reduce the lookahead if it fails often in this way. — BeeOnRope, Apr 03 '19 at 04:40
