Multicopy atomicity vs Cache Coherence

Question

Can you explain what the difference is between multicopy atomicity and cache coherence? How are they related?

In practical terms, a memory ordering model that's multi-copy atomic forbids IRIW reordering. For example, IBM POWER CPUs are cache-coherent but *not* multi-copy atomic: [Will two atomic writes to different locations in different threads always be seen in the same order by other threads?](https://stackoverflow.com/a/50679223) . A few other ISAs have allowed IRIW reordering on paper but not done it in practice. I'm not sure why it's called that or what the technical definition is. — Peter Cordes, Aug 20 '22 at 12:19

pveentjer · Accepted Answer · 2022-08-20T12:26:14.963

Coherence:

There is a total order over all loads and stores to a single location.
A read sees the most recent write in this total order.
This order is consistent with the (preserved) program order.

So coherence only says something about loads and stores to a single location, not about accesses to different addresses; that is the task of consistency.
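As a rough illustration of the single-location guarantee, here is a minimal C++ sketch of a read-read coherence check. The function name `count_corr_violations` is made up for this example. It relies on the C++ rule that every atomic object has a single modification order, which even `memory_order_relaxed` accesses must respect, so a thread that reads `x` twice can never see the value go "backwards".

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: runs a read-read coherence test `iterations` times
// and counts how often the reader saw the writes to x out of order.
int count_corr_violations(int iterations) {
    assert(iterations > 0);
    int violations = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0};
        int r1 = 0, r2 = 0;

        // Writer: two stores to the SAME location, in program order.
        std::thread writer([&] {
            x.store(1, std::memory_order_relaxed);
            x.store(2, std::memory_order_relaxed);
        });
        // Reader: two loads of that location, in program order.
        std::thread reader([&] {
            r1 = x.load(std::memory_order_relaxed);
            r2 = x.load(std::memory_order_relaxed);
        });
        writer.join();
        reader.join();

        // Coherence forbids the second load from seeing an older value
        // than the first one (e.g. r1 == 2, r2 == 1).
        if (r2 < r1) ++violations;
    }
    return violations;
}
```

Calling `count_corr_violations(1000)` should return 0 on any conforming implementation. Note that the test uses only one address; it says nothing about how stores to different addresses are ordered relative to each other.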

If you need a total order over multiple addresses, you need multi-copy atomicity. With multi-copy atomicity, CPUs can't disagree on the order of stores issued by different CPUs to different addresses.

The typical example of this is the IRIW litmus test.

int a=0
int b=0

thread1:
   a=1

thread2:
   b=1

thread3:
   r1=a
   [LoadLoad]
   r2=b

thread4:
   r3=b
   [LoadLoad]
   r4=a

Can it be that r1=1, r2=0, r3=1, r4=0? In other words, can different CPUs see the stores in different orders? Since these are loads and stores to different addresses, (in)coherence is not the issue here.

In a system that is multi-copy atomic, the above situation can't happen.
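For anyone who wants to try this, here is a minimal C++ sketch of the IRIW test. The names `IriwResult`, `run_iriw_once` and `count_forbidden` are made up for this example. It uses `memory_order_seq_cst` for every access, so the C++ memory model itself requires a single total order over all four operations and the outcome r1=1, r2=0, r3=1, r4=0 is forbidden on any hardware; on a machine that is not multi-copy atomic, the compiler has to emit extra barriers to guarantee that. With acquire loads instead, the language would permit that outcome.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Values observed by the two reader threads in one run.
struct IriwResult { int r1, r2, r3, r4; };

// Hypothetical helper: one run of the IRIW litmus test.
IriwResult run_iriw_once() {
    std::atomic<int> a{0}, b{0};
    IriwResult r{0, 0, 0, 0};

    // Two independent writers, each storing to a different address.
    std::thread t1([&] { a.store(1, std::memory_order_seq_cst); });
    std::thread t2([&] { b.store(1, std::memory_order_seq_cst); });

    // Two readers loading the two addresses in opposite orders.
    std::thread t3([&] {
        r.r1 = a.load(std::memory_order_seq_cst);
        r.r2 = b.load(std::memory_order_seq_cst);
    });
    std::thread t4([&] {
        r.r3 = b.load(std::memory_order_seq_cst);
        r.r4 = a.load(std::memory_order_seq_cst);
    });

    t1.join(); t2.join(); t3.join(); t4.join();
    return r;
}

// Counts runs in which the readers disagreed on the order of the stores.
int count_forbidden(int iterations) {
    assert(iterations > 0);
    int forbidden = 0;
    for (int i = 0; i < iterations; ++i) {
        IriwResult r = run_iriw_once();
        if (r.r1 == 1 && r.r2 == 0 && r.r3 == 1 && r.r4 == 0) ++forbidden;
    }
    return forbidden;
}
```

`count_forbidden(1000)` should return 0. A run that never shows the outcome proves nothing by itself, though: spawning fresh threads per iteration gives a very narrow race window, so this is a sketch of the test's shape rather than a serious stress harness.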

Most modern CPUs are multi-copy atomic btw (x86, ARMv8). Modifications to a cache line are linearizable because the moment (linearization point) at which the cache line becomes visible to all other CPUs lies between the start of writing to the cache (and waiting for the RFO acknowledgements) and the completion of writing to the cache. Because linearizability is composable, the whole cache is linearizable. And because it is linearizable, there always exists some total order, and that is exactly what is needed for multi-copy atomicity.

That doesn't imply that the hardware memory model always has a total order, even though it is built on top of a coherent cache. E.g. due to store-to-load forwarding, or sharing the store buffer with a subset of the cores, you could lose the total order.

A great book on the topic you can download for free.

Where does the name "multi-copy atomic" come from? Something about a single coherent+consistent shared state existing, and private caches having copies of it? I've never seen an explanation of what the name means (although your discussion of a "linearization point" comes closest), only that it implies all threads can agree on the order of any two stores. [Unlike in POWER](https://stackoverflow.com/a/50679223) where it's allowed on paper and really does happen in practice, via store-forwarding between SMT threads. — Peter Cordes, Aug 20 '22 at 12:25
The terminology seems confusing here. ARMv8 may be multi-copy atomic from the point of view of the L1 cache, but it is explicitly stated in the manual that it is *not* multi-copy atomic from the point of view of software. It is instead "other multi-copy atomic": there does exist a total order, and all CPUs *except the one actually doing the stores* will observe that order. But the CPU actually doing the stores will instead observe them in its program order, and the two orders need not be consistent. — Nate Eldredge, Aug 20 '22 at 16:56
