
If you have two threads in the same processor, you can have a torn read/write.

For example, on a 32 bit system with thread 1 and thread 2 running on the same core:

  1. Thread 1 assigns a 64 bit int 0xffffffffffffffff to a global variable X, which is initially zero.
  2. Thread 1 commits the first (upper) 32 bits; X is now 0xffffffff00000000
  3. Thread 2 reads X as 0xffffffff00000000
  4. Thread 1 writes the last 32 bits.

The torn read happens in step 3.

But what if the following conditions are met:

  1. Thread 1 and Thread 2 are pinned to different cores
  2. The system uses MESI protocol to achieve cache coherence

In this case, is the torn read still possible? Or would the cache line be seen as invalidated in step 3, thereby preventing the torn read?

Peter Cordes
intprx

1 Answer


Yes, you can have tearing.

A share-request for the line could come in between committing the two separate 32-bit stores. If they're done by separate instructions, the writing thread could even have taken an interrupt between the first and 2nd store, defeating any store coalescing in a store buffer (into aligned 64-bit commits like some 32-bit RISC CPUs are documented to do) that might normally make it hard to observe tearing in practice between separate 32-bit stores.

Another way to get tearing is if the read side loses access to the cache line after reading the first half, before reading the 2nd half (because it received an RFO (Read For Ownership) from the writer core). The first read could see the old value, the 2nd read could see the new value.

The only way for this to be safe is if both the store and the load are each done as a single atomic access to L1d cache of the respective core.

(And if the interconnect itself doesn't introduce tearing; note the case of AMD K10 Opteron that tears on 8-byte boundaries between cores on separate sockets, but seems to have aligned-16-byte atomicity between cores in the same socket. x86 manuals only guarantee 8-byte atomicity, so the 16-byte atomicity is going beyond documented guarantees as a side effect of the implementation.)

Of course, some 32-bit ISAs have a load-pair or store-pair instruction, or (like x86) guaranteed atomicity for 64-bit aligned loads/stores done via the FPU / SIMD unit.


If tearing is normally possible, how would such a microarchitecture implement 64-bit atomic operations?

By delaying the response to MESI requests to share or invalidate a line while it's in the middle of doing a pair of loads or a pair of stores with a special instruction that gives atomicity when a normal load-pair or store-pair wouldn't. The other core is stuck waiting for the response, so there has to be a tight limit on how long you can ever delay responding, otherwise starvation and poor overall forward progress become a problem.

A microarchitecture that normally does a 64-bit access to cache for load-pair / store-pair would get atomicity for free by splitting that one cache access into two register outputs.

But a low-end implementation might not have such wide cache-access hardware. Maybe only LL/SC special instructions have 2-register atomicity. (IIRC, some versions of ARM are like that.)

Further reading:

Peter Cordes
  • Couple of follow-up questions: 1) Does that mean in the example I gave that the MESI request to invalidate the cache line would come after the two stores, and not before the first store? 2) And for your example on atomics, does 'not responding to MESI requests' imply that the other processor will have to wait around until it does get a response (and the atomic operation is complete)? Thanks for the reading resources! Very helpful. – intprx Oct 30 '20 at 05:53
  • @intprx: 1) It might, it might not. The possibility is there for tearing, depending on the timing. 2) I mean *delaying* response until it's safe, with a tight upper bound on how long a delay is possible. Because yes, the other core is stuck waiting until it gets a response. – Peter Cordes Oct 30 '20 at 05:59
  • Just curious at this point, but could you elaborate on how the upper bound of the delay works? First time I've heard of it. Specifically: 1) What if you hit the upper bound, does the atomic operation just fail and retry under the hood? 2) Is this upper bound related to the spurious failures of compare_exchange_weak somehow? – intprx Oct 30 '20 at 06:04
  • @intprx: there isn't a timer for the upper bound, you just design the atomic-store, atomic-load, or atomic-RMW mechanism to take at most 20 clock cycles between "locking" a cache line and unlocking it again. If 20 clocks is too high, you add more hardware support for doing it faster so the average or worst case for response delay is lower. And no, a CAS using `lock cmpxchg [mem], reg` can never spuriously fail. Although hypothetically if you *did* have a CPU with an actual deadline timer and variable internal latency, that would be a mechanism for spurious failure. – Peter Cordes Oct 30 '20 at 06:08
  • @intprx: as you probably already know, the normal mechanism for spurious `CAS_weak` failure is on LL/SC machines, where losing and re-gaining ownership of a cache line is enough to make it fail. Being allowed to spuriously fail is what makes it possible to implement CAS_weak as one LL / compare value / SC, instead of an LL/SC retry loop like CAS_strong which can only fail if the actual value differs, not from ABA changes, let alone false sharing or even (on CPUs where the "monitor" mechanism is coarse) writes to nearby cache lines. – Peter Cordes Oct 30 '20 at 06:11
  • Yes, re compare exchange weak failures I was asking about non-x86. But my guess as to how it works was totally off-base, as there was no actual timer :) – intprx Oct 30 '20 at 06:18