X86 linearizability?

Question

The X86 doesn't provide sequential consistency (SC) out of the box.

X86 provides TSO; so it will provide the following barriers for free

[LoadLoad]
[LoadStore]
[StoreStore]

Regular loads provide acquire semantics.

r1=A
[LoadLoad]
[LoadStore]
...

Regular stores provide release semantics.

...
[StoreStore]
[LoadStore]
X=r1

So X86 for regular loads and stores provide acquire/release semantics.

This isn't sufficient for SC, e.g.

[StoreStore]
[LoadStore]
X=r1
r2=Y
[LoadStore]
[LoadLoad]

In this case the store and load can still be reordered and hence it isn't SC. To remedy this issue a [StoreLoad] barrier can be added (e.g. a MFENCE).

[StoreStore]
[LoadStore]
X=r1
[StoreLoad]<--
r2=Y
[LoadStore]
[LoadLoad]

So now we have upgraded from acquire/release semantics to SC.

On most cases reads are more frequent than writes, so it is most beneficial to do the [StoreLoad] with the write.

[StoreStore]
[LoadStore]
X=r1
[StoreLoad]

My question is about linearizability. The difference between linearizability and SC is that with SC the effect of an operation can be skewed in front of the invocation start or after the invocation completion, but with linearizability it is required that the effect of the invocation is between invocation start and invocation completion.

This brings me to question; can X86 provide linearizability?

Lets first determine invocation start and completion:

Invocation start: the issuing of the instruction; so when a entry on the ROB is reserved.

Invocation completion: the removal of the instruction of the ROB (e.g. in case of a store when the item is moved from the SB to the L1D).

A load will become globally visible when it reads the data either from cache or from memory. This is after start and before completion. MESI protocol will prevent the load reading a stale value.

A store will become globally visible when the stores leaves the SB and hits the L1d. This is also between invocation start and completion.

So to me it looks that X86 can provide linearizability.

My definition of completion above is broken. Removal of the store from the ROB doesn't mean it is globally visible. For that it needs to leave the SB. — pveentjer, May 20 '20 at 06:51

Peter Cordes · Answer 1 · 2020-05-12T12:16:15.423

Stores don't commit to L1d when they retire from the ROB. That would unnecessarily tie execution to commit, losing some of the benefit of hiding occasional cache-miss stores. (A benefit which applies even to in-order CPUs.)

When a store retires from the ROB, the store buffer entry "graduates" and becomes a candidate for commit to L1d. Commit can't happen before retirement. It happens some time after, once it gets to the head of the SB queue (on x86 where commit is in program order). Commit to L1d is the moment at which it becomes globally visible.

(The store buffer always drains itself into the ROB as fast as it can. mfence or a locked instruction merely makes this core wait for that to happen before executing later loads.)

If I've understood your definition of "linearizable" correctly, you need extra barriers beyond merely memory barriers to provide it.

lfence serializes execution in the out-of-order back-end (draining the ROB before later instructions issue), so mfence + lfence can I think fully serialize execution + memory commit by putting such a barrier between two instructions you want to keep fully separate. (e.g. after a store, before an rdtsc that will record when the store buffer was drained.)

Or use a serializing instruction like cpuid. The technical term Intel uses in their manuals is "serializing instruction" for one that can't start until previous instructions are retired, and drains the store buffer before later instructions can issue. This is I think what you're calling "linearization". MFENCE/SFENCE/etc "serialize memory but not instruction execution"?

How many memory barriers instructions does an x86 CPU have? lists x86's serializing instructions.

Or if you define "invocation completion" as "commit to L1d", then linearization is the same as SC on x86 and pretty much every ISA: once a store is committed to L1d cache, it's globally visible to all cores. And pretty much by definition a core isn't done tracking its own store until that happens.

All CPUs that we run threads across have cache-coherent shared memory so no explicit flushing is needed to ensure visibility, and being in coherent L1d = globally visible. MESI coherency requires that a cache line is exclusively owned by a core before it's modified.

Thanks for your reply. If a load is implemented using `r1=X,[LoadLoad][LoadStore]...` and a store is implemented using `...[StoreStore][LoadStore],Y=r2,[StoreLoad]....` is this considered linearizable on the X86? It is SC.. but does the X86 guarantee that effect of the instruction is within invocation start/completion? — pveentjer, May 12 '20 at 05:38
@pveentjer: It depends how you define "invocation completion". You gave two conflicting definitions of retirement from ROB vs. commit from SB to L1d. I didn't notice that until now, updated my answer. — Peter Cordes, May 12 '20 at 12:16
Thanks for the answer. I understand that my definition is broken, I was lacking an important insight which I gained last week. If completion is defined as the store leaving the SB and if the appropriate barriers are in place, X86 can provide linearizability; not just sequential consistency. With sequential consistency loads and stores can be skewed as long as program order isn't violated. But with linearizability this isnt allowed. — pveentjer, May 20 '20 at 06:59
A question about serializing instructions. It stops the front end from issuing instructions. But what happens to newer loads that have already been issued? With an MFENCE these get blocked until the SB has been drained.. but will they be blocked by a serializing instruction? So will a serializing instruction stop both front end from issuing instructions and stop the loads from executing, or will it only stop the front end from issuing instructions but already issued loads to continue? — pveentjer, May 26 '20 at 07:36
There can't be any instructions in the back end younger than a serializing instruction (including loads). The whole point is that it (logically at least) drains everything, then fully executes + retires itself, then allows the front-end to issue later instructions. i.e. the front end stalls / pauses on serializing instructions just like it does for LFENCE. — Peter Cordes, May 26 '20 at 07:44

X86 linearizability?

1 Answers1