
On x86, lock-prefixed instructions such as `lock cmpxchg` provide barrier semantics in addition to their atomic operation: for normal accesses to write-back (WB) memory regions, reads and writes are not reordered across lock-prefixed instructions, per Section 8.2.2 of Volume 3 of the Intel SDM:

Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions.

That section applies only to the write-back (WB) memory type. In the same list, there is an exception noting that weakly ordered stores are not ordered with other writes:

  • Reads are not reordered with other reads.
  • Writes are not reordered with older reads.
  • Writes to memory are not reordered with other writes, with the following exceptions:
    — streaming stores (writes) executed with the non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); and
    — string operations (see Section 8.2.4.1).

Note that there is no exception made for non-temporal instructions in any of the other items in the list, e.g., in the item referring to lock-prefixed instructions.

In various other sections of the guide, it is mentioned that the mfence and/or sfence instructions can be used to order memory when weakly ordered (non-temporal) instructions are used. These sections generally don't mention lock-prefixed instructions as an alternative.

All that leaves me uncertain: do lock-prefixed instructions provide the same full barrier that mfence provides between weakly ordered (non-temporal) instructions on WB memory? The same question applies again but to any type of access on WC memory.
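
To make the pattern concrete, here is a rough sketch of the two variants I'm comparing (my own illustration: the function and variable names are made up, and I'm assuming the `fetch_add(0, seq_cst)` compiles to a lock-prefixed RMW, which is what GCC and Clang typically emit on x86):

```c++
#include <atomic>
#include <emmintrin.h>   // _mm_stream_si32 (MOVNTI), _mm_mfence

int payload;                        // ordinary write-back (WB) memory
std::atomic<int> ready{0};

void publish_with_mfence(int v) {
    _mm_stream_si32(&payload, v);   // weakly ordered non-temporal store
    _mm_mfence();                   // documented to order NT stores
    ready.store(1, std::memory_order_relaxed);
}

void publish_with_locked_rmw(int v) {
    _mm_stream_si32(&payload, v);   // weakly ordered non-temporal store
    // Dummy lock-prefixed RMW (assumed to compile to e.g. lock xadd).
    // Is it guaranteed to fence the NT store above the way mfence is?
    ready.fetch_add(0, std::memory_order_seq_cst);
    ready.store(1, std::memory_order_relaxed);
}
```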

BeeOnRope

2 Answers


On all 64-bit AMD processors, MFENCE is a fully serializing instruction and lock-prefixed instructions are not. However, both serialize all memory accesses, according to the AMD manual, Volume 2, Section 7.4.2:

All previous loads and stores complete to memory or I/O space before a memory access for an I/O, locked or serializing instruction is issued.

All loads and stores associated with the I/O and locked instructions complete to memory (no buffered stores) before a load or store from a subsequent instruction is issued.

There are no exceptions or errata related to the serialization properties of these instructions.

It's clear from the Intel manual and documents that both serialize all stores with no exceptions or related errata. MFENCE also serializes all loads, with one erratum documented for most processors based on the Skylake, Kaby Lake, and Coffee Lake microarchitectures, which states that MOVNTDQA from WC memory may pass earlier MFENCE instructions. In addition, many processors based on the Nehalem, Sandy Bridge, Ivy Bridge, Haswell, Broadwell, Skylake, Kaby Lake, Coffee Lake, and Silvermont microarchitectures have an erratum that says that MOVNTDQA from WC memory may pass earlier locked instructions. Processors based on the Core, Westmere, Sunny Cove, and Goldmont microarchitectures don't have this erratum.
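
For concreteness, here is a rough sketch of the kind of consumer sequence those errata are about (my own illustration, not taken from any erratum text; it assumes `wc_src` points into a write-combining mapping set up elsewhere, e.g. by a driver, and it needs SSE4.1 enabled at compile time):

```c++
#include <emmintrin.h>   // _mm_mfence
#include <smmintrin.h>   // _mm_stream_load_si128 (MOVNTDQA), needs -msse4.1

// wc_src is assumed to point into a WC (write-combining) mapping created
// elsewhere; plain user code cannot normally create one itself.
__m128i consume(__m128i *wc_src, volatile int *flag) {
    while (*flag == 0) { }   // wait for the producer to signal
    _mm_mfence();            // intended to order the WC load below
    // On the affected steppings, this MOVNTDQA load from WC memory may
    // still pass the earlier MFENCE (or an earlier locked instruction,
    // on the parts listed above).
    return _mm_stream_load_si128(wc_src);
}
```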

The quote from Necrolis's answer says that the lock prefix may not serialize load operations that reference weakly ordered memory types on the Pentium 4 processors. My understanding is that this looks like a bug in the Pentium 4 processors and doesn't apply to any other processors. It's worth noting, though, that it's not documented in the specification update documents of the Pentium 4 processors.


@PeterCordes's experiments show that, on Skylake, locked instructions don't seem to block ALU instructions from executing out of order, while mfence does serialize ALU instructions (potentially behaving like lfence plus the store-buffer flush of a locked instruction). However, I think this is an implementation detail.
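
A minimal sketch of that kind of experiment (my own, assuming GCC or Clang on x86-64 and TSC-based timing; the instruction counts are arbitrary): time a loop that pairs each barrier with a short chain of independent multiplies and compare how much the chain overlaps with the barrier.

```c++
#include <cstdint>
#include <cstdio>
#include <x86intrin.h>   // __rdtsc

template <bool UseMfence>
static uint64_t time_loop(long iters) {
    long x = 1;
    uint64_t t0 = __rdtsc();
    for (long i = 0; i < iters; ++i) {
        if (UseMfence)
            asm volatile("mfence" ::: "memory");
        else
            asm volatile("lock addl $0, (%%rsp)" ::: "memory", "cc");
        // Three multiplies on a register that doesn't depend on the
        // barrier's memory operand; if the barrier also serializes
        // instruction execution, these cannot start until it completes.
        asm volatile("imul %0, %0\n\timul %0, %0\n\timul %0, %0" : "+r"(x));
    }
    return __rdtsc() - t0;
}

int main() {
    const long n = 10000000;
    std::printf("mfence: %llu cycles\n",
                (unsigned long long)time_loop<true>(n));
    std::printf("lock  : %llu cycles\n",
                (unsigned long long)time_loop<false>(n));
}
```

If mfence blocks out-of-order execution of the multiplies while the locked RMW does not, the per-iteration gap between the two runs should be noticeably larger than the raw latency difference between the two barrier instructions alone.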

Hadi Brais
  • I wonder why `mfence` takes so much longer to execute than locked instructions if it is strictly less powerful? – BeeOnRope May 12 '18 at 00:56
  • The part about prefetch not being ordered with respect to the various instructions is either odd or "obvious" to me. The prefetch instructions don't have any architectural effect (visible output) and are essentially hints to move lines into the cache - they need to be consumed by subsequent load instructions, which must play by the rules of memory ordering. So I can't see any scenario (outside of performance) where the "prefetch is not ordered wrt X" clause would allow a result that "prefetch is ordered wrt X" would not. – BeeOnRope May 12 '18 at 01:04
  • @BeeOnRope Two simple examples where software prefetch ordering matters that I can think of now. 1- when changing the memory addressing mode (paging, protection, paging structures), this changes the way the prefetch address is interpreted. 2- prefetches may set the accessed bit in one or more segment descriptors and paging structures, which is an observable architectural change. In addition, there are complex interactions between software prefetching and other behavior of the processor. The ordering of prefetch instructions certainly matters and is designed intentionally like that. – Hadi Brais May 12 '18 at 01:27
  • @BeeOnRope Regarding your first comment. I'm not sure why. You have to describe in detail how you're measuring performance, which I have not done. I suggest posting it as a new question if you like. – Hadi Brais May 12 '18 at 01:30
  • 1
    At the moment I'm just looking at Agner's tables, which all list `mfence` as ~33 cycles, and complex locked instructions like `lock cmpxchg` as 15-20 cycles in general. I haven't checked personally yet, but I suspect it's correct as this whole "don't use `mfence` just use a no-op `lock op` instruction instead" has been conventional performance wisdom for a while. So I doubt `mfence` is significantly cheaper than listed. I might ask another question, but "why is (obscure) instruction X faster than (obscure) instruction Y" type question tend to be close/downvote magnets, in my experience. – BeeOnRope May 12 '18 at 23:07
  • @BeeOnRope I feel you. Now I'm thinking that maybe you should not bother with posting the question, because I've read in multiple documents that instructions that only offer weaker ordering guarantees might be implemented internally exactly like instructions that offer stronger ordering guarantees. It's just that the guarantees are not publicly documented. But we can never know for sure unless Intel tells us. – Hadi Brais May 12 '18 at 23:19
  • Well I don't find the weaker-implemented-as-stronger to be a problem, really: that's Intel leaving open a door for future optimizations and/or different implementations in their ISA documentation. There isn't much downside to this: people who are OK with the weaker guarantees use the weaker instructions, and even if they happen to end up with stronger behavior, they are at least "tied" with the guaranteed-strong behavior of other instructions. – BeeOnRope May 12 '18 at 23:23
  • This `mfence` case is kind of the opposite: perhaps `mfence` has stronger guarantees (your answer says it doesn't, but I can see the opposite case also - see [some arguments here](https://stackoverflow.com/a/50279772/149138)), but if it doesn't on the current architecture (you say it is strictly weaker), why wouldn't Intel _implement_ it with the same underlying primitives as the other instructions? Not doing it makes no sense to me. OTOH, documenting instructions as having possibly weaker semantics, but not always (or ever) implementing that on every arch makes perfect sense. – BeeOnRope May 12 '18 at 23:26
  • @BeeOnRope Well at the ISA level, it's stronger with respect to other fence instructions, not lock-prefixed ones. At the implementation level, mfence could have the same serializing properties as the lock prefix but certainly cannot provide atomicity. – Hadi Brais May 12 '18 at 23:33
  • @BeeOnRope I've quickly gone through your answer on that question, which seems to be the opposite of my answer. But you seem to have missed the quotes I mentioned in my answer, which might have led you the other way. – Hadi Brais May 12 '18 at 23:41
  • Yes, it's the opposite, but I think actually it's wrong and that it's more likely that `lock` instructions do serialize weakly ordered memory accesses but are just implemented differently, which makes them faster - at least in a back-to-back latency test. It seems that _perhaps_ `mfence` has some additional serializing behavior in the case of obscure stuff like `clflushopt` and (maybe, maybe not) prefetch instructions. – BeeOnRope May 17 '18 at 17:52
  • @BeeOnRope According to the Intel manual V2, both `MFENCE` and locked instructions serialize both `CLFLUSH` and `CLFLUSHOPT`. In addition, as also mentioned in the answer, both are not ordered with respect to `PREFETCHh` and `PREFETCHW`. – Hadi Brais May 17 '18 at 18:31
  • I was looking in the "ordering" section 8.2.2 in V3, which AFAIK is basically the primary list of ordering rules, and there they list ordering restrictions between `clflushopt` and the various fences, but none with locked instructions. You are right that in the description of `CLFLUSHOPT` in V2 they do mention that it is ordered by locked instructions. I mentioned prefetch because in their `mfence` patent the only use example Intel gives is an `mfence` used to separate a `prefetch` instruction from a non-temporal store. – BeeOnRope May 17 '18 at 18:50
  • It could be that despite the caveat in the manual, prefetch is ordered by `mfence` (and perhaps locked instructions) on current architectures, but Intel doesn't want to _lock_ themselves into that behavior. – BeeOnRope May 17 '18 at 18:51
  • @BeeOnRope It seems to me that there is no inconsistency between what's written in V2 and V3. The ordering guarantees in V2 are a proper subset of those in V3. Also in V2, they only say `CLFLUSH and CLFLUSHOPT cannot pass **earlier** LFENCE` but for other fences it works in both directions as mentioned in V2. What's up with that? In V3, the ordering rules are the same for all fences, including LFENCE, and work in both directions, which makes more sense. – Hadi Brais May 17 '18 at 19:01
  • The manual has many such inconsistencies. They don't even use the word _serialize_ (as in _serializing instruction_) consistently. Still, I'd probably take V2 at its word here: if they document in `clflushopt` that locked instructions serialize it, they probably do. Sometimes it is easier to understand the manual if you consider that it was not written at one moment in time, but rather in many increments with new information being added as new processors are added. The level of detail and format the new information takes is not necessarily the same as the old. – BeeOnRope May 17 '18 at 19:04
  • You can see this a lot in the optimization manual, where it was written at one point for some ancient architecture, and then updated many times for new ones, but the update process isn't exact or exhaustive. So you have many sections that only mention old architectures, and they may apply to new architectures, or not, or the advice may even be the opposite of what a new architecture wants. New sections get added which apply to new architectures, which may or may not apply to old, and so on. Same for the system programming guide: perhaps when `sfence` and `mfence` and friends were added ... – BeeOnRope May 17 '18 at 19:08
  • ... they went through and added new lines for them in the ordering rules, which mentioned `clflush` and (later) `clflushopt` because that was pointed out in the discussion between the doc person and the engineer talking to them, but then they didn't go back and update the lock lines. Similarly, you often have the same concept touched on in many places (e.g., serialization, instruction ordering and memory ordering), but sometimes only some places get updated when a new behavior or instruction or guarantee is documented, leaving the others out of date and fodder for debates. – BeeOnRope May 17 '18 at 19:09
  • @BeeOnRope Yea totally agree. I'd also add the use of different terms to refer to the same thing, undefined terms, vague sentences, and incomplete descriptions of functionality or rules. When I said `no inconsistency` I was only referring to the *functional correctness* of those particular parts of the manual we're discussing. But I think that V2 is more accurate, more up-to-date, and better written than V3. I don't think Section 8.2.2 has been updated for a long time now. – Hadi Brais May 17 '18 at 19:12
  • @BeeOnRope Maybe it'd be nice if Intel created a GitHub repo for the manuals so people can submit issues and stuff. – Hadi Brais May 17 '18 at 19:17
  • Ooops, I read it as "there is an inconsistency" not "there is no inconsistency", so the initial part of my response doesn't make sense. I'm not sure what you mean by proper subset in this case. Are you talking about the "set" of constraints on top of an un-ordered model (so that, for example, a set of rules that allowed only store-load re-ordering would be a **superset** of one that allowed load-load and store-load re-ordering - since the former has more guarantees)? Or do you mean the opposite, like that the set is the list of possible re-orderings on top of a nominally sequentially ... – BeeOnRope May 17 '18 at 19:23
  • ... consistent model, so that in the above example the former rules would be a **subset**? In either case, I don't see the rules in 8.2.2 being a subset or superset of V2: V2 provides strong guarantees in the case of `cflushopt`, but many weaker guarantees in other cases since it doesn't even discuss ordering in general (i.e., you won't find the disallowing of store-store re-ordering in there, AFAIK). That's kind of the point of 8.2.2 to list all the ordering guarantees in one place rather than trying to spread them across the instruction doc, where in many cases there is no obvious "home". – BeeOnRope May 17 '18 at 19:25
  • BTW, I think 8.2.2 is updated fairly frequently, e.g., it refers to `clflushopt` which is a recent instruction (added only in Skylake?). This section has also undergone many revisions and was the subject of much debate in the past when the model was weaker than reality, poorly defined and so on. There are even academic papers which try to formalize it. That's why the section is so large and has so many litmus tests and stuff. So I think you can say that this section definitely has had an above-average amount of effort poured into it. – BeeOnRope May 17 '18 at 19:31
  • Also, the description for `LFENCE` isn't surprising to me. As far as I know, lfence simply stalls execution until all older instructions have executed (retired?). That is enough to fence load instructions, which take effect at execution (and is only needed for loads that bypass the normal ordering guarantees). The behavior with respect to a "store like" instruction (such as `clflush*`) then would be to prevent later store-like instructions from passing it, since they don't even start until the lfence has serialized instruction execution - but not earlier ones, because of the store buffer. – BeeOnRope May 17 '18 at 19:38
  • @BeeOnRope I meant that the ordering guarantees regarding `clflush(opt)` and the fence and locked instructions as mentioned in 8.2.2 are more relaxed (fewer guarantees) than those in V2. Yes, this may not apply to everything in V2 and V3. But I wouldn't call this an inconsistency, it's more like incompleteness. – Hadi Brais May 17 '18 at 19:41
  • @BeeOnRope I wrote an [article](https://hadibrais.wordpress.com/2018/05/14/the-significance-of-the-x86-lfence-instruction/) on lfence a couple of days ago based on what I found in the Intel and AMD manuals and related patents and research papers. One thing that I'm planning to add to the article is how it's related to `MOVNTDQA` and a rough discussion of how lfence gets executed, so it's still ongoing (slowly). – Hadi Brais May 17 '18 at 19:50
  • @BeeOnRope I thought of creating a question on Stack Overflow to basically create a "home" for all ordering rules in x86 considering all instructions and all memory types and even 3D XPoint DIMMs. But I'm not sure this is a good idea; people may not welcome such questions. – Hadi Brais May 17 '18 at 19:58
  • No doubt it would be hard to sandwich a canonical document on all the ordering into a single question in the Q&A format. It would have been a better fit for something like the SO doc feature (which was eliminated). I think the best we can hope for is a series of questions on the key points with good/canonical answers, and perhaps links to them in the tag wiki or some master question if an appropriate one can be devised. That said, I'm not personally an SO rules stickler, so if someone wants to try to shoehorn it into one question I welcome the effort. – BeeOnRope May 17 '18 at 23:20
  • Good article. I added a comment there regarding something I didn't understand and a few other points. – BeeOnRope May 17 '18 at 23:20

Bus locks (via the LOCK opcode prefix) produce a full fence*. However, on WC memory they don't provide the load fence; this is documented in the Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, Section 8.1.2:

For the P6 family processors, locked operations serialize all outstanding load and store operations (that is, wait for them to complete). This rule is also true for the Pentium 4 and Intel Xeon processors, with one exception. Load operations that reference weakly ordered memory types (such as the WC memory type) may not be serialized.

*See the Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, Section 8.2.3.9 for an example.
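
As a rough C++ rendition of the kind of litmus test that section describes (the variable names and the use of std::atomic are my own, not the manual's): each thread does a locked exchange and then a load, and the outcome where both loads see 0 must not occur, precisely because the implicitly locked XCHG acts as a full fence for WB accesses.

```c++
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

void t0() {
    x.exchange(1, std::memory_order_seq_cst);   // compiles to XCHG (implicitly locked)
    r1 = y.load(std::memory_order_seq_cst);     // a plain MOV load on x86
}

void t1() {
    y.exchange(1, std::memory_order_seq_cst);   // compiles to XCHG (implicitly locked)
    r2 = x.load(std::memory_order_seq_cst);     // a plain MOV load on x86
}

int main() {
    std::thread a(t0), b(t1);
    a.join(); b.join();
    // Forbidden outcome: both loads reading the initial value. On x86 this
    // relies on the locked exchanges fully fencing the subsequent loads.
    assert(!(r1 == 0 && r2 == 0));
}
```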

Necrolis
  • Their CPU generation naming in the manual is really bad. It seems like a lot of it was updated in the P4 era. What do they mean by "Xeon processors"? It's clear it can't refer to Xeon _branded_ processors since those have existed on almost every microarchitecture and don't correlate with basic behavior like this. So the above doesn't make it clear what the state is for new processors. – BeeOnRope May 11 '18 at 03:21
  • @BeeOnRope I totally agree, but I tend to take it as meaning "from P6 to present", and when I see "Xeon" I tend to take that as the IA-64 (Itanium) ISA, which has different memory semantics than x86. I suppose the only really definitive way would be to construct a test platform (I don't recall if IACA monitors memory semantics in this regard). – Necrolis May 11 '18 at 07:58
  • Well `P6` refers to the PPro architecture that _preceded_ by many years the P4 (in the first case, the "6" is a continuation of 486, 586 (Pentium), 686 (PPro) - while in the P4 case it's the numerical continuation of Pentium I, Pentium II, Pentium III). Of course, the chips that followed the P4 were derivatives of the P6 architecture, so it is possible the advice for P6 applies both before _and_ after the P4. There are, however, places where P6 definitely only refers to old chips prior to P4, e.g., since it is explicit from the context that they are 32-bit only. – BeeOnRope May 12 '18 at 00:47
  • I don't think Xeon could refer to Itanium as these guides only cover x86 and x86-64 architecture (what Intel calls IA-32 and Intel-64). – BeeOnRope May 12 '18 at 00:48