
The documentation for sfence says:

Performs a serializing operation on all store-to-memory instructions that were issued prior the SFENCE instruction.

What does "serializing operation" mean?

Does it mean that all store-to-memory instructions issued prior to the sfence instruction are guaranteed to complete before execution continues with the instructions after sfence?

BeeOnRope
James
  • Have you read the rest of the Intel documentation? There ought to be some section that explains what a serializing operation is. – fuz May 23 '18 at 10:07
  • Yes, serializing instructions are also full memory barriers; they effectively flush the whole pipeline ([Does lock xchg have the same behavior as mfence?](https://stackoverflow.com/q/40409297)), unlike `lfence` on Intel, which *just* serializes instruction execution, not the store buffer. [Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE?](https://stackoverflow.com/a/50322404) – Peter Cordes May 23 '18 at 11:08
  • You are reading an old version of the doc. The [new version](http://www.felixcloutier.com/x86/SFENCE.html) removes the language above and replaces it with much clearer language that says, among other things, _The processor ensures that every store prior to SFENCE is globally visible before any store after SFENCE becomes globally visible._ which is about as good a one-sentence summary as you are going to get. – BeeOnRope May 24 '18 at 03:11

3 Answers


sfence makes sure that all prior stores in program order become globally visible before any later stores in program order become globally visible. There are two differences compared to what you've written. First, sfence does not serialize only the prior stores that have already issued; it serializes all prior stores irrespective of whether they have been issued or not. Second, it serializes only with respect to all later stores, not all later instructions. That's what is meant by "serializing operation" within the context of sfence.

You've quoted only the first sentence from the documentation, but every sentence matters.
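To make the second difference concrete, here is a minimal sketch (not taken from the Intel manual) of the one case where ordering stores against later stores is not enough: a store followed by a load of a different location. The names `flag0`, `flag1`, `thread0_enter` and `thread1_enter` are hypothetical, and the sketch assumes a C11 compiler. Because sfence does not order later loads, the code asks for a full barrier instead, which compilers for x86 typically implement with `mfence` or a locked instruction.

```c
#include <stdatomic.h>

/* Hypothetical shared flags for a Dekker-style handshake (illustrative names). */
static atomic_int flag0;
static atomic_int flag1;

/* Intended to run on thread 0: announce intent, then look at the other side.
 * A store followed by a load of a different location is the one reordering
 * x86 allows (StoreLoad). sfence would not order the later load, so a full
 * barrier is requested here. Returns 1 if thread 1 has not announced itself. */
int thread0_enter(void)
{
    atomic_store_explicit(&flag0, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst); /* full barrier, not sfence */
    return atomic_load_explicit(&flag1, memory_order_relaxed) == 0;
}

/* Intended to run on thread 1: the mirror image of thread0_enter. */
int thread1_enter(void)
{
    atomic_store_explicit(&flag1, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst); /* full barrier, not sfence */
    return atomic_load_explicit(&flag0, memory_order_relaxed) == 0;
}
```

With the full barrier in place, the two functions cannot both return 1 when they race on different threads. Replacing the barrier with sfence alone would not give that guarantee, since the load could still take its value before the store becomes globally visible.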

Hadi Brais
  • I think (and hope) the OP is using Intel's terminology, where "issued" means "renamed and inserted into the out-of-order core". That happens in program order, so all instructions before `sfence` will have already issued. You're assuming the other convention, where issued/dispatched swap meanings. You're right that `sfence` orders *all* previous stores, not only ones that have already dispatched to an execution port and written something into the store buffer. – Peter Cordes May 24 '18 at 00:57
  • @PeterCordes - probably Intel is just playing fast and loose with terminology here. I don't think they really meant to say that the ordering has anything to do with dispatch or issue (regardless of the naming convention) since that's an internal detail not useful in any ISA doc (and if they were talking about issue to an EU it really makes no sense since that happens out of order). They really just mean "that came first in program order" like Hadi says. It's also confusing they mention stores that happen before, but not stores that happen after, as if there is some asymmetry - but there isn't. – BeeOnRope May 24 '18 at 03:05
  • That's again probably just some leaking of internal details: the way it works is mostly by ensuring that any outstanding stores (weakly ordered ones) go into L1 or are otherwise ordered - but the thing is really a two-way fence for stores. The [current doc](http://www.felixcloutier.com/x86/SFENCE.html) has totally changed the language and is much clearer, talking not about "serialization" but about the ordering guarantees, and removing the hint of asymmetry. Serialization is only mentioned in the short description: "Serializes store operations." – BeeOnRope May 24 '18 at 03:07
  • The use of the word _serialization_ is also a bit overloaded in the Intel manual since a "serializing instruction" is a special thing (as you are aware), and `sfence` is not one of those. IMO the current text is much better. Also I lied when I said the hint of asymmetry is gone - the first sentence is: _Orders processor execution relative to all memory stores prior to the SFENCE instruction._ The next sentence, however, clears it up IMO: _The processor ensures that every store prior to SFENCE is globally visible before any store after SFENCE becomes globally visible._ – BeeOnRope May 24 '18 at 03:08
  • @BeeOnRope Is that just me or is there suddenly a massive uptick of interest in [x86] memory-order/barrier/fence questions? – Iwillnotexist Idonotexist May 24 '18 at 04:12
  • @Peter Cordes *"I think (and hope) the OP is using Intel's terminology, where "issued" means "renamed and inserted into the out-of-order core". That happens in program order"* What I mean by "issue" is that when the CPU executes some store instruction, the instruction will not be completed immediately, but rather it will be placed into the "store buffer" or something (I don't know much about these details), and then when `sfence` is executed, all previous uncompleted instructions will be completed. – James May 24 '18 at 05:01
  • @James: Oh, then no. Only a full serializing instruction like `cpuid` would require all previous instructions (in program order) to retire, *and* flush the store buffer to L1d cache. ([How many memory barriers instructions does an x86 CPU have?](https://stackoverflow.com/q/50323347)). `sfence` only controls the order of operations coming out the far end of the store buffer, not blocking out-of-order *execution* of any instructions. – Peter Cordes May 24 '18 at 05:07
  • @IwillnotexistIdonotexist: It's not just you. I have no clue what prompted all these recent questions about terminology / definitions / basics. (Please vote to make `[memory-fences]` a synonym of `[memory-barriers]`, if you haven't already: https://stackoverflow.com/tags/memory-barriers/synonyms) – Peter Cordes May 24 '18 at 05:08
  • @Peter Cordes What I am interested in is that at some point in my thread (I have a program with multiple threads written in x86 Assembly), I want to have some instruction that, when the CPU executes it, will cause the CPU to make sure that all previous stores become visible to the other threads. So are you saying that `sfence` will not accomplish that, and that I should use `cpuid`? – James May 24 '18 at 05:15
  • @James: previous stores always become visible to other threads as quickly as possible. If you mean you want to delay any later loads until that happens, use `mfence`; it's a full memory barrier (including StoreLoad, blocking the only kind of reordering that x86 allows: http://preshing.com/20120710/memory-barriers-are-like-source-control-operations/ and http://preshing.com/20120515/memory-reordering-caught-in-the-act/). If you actually want to block even ALU instructions, then you need a serializing instruction like `cpuid`, but IDK why you'd ever need that for correctness. – Peter Cordes May 24 '18 at 05:25
  • @James - SFENCE is essentially useless on x86 unless you are using rather obscure stuff like weakly-ordered (NT) stores. Maybe you should ask a question describing what you are trying to do, what is a good outcome and what is a bad outcome. Discussing the details of hardware implementations, while interesting for us, probably won't help you: you just need to rely on the guarantees offered by the ISA regardless of how it is implemented. An example would help. – BeeOnRope May 24 '18 at 06:05
  • @IwillnotexistIdonotexist - yeah, it's not just you. Not sure what's up! – BeeOnRope May 24 '18 at 06:25

The English word Serial - adjective form:

  1. occurring in a series rather than simultaneously

  2. Computers.
    a) of or relating to the apparent or actual performance of data-processing operations one at a time (distinguished from parallel).

    b) of or relating to the transmission or processing of each part of a whole in sequence, as each bit of a byte or each byte of a computer word (distinguished from parallel).

(Serialization can also mean converting an object representation to a bit-stream or byte-stream which can be stored to disk or sent over a network outside of the program. But that's not the meaning that applies in the context of sfence).

Database [serializability](https://en.wikipedia.org/wiki/Serializability) is a more closely related concept.


SFENCE orders the global visibility of earlier stores with respect to SFENCE itself, and later stores. Serializing = imposing an order on things, stopping them from overlapping or happening in parallel.


Note that in Intel terminology, "serializing instruction" has a special meaning: an instruction that flushes the store buffer and the out-of-order instruction pipeline before any later instructions can execute. (They can decode and maybe even issue into the out-of-order core, but not execute). See [How many memory barriers instructions does an x86 CPU have?](https://stackoverflow.com/q/50323347)

sfence is not a "serializing instruction" in that sense; it only orders NT stores with respect to each other and regular stores. (Regular stores are already ordered with respect to each other, so sfence has no effect if there are no NT stores in flight. All you need for correct release semantics is to put regular stores in the right order, e.g. with a compiler barrier to stop compile-time reordering.)
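As a rough illustration of that last point, the sketch below publishes data with regular stores only. The names `payload`, `ready`, `producer` and `consumer_try` are made up for this example, and it assumes a C11 compiler. On x86 the release store normally compiles to a plain `mov`; the memory order mainly stops the compiler from moving the payload store below the flag store, and no sfence is involved.

```c
#include <stdatomic.h>

static int payload;      /* ordinary data, written before the flag */
static atomic_int ready; /* publication flag, zero until the data is ready */

/* Intended to run on the writing thread. */
void producer(int value)
{
    payload = value; /* regular store */
    /* Release store: keeps the payload store ahead of the flag store at
     * compile time; x86 hardware already commits regular stores in order. */
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Intended to run on the reading thread.
 * Returns 1 and fills *out once the payload has been published, else 0. */
int consumer_try(int *out)
{
    if (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        return 0;
    *out = payload;
    return 1;
}
```

A reader that sees `ready == 1` through the acquire load is guaranteed to see the payload written before it, which is all that release/acquire publication needs.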

"serializing" in Intel's definition of sfence is just the plain English meaning of the term, not the "serializing instruction" x86 special meaning.


Current wording of Intel's ISA ref manual entry for sfence:

Intel rewrote the opening paragraph to say "orders" instead of "serializes", except in the short description: Serializes store operations.

The main Description is:

Orders processor execution relative to all memory stores prior to the SFENCE instruction. The processor ensures that every store prior to SFENCE is globally visible before any store after SFENCE becomes globally visible. The SFENCE instruction is ordered with respect to memory stores, other SFENCE instructions, MFENCE instructions, and any serializing instructions (such as the CPUID instruction). It is not ordered with respect to memory loads or the LFENCE instruction.

The first sentence is still kind of bogus, though. Execution isn't ordered, only commit to L1d cache.

Peter Cordes
  • You might want to point out that Intel has removed the text the OP posted from the current version of the manual, so we are all discussing something obsolete where the current text is much more clear and doesn't rely on "serializing" but rather "ordering". – BeeOnRope May 24 '18 at 06:06
  • @BeeOnRope: good idea, updated and quoted. The first sentence of the Description is kind of bogus, talking about ordering *execution* relative to stores. We know what they mean, but if we didn't we might wonder if it did have some effect on out-of-order execution, rather than just the commit end of the store buffer. – Peter Cordes May 24 '18 at 06:17
  • Yeah the first sentence still has a hint of confusion... but it says "relative to all memory stores", so you could read it as "orders _processor execution_ only as is necessary to order stores" which turns out to be "not at all". It sort of makes sense if you think of the second sentence, which is really the key, as clarifying and being specific about the result of the first. It's unfortunate that it says _processor execution_ since that sounds tantalizingly like some kind of instruction serialization, but we know it's not... – BeeOnRope May 24 '18 at 06:19
  • @PeterCordes am I right in thinking that the rename stage handles SFENCE by not issuing any more uops until it knows the store buffer is empty? – Lewis Kelsey Feb 14 '19 at 04:52
  • @LewisKelsey: That's a valid but *very* slow way to implement SFENCE! And that would make it far stronger than needed, like MFENCE. (However, AMD SFENCE might be something like this; AMD gives stronger guarantees for SFENCE than Intel). SFENCE only needs to insert a marker into the store buffer that stops NT stores from passing it in either direction. (Regular stores are already strongly ordered). See the "bonus reading" section of this answer: [Does a memory barrier acts both as a marker and as an instruction?](//stackoverflow.com/q/50338253) – Peter Cordes Feb 14 '19 at 06:21
  • @PeterCordes what I suggested did sound inefficient yes. I see now. SFENCE goes into the store buffer and prevents a store after the SFENCE being selected to execute instead of the store at the head of the queue (presumably would have been reordered based on what cache line it is on -- I'm guessing that the same cache line cannot be accessed on 2 cache ports at the same time so it would have selected a store that doesn't conflict). – Lewis Kelsey Feb 14 '19 at 07:46
  • @LewisKelsey: two consecutive stores to the same line can be *merged* or coalesced inside the store buffer, and commit together in one cache write operation. See [Unexpectedly poor and weirdly bimodal performance for store loop on Intel Skylake](//stackoverflow.com/posts/comments/82834278) and the answer that comment is under. SFENCE may prevent that coalescing, but otherwise x86 not doing StoreStore reordering means that stores have to commit from the store buffer to L1d in program order (unlike on a weakly-ordered ISA). – Peter Cordes Feb 14 '19 at 08:43
  • Only NT stores (which go straight to an LFB, not into L1d) are really affected by SFENCE because normally they *can* commit out of order. Also, recent Intel CPUs like Skylake *can* read + write the same cache line in the same cycle (or else store coalescing saves us?), but IvyBridge / Haswell can't. [Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up?](//stackoverflow.com/posts/comments/95100851) (see that whole Q&A, not just the comment; the RS replaying uops after guessing when a load would be ready is, uh, fun.) – Peter Cordes Feb 14 '19 at 08:43

An sfence prevents stores before the fence from being re-ordered with respect to stores after the fence. That's it. Don't focus on the "serializing" part: Intel has removed the text you quoted from the current version of the manual (you linked an obsolete source).

The new text says[1] (emphasis mine):

Orders processor execution relative to all memory stores prior to the SFENCE instruction. **The processor ensures that every store prior to SFENCE is globally visible before any store after SFENCE becomes globally visible.** The SFENCE instruction is ordered with respect to memory stores, other SFENCE instructions, MFENCE instructions, and any serializing instructions (such as the CPUID instruction). It is not ordered with respect to memory loads or the LFENCE instruction.

Weakly ordered memory types can be used to achieve higher processor performance through such techniques as out-of-order issue, write-combining, and write-collapsing. The degree to which a consumer of data recognizes or knows that the data is weakly ordered varies among applications and may be unknown to the producer of this data. The SFENCE instruction provides a performance-efficient way of ensuring store ordering between routines that produce weakly-ordered results and routines that consume this data.

The second (emphasized) line is the key: this instruction is there to order stores.

It doesn't (necessarily) make stores become visible sooner - that happens naturally on a coherent architecture like x86. It doesn't necessarily serialize instructions surrounding the fence, including stores: it just makes sure stores aren't apparently reordered across the barrier.

Here's a secret though: this instruction is mostly useless in x86 code. The x86 memory model already guarantees that normal stores are ordered with respect to each other: stores from a given CPU become visible in program order to all other CPUs, so sfence doesn't add anything. The only exceptions, where sfence can be useful, are relatively obscure stuff like non-temporal stores or really obscure stuff like WC memory types. If you aren't using those, you don't need this instruction.
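For the non-temporal case, a sketch of where sfence does earn its keep might look like the following. The names `fill_and_publish`, `buffer_is_ready` and `buffer_ready` are hypothetical; the code assumes a C11 compiler and uses the SSE2 intrinsic `_mm_stream_si32` plus `_mm_sfence` when `__SSE2__` is defined, falling back to ordinary stores elsewhere.

```c
#include <stddef.h>
#include <stdatomic.h>
#if defined(__SSE2__)
#include <immintrin.h>
#endif

static atomic_int buffer_ready; /* zero until the buffer has been published */

/* Fill dst[0..n) with value using non-temporal stores, then publish a flag. */
void fill_and_publish(int *dst, size_t n, int value)
{
    for (size_t i = 0; i < n; i++) {
#if defined(__SSE2__)
        _mm_stream_si32(&dst[i], value); /* weakly ordered NT store */
#else
        dst[i] = value;                  /* fallback: ordinary store */
#endif
    }
#if defined(__SSE2__)
    /* Without this, the NT stores above could become globally visible
     * after the flag store below. */
    _mm_sfence();
#endif
    atomic_store_explicit(&buffer_ready, 1, memory_order_release);
}

/* Reader side: nonzero once the buffer contents are safe to read. */
int buffer_is_ready(void)
{
    return atomic_load_explicit(&buffer_ready, memory_order_acquire);
}
```

If the loop used plain stores on x86, the `_mm_sfence()` call could simply be dropped, which is the sense in which the instruction is mostly useless outside NT or WC stores.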


[1] I've also linked an unofficial source as there is no official HTML source that I'm aware of - but I checked that it is up-to-date on sfence as of May 2018.

BeeOnRope