x86-64 usage of LFENCE

Question

I'm trying to understand the right way to use fences when measuring time with RDTSC/RDTSCP. Several questions on SO related to this have already been answered elaborately. I have gone through a few of them. I have also gone through this really helpful article on the same topic: http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf

However, in another online blog, there's an example of using LFENCE instead of CPUID on x86. I was wondering how LFENCE prevents earlier stores from contaminating the RDTSC measurements. E.g.

<Instr A>
LFENCE/CPUID
RDTSC
<Code to be benchmarked>
LFENCE/CPUID
RDTSC

In the above case, LFENCE ensures that all earlier loads complete before it does (since the SDM says: "LFENCE instructions cannot pass earlier reads"). But what about earlier stores (say, Instr A was a store)? I understand why CPUID works, because it IS a serializing instruction, but LFENCE is not.

One explanation I found was in Intel SDM VOL 3A Section 8.3, the following footnote:

LFENCE does provide some guarantees on instruction ordering. It does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes.

So essentially LFENCE acts like an MFENCE. In that case, why do we need two separate instructions LFENCE and MFENCE?

I'm probably missing something.

Thanks in advance.

score 12 · Accepted Answer · answered May 26 '16 at 19:34

The key point is the adverb locally in the quoted sentence "It does not execute until all prior instructions have completed locally".

I was unable to find a clear definition of "complete locally" anywhere in the Intel manuals, so what follows is my own speculation.

In order to be completed locally, an instruction must have its output computed and available to the other instructions further down its dependency chain. Furthermore, any side effect of that instruction must be visible inside the core.

In order to be completed globally an instruction must have its side effects visible to other system components (like other CPUs).

If we don't qualify the kind of "completeness" we are talking about, it generally means either that the distinction doesn't matter or that it is implicit in the context.

For a lot of instructions, being completed locally and being completed globally are the same thing.
For a load, for example, to be completed locally the data must be fetched from memory or the caches. This is the same as being completed globally, since we cannot mark the load complete without first reading from the memory hierarchy.

For a store however the situation is different.

Intel processors have a Store Buffer to handle writes to memory. From Section 11.10 of Volume 3 of the manual:

Intel 64 and IA-32 processors temporarily store each write (store) to memory in a store buffer. The store buffer improves processor performance by allowing the processor to continue executing instructions without having to wait until a write to memory and/or to a cache is complete. It also allows writes to be delayed for more efficient use of memory-access bus cycles.

So a store can be completed locally by being put in the store buffer: from the core's perspective, it is as if the write had gone all the way to memory.
Under specific circumstances, a later load on the same core can even read that value back from the store buffer (this is called Store Forwarding).

To be completed globally, however, a store needs to be drained from the Store Buffer.

Finally, it is necessary to add that the Store Buffer is drained by serializing instructions:

The contents of the store buffer are always drained to memory in the following situations:
• (P6 and more recent processor families only) When a serializing instruction is executed.
• (Pentium III, and more recent processor families only) When using an SFENCE instruction to order stores.
• (Pentium 4 and more recent processor families only) When using an MFENCE instruction to order stores.

With the introduction out of the way, let's see what lfence, mfence and sfence do:

LFENCE does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes.

MFENCE performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior the MFENCE instruction. MFENCE does not serialize the instruction stream.

SFENCE performs a serializing operation on all store-to-memory instructions that were issued prior the SFENCE instruction.

So lfence is a weaker form of serialization that doesn't drain the Store Buffer. Since it effectively serializes instructions locally, all loads before it must have completed before it completes.

sfence serializes stores only: it basically doesn't allow the processor to execute any more stores until the sfence has retired. It also drains the Store Buffer.

mfence is not a simple combination of the two, because it is not serializing in the classical sense: it is an sfence that also prevents later loads from being executed.

It may be worth noting that sfence was introduced first and the other two came later, to give more granular control over memory ordering.
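To make the store side a bit more concrete, here is a minimal sketch of the kind of situation where sfence matters: a weakly ordered non-temporal store that has to become visible before an ordinary "ready" flag store. It assumes GCC or Clang on x86-64 and the SSE2 intrinsics `_mm_stream_si32` and `_mm_sfence`; the names `payload`, `ready` and `publish` are made up for illustration, and a real multi-threaded program would use proper atomics for the flag.

```c
#include <assert.h>     /* static_assert (C11) */
#include <emmintrin.h>  /* _mm_stream_si32 (SSE2); also pulls in _mm_sfence */

static_assert(sizeof(int) == 4, "_mm_stream_si32 stores a 32-bit value");

/* Hypothetical shared data: a payload and a flag announcing it. */
static int payload;
static volatile int ready;

/* Hypothetical producer: write the payload with a non-temporal store,
 * then fence before setting the flag. */
static void publish(int value)
{
    _mm_stream_si32(&payload, value); /* MOVNTI: weakly ordered store */
    _mm_sfence();                     /* earlier stores become globally visible
                                         before any store after the fence */
    ready = 1;                        /* ordinary store, ordered after the fence */
}
```

Without the sfence, the non-temporal store could become visible to other cores after the flag store. No lfence is involved here, because nothing in this sketch depends on when instructions execute locally.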

Finally, I used to enclose an rdtsc instruction between two lfence instructions, to be sure that no reordering "backward" or "forward" was possible.
However, I'm not sure about the soundness of this technique.
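For reference, the lfence / rdtsc / lfence pattern can be sketched in C roughly as follows. This assumes GCC or Clang on x86-64 and the `<x86intrin.h>` intrinsics `__rdtsc`, `_mm_lfence` and `_mm_mfence`; the helper names (`fenced_rdtsc`, `drained_rdtsc`, `time_ticks`, `sample_work`) are made up for illustration. The second variant puts an mfence in front of the lfence, for the case where earlier stores should also be globally visible before the timestamp is taken; as far as I can tell, that is the MFENCE;LFENCE sequence the SDM mentions in its RDTSC description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc, _mm_lfence, _mm_mfence (GCC/Clang, x86 only) */

/* Read the TSC with an LFENCE on each side. */
static inline uint64_t fenced_rdtsc(void)
{
    _mm_lfence();            /* earlier instructions complete locally first */
    uint64_t t = __rdtsc();
    _mm_lfence();            /* later instructions do not start before RDTSC */
    return t;
}

/* Variant that also waits for earlier stores to leave the store buffer. */
static inline uint64_t drained_rdtsc(void)
{
    _mm_mfence();            /* earlier loads and stores become globally visible */
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

/* Hypothetical workload; the volatile accesses keep the loop from being
 * optimised away. */
static volatile uint32_t sink;
static void sample_work(void)
{
    for (uint32_t i = 0; i < 1000; i++)
        sink += i;
}

/* Hypothetical helper: time one call of fn in TSC ticks. The stores made
 * by fn may still be sitting in the store buffer when the second
 * timestamp is taken. */
static uint64_t time_ticks(void (*fn)(void))
{
    assert(fn != NULL);
    uint64_t start = fenced_rdtsc();
    fn();
    uint64_t end = fenced_rdtsc();
    return end - start;
}
```

Calling `time_ticks(sample_work)` gives an elapsed tick count that excludes the time needed to drain the workload's stores; replacing the second `fenced_rdtsc()` with `drained_rdtsc()` would include it. The result is in TSC ticks, which on recent processors need not match core clock cycles.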

Thanks for the elaborate response. So if I understand correctly, LFENCE does not drain the store buffer, but it does make the CPU wait until all prior load and store instructions have completed locally. In that case, we can't rely on it for time (RDTSC) measurement at the end of our benchmark code right? Because, you want to ensure that writes have been made global (flushed to memory) before measuring the time. Thanks. — Chandan, May 27 '16 at 16:42
`lfence` can be used for measurement *if you don't want to wait* for the stores to become globally visible. Writing to memory takes a lot of cycles, and if you don't account for caching carefully it will give inconsistent results. Usually one leaves writes to memory out of the benchmark, unless you explicitly want to test them. In that case use `lfence` with `sfence`, or a serializing instruction that doesn't overwrite the registers you need. — Margaret Bloom, May 27 '16 at 16:57
@MargaretBloom I believe 'locally complete' just means that the data has been loaded from the cache and returned to the load buffer. Usually loads would be allowed to proceed as soon as the TLB / cache port is available. LFENCE prevents this and ensures that all loads before it have retired. LFENCE will disappear when it is at the head of the load buffer. Usually a store can retire as soon as it receives the TLB privilege. SFENCE ensures that a store after it doesn't get selected for query until the SFENCE disappears, meaning all the stores before it are selected first. — Lewis Kelsey, Feb 14 '19 at 14:40
The most likely thing in my mind is that when the SFENCE is at the head of the queue, it causes a delay on the stores for 4-5 cycles to ensure that the previous non-line-fill-buffer store has been committed but also ensures that there are no line fill buffers for the logical core that are waiting to write to cache. That is one theory. — Lewis Kelsey, Feb 14 '19 at 14:40

score 1 · Answer 2 · answered May 26 '16 at 14:04

As you rightly observed, it is a matter of serialization. Your question

why do we need two separate instructions LFENCE and MFENCE?

is answered in the Intel SDM in section "5.6.4 - SSE2 Cacheability Control and Ordering Instructions":

LFENCE Serializes load operations
MFENCE Serializes load and store operations

So LFENCE is probably used because MFENCE isn't necessary for RDTSC.
