
What is the difference in behaviour between an MFENCE/LFENCE/SFENCE instruction used as an intrinsic, placed right before the instruction it's meant to serialize (A), and the same instruction placed inside a function that must be called before the instruction that needs to be serialized (B)?

So basically the difference between

(A):

```asm
LFENCE
RDTSC
```

(B):

```asm
Fence PROC
LFENCE
RET
Fence ENDP

...

CALL Fence
RDTSC
```
Carol Victor
  • 1
    No difference, except that sfence or mfence would also serialize the write to the stack. – prl Mar 31 '21 at 16:54
  • 1
    Why are you talking about intrinsics when you only have asm source code? In asm you write *instructions* directly, without having to ask the compiler to generate the asm you want. (See [When should I use \_mm\_sfence \_mm\_lfence and \_mm\_mfence](https://stackoverflow.com/a/50780314) for an answer that's actually about intrinsics in C++, not asm directly.) – Peter Cordes Mar 31 '21 at 19:39
  • Also note that only LFENCE is guaranteed to serialize execution relative to RDTSC, because serializing instruction execution is part of LFENCE but not the others. MFENCE does on some CPUs as an implementation detail, and SFENCE does on some other CPUs (maybe AMD). – Peter Cordes Mar 31 '21 at 19:45
  • @prl: sfence wouldn't serialize call's store relative to rdtsc. Think of it like a grocery-store checkout conveyor belt divider for the store buffer; it doesn't block retirement or exec in the ROB, only serializes L1d commit order. (IIRC it's stronger on AMD CPUs, but it's not true in general across x86, and this question is even tagged [intel] for some reason.) [Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE?](https://stackoverflow.com/a/50322404) – Peter Cordes Mar 31 '21 at 19:48

1 Answer


It depends on exactly what is in the Fence function. In particular, it depends on what's between the fence and the rdtsc. It also depends on what comes after rdtsc.

Consider the lfence case where rdtsc is at the top of the timed region. Since Fence is called using the call instruction, there is presumably a ret at the end of that function to return to the following rdtsc. This means that there is at least a ret between the lfence and the rdtsc. Most probably the ret here is of the form C3, which gets decoded and allocated into the reservation station as two uops on modern Intel and AMD processors. These uops load the return address from the stack and verify the prediction, so there is a true data dependency between them, and current processors don't use value prediction to break it.

If the load hits in the L1D and the DTLB or STLB, or if the value is forwarded from the store buffer (this is possible because lfence doesn't wait for the store buffer to drain), it's unlikely that there will be a difference between having lfence placed immediately before rdtsc and having a ret between the two instructions. But if the load takes a long time, rdtsc may have already been executed, and later instructions would be in flight in the backend as well. After the load completes, there is still another uop from ret waiting in the RS to be executed. This uop consumes certain resources, could interfere with the other uops in the timed region, and may affect the measured time. Note that even with your simple Fence function, a hardware interrupt could occur just before RET, which would make store forwarding impossible and could end up evicting the return address from the L1D. In any case, unless you hit a pathological instruction sequence in the timed region, this only matters if you really want extreme precision.

You'd normally want to place lfence immediately before rdtsc. You can use a macro instead of a function or force the compiler to inline the function if possible (but even then you still have to examine the generated asm code and make sure it's what you want).
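
As a minimal sketch of the "force the compiler to inline" option, assuming GCC or Clang on x86-64 and the `_mm_lfence` / `__rdtsc` intrinsics from `<x86intrin.h>` (MSVC would use `<intrin.h>` and `__forceinline` instead). The helper name `fenced_rdtsc` is hypothetical and not from the question:

```c
#include <stdint.h>
#include <x86intrin.h>  /* GCC/Clang header providing _mm_lfence and __rdtsc */

/* Hypothetical helper: always_inline asks the compiler to expand this at the
 * call site, so no call/ret should end up between the lfence and the rdtsc.
 * This is a request to the compiler; the generated asm still needs checking. */
static inline __attribute__((always_inline)) uint64_t fenced_rdtsc(void)
{
    _mm_lfence();      /* wait for earlier instructions to complete locally */
    return __rdtsc();  /* read the time-stamp counter */
}
```

Calling `fenced_rdtsc()` at the top of a timed region should then produce the (A) sequence from the question rather than the (B) one, but that is only confirmed by inspecting the compiler output (for example with `-S`).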

sfence doesn't interact with ret or rdtsc, so there are no ordering effects with respect to these instructions. mfence forces the load from ret to wait until most earlier memory-related operations reach the point of global observability or persistence. mfence and sfence alone don't serialize rdtsc.
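
Since mfence alone is not guaranteed to serialize rdtsc, a case that also needs earlier stores to be globally visible before the timestamp is read would pair it with lfence; as far as I recall, Intel's RDTSC documentation describes an MFENCE;LFENCE sequence for this purpose. A hedged sketch, again assuming GCC or Clang on x86-64, with the hypothetical helper name `drained_rdtsc`:

```c
#include <stdint.h>
#include <x86intrin.h>  /* GCC/Clang header providing _mm_mfence, _mm_lfence, __rdtsc */

/* Hypothetical helper: mfence orders earlier loads and stores, and lfence then
 * keeps rdtsc from executing until the preceding instructions have completed. */
static inline __attribute__((always_inline)) uint64_t drained_rdtsc(void)
{
    _mm_mfence();      /* earlier memory operations become globally visible */
    _mm_lfence();      /* the fence that is actually guaranteed to order rdtsc */
    return __rdtsc();  /* read the time-stamp counter */
}
```

The extra mfence makes the timestamp read more expensive, so this variant is only worth it when the visibility of earlier stores is part of what is being measured.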

Hadi Brais
  • 1
    There's no data dependency between the `ret` and `rdtsc`, and nothing is blocking branch prediction / speculative execution. (The return predictor should pretty definitely predict correctly unless there's an interrupt before the ret; this ret matches the last call). So the call has to execute its push and jump uops before lfence. If this call fence / rdtsc is at the bottom of a timed region, the call instruction is part of what's waited for but ret isn't. If it's at the top, then there's a ret that can exec in parallel with getting the start time. – Peter Cordes Mar 31 '21 at 21:07
  • TL:DR: of what I meant to say: the `ret` is *after* the `lfence`, so the data dependency between its component uops shouldn't be relevant, except for which cycle it's going to need to run its port 6 (Intel) uop while the back-end is working on the `rdtsc` uops or uops of whatever real work follows. – Peter Cordes Mar 31 '21 at 21:10
  • @PeterCordes Yeah, I wrote the answer while only thinking about the case where `rdtsc` is at the top of the timed region. Even at the bottom, the call/ret could interfere with the measurement. It could push `rdtsc` to the next group of uops to be allocated, for example. – Hadi Brais Mar 31 '21 at 21:23