It depends on what exactly is in the Fence function. In particular, it depends on what's between the fence and the rdtsc. It also depends on what's after rdtsc.
Consider the lfence case where rdtsc is at the top of the timed region. Since Fence is called using the call instruction, there is probably a ret at the end of that function to return to the rdtsc that follows. This means that there is at least a ret between the lfence and the rdtsc. Most probably the ret here is of the form C3, which gets decoded and allocated into the reservation station as two uops on modern Intel and AMD processors. These uops load the return address from the stack and verify the prediction, so there is a true data dependency between them, and current processors don't use value prediction.
If the load hits in the L1D and the DTLB or STLB, or if the value is forwarded from the store buffer (this is possible because lfence doesn't wait for the store buffer to drain), it's unlikely that there will be a difference between having lfence placed immediately before rdtsc and having a ret between the two instructions. But if the load takes a long time, rdtsc may have already executed, and later instructions would be in flight in the backend as well. After the load completes, there is still another uop from ret waiting in the RS to be executed. This uop consumes certain resources and could interfere with the other uops in the timed region, which may affect the measured time. Note that even with your simple Fence function, a hardware interrupt could occur just before ret, which would make store forwarding impossible and might end up evicting the return address from the L1D. Anyway, this only matters if you hit a pathological instruction sequence in the timed region or you really want extreme precision.
You'd normally want to place lfence immediately before rdtsc. You can use a macro instead of a function, or force the compiler to inline the function if possible (but even then you still have to examine the generated asm code and make sure it's what you want).
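As a minimal sketch of the macro and forced-inline options, assuming GCC or Clang on x86-64 (the names FENCED_RDTSC, fenced_rdtsc, and time_empty_region are hypothetical, not from the question), something like the following keeps lfence directly ahead of rdtsc with no call/ret in between:

```c
#include <stdint.h>
#include <x86intrin.h>  /* GCC/Clang header for _mm_lfence and __rdtsc */

/* Hypothetical macro form: expands in place, so no call/ret is emitted. */
#define FENCED_RDTSC(var) do { _mm_lfence(); (var) = __rdtsc(); } while (0)

/* Hypothetical function form: always_inline asks the compiler to inline
   it even at low optimization levels (GCC/Clang attribute). */
static inline __attribute__((always_inline)) uint64_t fenced_rdtsc(void)
{
    _mm_lfence();      /* lfence immediately before ... */
    return __rdtsc();  /* ... rdtsc */
}

/* Illustrative use: returns the TSC delta across an empty timed region. */
static uint64_t time_empty_region(void)
{
    uint64_t start, end;
    FENCED_RDTSC(start);
    /* the code being timed would go here */
    end = fenced_rdtsc();
    return end - start;
}
```

Even with this, it is still worth checking the disassembly (for example with objdump -d) to confirm that lfence and rdtsc really are adjacent in the generated code.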
sfence doesn't interact with ret or rdtsc, so there are no ordering effects with respect to these instructions. mfence forces the load from ret to wait until most earlier memory-related operations reach the point of global observability or persistence. mfence and sfence alone don't serialize rdtsc.
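If earlier stores also need to be globally visible before the timestamp is taken, the sequence Intel's manual describes for rdtsc is mfence followed by lfence. A rough sketch, again assuming GCC or Clang on x86-64 and using a hypothetical helper name:

```c
#include <stdint.h>
#include <x86intrin.h>  /* GCC/Clang header for _mm_mfence, _mm_lfence, __rdtsc */

/* Hypothetical helper: mfence alone does not order rdtsc, so an lfence
   is placed between it and rdtsc. */
static inline __attribute__((always_inline)) uint64_t rdtsc_after_stores(void)
{
    _mm_mfence();      /* wait for earlier loads and stores to become globally visible */
    _mm_lfence();      /* keep rdtsc from executing before the mfence completes */
    return __rdtsc();
}
```

This is heavier than a plain lfence, so it probably only makes sense when the stores themselves are part of what is being measured.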