For ordering rdtsc wrt. other instructions, lfence is sufficient if you don't need to wait for the store buffer to drain. It has always worked that way on Intel, and does on AMD since the Spectre mitigations. See solution to rdtsc out of order execution?
rdtscp is also guaranteed to be ordered wrt. earlier instructions (but not later; in practice it's probably microcoded pretty much like lfence; rdtsc in that order, plus uops to write ECX with the processor ID.) It's not an x86 serializing instruction, and doesn't even drain the store buffer. (Which you wouldn't necessarily want for timing anyway.) You can use mfence; rdtscp or lock or byte [rsp], 0 ; rdtscp if you want that, or rdtscp; lfence if you want to make sure its few uops can't reorder with later stuff.
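As a rough illustration of the fencing patterns above, here is a minimal C sketch using the GCC/Clang intrinsics from x86intrin.h. The helper names (tsc_start, tsc_stop, tsc_stop_after_stores, time_call, spin_work) are made up for this example, and it assumes an x86 target where rdtscp is available. The fences only constrain the CPU; the compiler may still move unrelated computation around them.

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>   /* GCC/Clang: __rdtsc, __rdtscp, _mm_lfence, _mm_mfence */

/* lfence; rdtsc; lfence: earlier instructions finish executing before the
   TSC is read, and later instructions can't start until it has been read. */
static inline uint64_t tsc_start(void) {
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

/* rdtscp; lfence: rdtscp already waits for earlier instructions; the
   trailing lfence keeps later instructions from reordering with it. */
static inline uint64_t tsc_stop(void) {
    unsigned aux;                    /* receives the processor ID (IA32_TSC_AUX) */
    uint64_t t = __rdtscp(&aux);
    _mm_lfence();
    return t;
}

/* mfence; rdtscp: only if earlier stores should also have left the
   store buffer before the timestamp is taken. */
static inline uint64_t tsc_stop_after_stores(void) {
    unsigned aux;
    _mm_mfence();
    return __rdtscp(&aux);
}

/* Placeholder workload so the sketch has something to time. */
static void spin_work(void) {
    volatile unsigned sink = 0;
    for (unsigned i = 0; i < 1000; i++)
        sink += i;
}

/* Elapsed TSC reference ticks (not core clock cycles) spent in fn. */
static uint64_t time_call(void (*fn)(void)) {
    assert(fn != NULL);
    uint64_t start = tsc_start();
    fn();
    return tsc_stop() - start;
}
```

Calling time_call(spin_work) returns a tick count that includes the overhead of the fences and the TSC reads themselves, so for short regions that overhead should be measured separately and subtracted.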
See also this Q&A for more about the TSC in general, including that it counts at a fixed frequency rather than in CPU core clock cycles.
True serializing instructions
To answer the title question about "serializing instructions" in the x86 technical terminology sense, Alder Lake (and Sapphire Rapids) and later have serialize, which does exactly that and no more.
lfence serializes instruction execution (drains the ROB but not store buffer): See
In a VM, cpuid is a guaranteed vmexit so it's slow. It could possibly be faster to push SS, RSP, RFLAGS, CS, and RIP, and run an iret instruction. (In 64-bit mode, iretq always pops RIP, CS, RFLAGS, RSP, and SS, so all five have to be on the stack.)
When you need a true serializing instruction
Cross-modifying code is a case where a proper serializing instruction can matter vs. something like mfence; lfence. After an acquire load sees a release store indicating that the new code is there, you need to run a serializing instruction. Intel's Volume 3 manual, section 8.1.3, guarantees that's sufficient for cross-modifying code to be safe.
I assume that makes sure old code hasn't already been fetched by the front-end. So a serializing instruction might fully nuke the pipeline, or do the equivalent if there's enough tracking of recent instructions in the pipeline to snoop on L1i invalidations. (That extra snooping might not be worth the power since serializing instructions are hopefully rare. The tracking is needed anyway to handle self-modifying code, snooping store addresses for being near any instruction in flight.)
mfence (or a lock or byte [rsp],0) + lfence wouldn't necessarily be strong enough since lfence only drains the ROB, concerned with instruction execution not fetch, and mfence deals with data load/store. cpuid is a good bet for this case if you can't use serialize.
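One possible way to pick between the two at run time, sketched in C for GCC/Clang on x86: check the SERIALIZE feature bit (CPUID leaf 7, subleaf 0, EDX bit 14) once, then use serialize when it is available and cpuid otherwise. The function names are invented for this example, and the instruction is emitted as raw bytes (0F 01 E8) so the code assembles even with toolchains that don't know the mnemonic.

```c
#include <assert.h>
#include <cpuid.h>      /* GCC/Clang: __get_cpuid_count */
#include <stdbool.h>

/* CPUID.(EAX=7,ECX=0):EDX bit 14 reports support for SERIALIZE. */
static bool cpu_has_serialize(void) {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;                   /* leaf 7 not available */
    return (edx >> 14) & 1u;
}

/* Run a true serializing instruction. The caller passes the cached result
   of cpu_has_serialize() so the feature check isn't repeated every time. */
static void serialize_execution(bool use_serialize) {
    if (use_serialize) {
        /* serialize, encoded as raw bytes */
        __asm__ volatile(".byte 0x0f, 0x01, 0xe8" ::: "memory");
    } else {
        /* Fallback: cpuid is serializing, but a guaranteed vmexit in a VM. */
        unsigned eax = 0, ebx, ecx = 0, edx;
        __asm__ volatile("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
        (void)ebx; (void)edx;
    }
}
```

Typical use would be to call cpu_has_serialize() once at startup, store the result, and pass it to serialize_execution() wherever the reader side of a cross-modifying handshake needs to serialize.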
(Even an atomic RMW or atomic store within an aligned 8-byte chunk in the writer isn't sufficient. On some microarchitectures, I think unaligned code-fetch of 16-byte chunks from L1i cache is possible, so the reader might tear at any boundary.)