4

I have seen the related questions, including here and here, but it seems that the only instruction ever mentioned for serializing rdtsc is cpuid.

Unfortunately, cpuid takes roughly 1000 cycles on my system, so I am wondering if anyone knows of a cheaper (fewer cycles and no read or write to memory) serializing instruction?

I looked at iret, but that seems to change control flow, which is also undesirable.

I have actually looked at the whitepaper linked in Alex's answer about rdtscp, but it says:

The RDTSCP instruction waits until all previous instructions have been executed before reading the counter. However, subsequent instructions may begin execution before the read operation is performed.

That second point seems to make it less than ideal.

Peter Cordes
merlin2011
  • Regarding the edit: Have you read the next section? They add CPUID exactly for that purpose (avoiding subsequent instructions from reordering above the RDTSCP) – Leeor May 03 '14 at 09:05

4 Answers

8

Have you looked at the rdtscp instruction? It is the variant of rdtsc that waits for all previous instructions to execute before reading the counter.

For benchmarking, I would recommend reading this whitepaper. It describes a couple of best practices for measuring clock ticks.

Alex(Intel)

Alexander Weggerle
  • 1
    Thanks for this answer. I actually did look at it before but forgot to add it to my post. I just updated my question. – merlin2011 Apr 25 '14 at 20:04
  • Given the original phrasing of the question, this is still the best answer. – merlin2011 Apr 26 '14 at 08:06
  • Did you have a look at the whitepaper I mentioned above? It explicitly provides ways to work around the limitations of `RDTSCP`. Unfortunately, it doesn't solve the overhead involved. – Alexander Weggerle Apr 28 '14 at 07:43
  • I perused it but have not had a chance to dig into it yet. – merlin2011 Apr 28 '14 at 07:48
  • `rdtscp` isn't serializing, it can reorder with later instructions. It's maybe good at the *end* of a timed region, but you might want `lfence` after it at the start. See [clflush to invalidate cache line via C function](https://stackoverflow.com/a/51830976) for an example. – Peter Cordes Aug 18 '18 at 16:12
5

For ordering rdtsc wrt. other instructions, lfence is sufficient if you don't need to wait for the store buffer to drain. That has always been the case on Intel, and has been the case on AMD since the Spectre mitigations. See solution to rdtsc out of order execution?

rdtscp is also guaranteed to be ordered wrt. earlier instructions, but not later ones. (In practice it's probably microcoded pretty much like `lfence; rdtsc` in that order, plus uops to write ECX with the processor ID.) It's not an x86 serializing instruction, and doesn't even drain the store buffer, which you wouldn't necessarily want for timing anyway. You can use `mfence; rdtscp` or `lock or byte [rsp], 0 ; rdtscp` if you want that, or `rdtscp; lfence` if you want to make sure its few uops can't reorder with later stuff.
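
The fence placement described above can be sketched in C as follows. This is only a minimal illustration, assuming GCC or Clang on x86-64 with `<x86intrin.h>`; the helper names `tsc_begin`, `tsc_end`, and `sum_to` are hypothetical and not from any library.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc, __rdtscp, _mm_lfence (GCC/Clang) */

/* Start timestamp: the first lfence keeps rdtsc from executing before
 * earlier instructions have finished; the second keeps later
 * instructions from starting before the counter has been read. */
static inline uint64_t tsc_begin(void)
{
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

/* End timestamp: rdtscp waits for earlier instructions on its own;
 * the lfence after it stops later work from reordering above it. */
static inline uint64_t tsc_end(void)
{
    unsigned int aux;  /* receives the processor ID value, unused here */
    uint64_t t = __rdtscp(&aux);
    _mm_lfence();
    return t;
}

/* Hypothetical workload, present only so there is something to time. */
static uint64_t sum_to(uint64_t n)
{
    volatile uint64_t acc = 0;
    for (uint64_t i = 1; i <= n; i++)
        acc += i;
    return acc;
}
```

A timed region would then be `uint64_t t0 = tsc_begin(); sum_to(1000); uint64_t dt = tsc_end() - t0;`. The result is in TSC reference ticks and includes the cost of the fences themselves.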

See also this Q&A for more about the TSC in general, including that it counts at a fixed frequency, not in CPU core clock cycles.
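
Because the TSC ticks at a fixed frequency, a tick delta converts to wall-clock time rather than to core cycles. A minimal sketch of that conversion, assuming the caller already knows the TSC frequency (the function name `tsc_ticks_to_ns` and the `tsc_hz` parameter are hypothetical, and how to obtain the frequency is not covered here):

```c
#include <stdint.h>

/* Convert a TSC tick delta to nanoseconds.
 * tsc_hz is an assumption supplied by the caller: the fixed TSC
 * frequency, not the current core clock frequency. */
static inline uint64_t tsc_ticks_to_ns(uint64_t ticks, uint64_t tsc_hz)
{
    /* Split into whole seconds and a remainder so that large tick
     * counts do not overflow 64 bits (holds for tsc_hz below ~18 GHz). */
    uint64_t secs = ticks / tsc_hz;
    uint64_t rem  = ticks % tsc_hz;
    return secs * 1000000000ull + rem * 1000000000ull / tsc_hz;
}
```

To estimate core clock cycles instead, the tick count would have to be scaled by the ratio of the actual core frequency to the TSC frequency during the measurement.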


True serializing instructions

To answer the title question about "serializing instructions" in the x86 technical-terminology sense: Alder Lake (and Sapphire Rapids) and later have serialize, which does exactly that and no more.

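lfence serializes instruction execution (drains the ROB but not the store buffer). See [Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths](https://stackoverflow.com/q/51986046) and [Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE?](https://stackoverflow.com/q/27627969).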

In a VM, cpuid is a guaranteed vmexit so it's slow. It could possibly be faster to push an interrupt-return frame (SS, RSP, RFLAGS, CS, and RIP in 64-bit mode) and run an iret instruction.
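
As a rough sketch of choosing between serialize and cpuid at run time, assuming GCC or Clang on x86-64 with `<cpuid.h>`: the names `cpu_has_serialize` and `serialize_execution` are hypothetical. The feature probe itself runs cpuid, so it is meant to be called once at startup, not inside a timed region.

```c
#include <cpuid.h>    /* __get_cpuid_count (GCC/Clang) */
#include <stdbool.h>

/* CPUID.(EAX=7,ECX=0):EDX bit 14 reports SERIALIZE support.
 * Call once and cache the result: the probe executes cpuid. */
static bool cpu_has_serialize(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;  /* leaf 7 not available */
    return (edx >> 14) & 1u;
}

/* Execute one serializing instruction: serialize if the caller says
 * it is supported, otherwise fall back to cpuid. */
static void serialize_execution(bool use_serialize)
{
    if (use_serialize) {
        /* Raw encoding of serialize (0F 01 E8), so assemblers that
         * predate the mnemonic still accept it. */
        __asm__ volatile(".byte 0x0f, 0x01, 0xe8" ::: "memory");
    } else {
        unsigned int eax = 0, ebx, ecx = 0, edx;
        __asm__ volatile("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
        (void)eax; (void)ebx; (void)ecx; (void)edx;
    }
}
```

Typical use would be `bool has = cpu_has_serialize();` during initialization, then `serialize_execution(has);` wherever a true serializing instruction is required.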


When you need a true serializing instruction

Cross-modifying code is a case where a proper serializing instruction can matter vs. something like mfence;lfence. After an acquire load sees a release store indicating that the new code is there, you need to run a serializing instruction. Intel's Volume 3 manual, section 8.1.3, guarantees that's sufficient for cross-modifying code to be safe.

I assume that makes sure old code hasn't already been fetched by the front-end. So a serializing instruction might fully nuke the pipeline, or do the equivalent if there's enough tracking of recently fetched instructions in the pipeline to snoop on L1i invalidations. (That extra snooping might not be worth the power since serializing instructions are hopefully rare. The tracking is needed anyway to handle self-modifying code, snooping store addresses for being near any instruction in flight.)

`mfence` (or a `lock or byte [rsp], 0`) + `lfence` wouldn't necessarily be strong enough, since lfence only drains the ROB (it is concerned with instruction execution, not fetch) and mfence deals with data loads and stores. cpuid is a good bet for this case if you can't use serialize.

(Even an atomic RMW or atomic store within an aligned 8-byte chunk in the writer isn't sufficient. On some microarchitectures, I think unaligned code-fetch of 16-byte chunks from L1i cache is possible, so the reader might tear at any boundary.)
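
The reader-side procedure described above (acquire load of a flag, then a serializing instruction, then execute the new code) might look roughly like this in C. It is a sketch under assumptions: GCC or Clang on x86-64 with C11 atomics, a hypothetical `code_ready` flag that the writer release-stores after writing the new code bytes, and a plain function pointer standing in for the patched code. The writer side and the requirement that the code bytes be published safely are not shown.

```c
#include <stdatomic.h>

/* Hypothetical flag: the writer release-stores 1 here once the new
 * code bytes are in place. */
static atomic_int code_ready;

/* cpuid used as the serializing instruction (serialize would also do). */
static inline void serializing_insn(void)
{
    unsigned int eax = 0, ebx, ecx = 0, edx;
    __asm__ volatile("cpuid"
                     : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                     :
                     : "memory");
    (void)eax; (void)ebx; (void)ecx; (void)edx;
}

/* Stand-in for the freshly written code, for illustration only. */
static int stand_in_code(void)
{
    return 42;
}

/* Reader side: wait for the flag with an acquire load, run a
 * serializing instruction, and only then call the code. */
static int run_when_ready(int (*patched)(void))
{
    while (!atomic_load_explicit(&code_ready, memory_order_acquire))
        ;  /* spin until the writer publishes */
    serializing_insn();
    return patched();
}
```

The serializing instruction sits between the acquire load and the call, which is the ordering the manual's procedure relies on; `mfence` plus `lfence` in that position is what the text above argues may not be enough.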

Peter Cordes
  • Thanks for this updated answer to my old question! – merlin2011 Feb 16 '23 at 06:07
  • I guess the need for serialising instructions comes from them trying to apply cross-modifying code. In that case an instruction that serialises the instruction stream is needed. – Quân Anh Mai Feb 16 '23 at 08:47
  • @QuânAnhMai: possibly a building block as part of that? Instruction fetch boundaries aren't guaranteed to be 16-byte aligned chunks, so even if the writer does atomic stores or RMWs contained within an aligned 8-byte chunk, the reader might be about to start a fetch part way through that block, anywhere there was previously an instruction boundary. So you need something more than just a serializing instruction. IDK if that's even guaranteed to drain or flush the front-end. Stale instruction fetch is impossible on current CPUs from *self*-modifying code, but IDK about pipeline nukes on x-mod – Peter Cordes Feb 16 '23 at 09:15
  • 1
    @PeterCordes Intel's Developer's Manual, Volume 3, section 8.1.3 talks about self- and cross-modifying code, in which the procedure is to acquire the modified code and run a serialising instruction before executing it. As a result I think a serialising instruction would nuke the entire pipeline. – Quân Anh Mai Feb 16 '23 at 09:55
  • @QuânAnhMai: Ah I see, very interesting. I'd wondered for a long time if serializing was any stronger than draining ROB + store buffer, and for data it probably isn't. But for code-fetch it would have to do something more than `mfence+lfence`, since Intel's manual guarantees that procedure for cross modifying code would work. – Peter Cordes Feb 16 '23 at 10:43
  • You can imagine a mechanism that can avoid a flush of fetch buffers if recently-fetched instructions were from cache lines that weren't invalidated in the meantime, perhaps building on the same mechanisms as for tracking self-modifying code. But otherwise not worth it, just actually discard fetched instructions since cross-modification of code is rare enough. – Peter Cordes Feb 16 '23 at 10:44
1

The answer is apparently not. The Intel Manual, Volume 3a lists only 3 non-privileged serializing instructions (cpuid, iret, and rsm), and the latter two seem to have control-flow side-effects.

merlin2011
  • Upcoming CPUs (like Sapphire Rapids or later) will have `serialize`, which does exactly that and no more. `lfence` serializes instruction execution (drains the ROB but not store buffer): [Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths](https://stackoverflow.com/q/51986046) / [Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE?](https://stackoverflow.com/q/27627969) . (In practice I think `mfence`;`lfence` is probably as strong as `cpuid` for anything that could matter.) – Peter Cordes Feb 15 '23 at 05:58
-1

Well, I guess this is helpful: lfence. See the Intel 64 and IA-32 Architectures Software Developer's Manual, Vol. 2B, page 4-301.

ioilala