How to ensure that RDTSC is accurate?

Question

I've read that RDTSC can give false readings and should not be relied upon.
Is this true and if so what can be done about it?

score 7 · Accepted Answer · edited Aug 30 '21 at 01:59

On very old CPUs RDTSC is accurate: the TSC simply counts core clock cycles.

The problem

However, newer CPUs have a problem.
Engineers decided that RDTSC would be great for telling time.
However, if the CPU throttles its frequency, a cycle-counting TSC is useless for telling time.
The aforementioned braindead engineers then decided to 'fix' this problem by having the TSC always run at the same frequency, even if the CPU slows down.

This has the 'advantage' that the TSC can be used for telling elapsed (wall clock) time. However, it makes the TSC ~~useless~~ less useful for profiling, because it no longer counts the cycles the core actually executed.

How to tell if your CPU is not broken

You can tell if your CPU is fine by reading the TSC_invariant bit reported by CPUID.

Execute CPUID with EAX set to 80000007H and read bit 8 of EDX.
If it is 0 then your CPU is fine.
If it's 1 then your CPU is broken and you need to make sure you profile whilst running the CPU at full throttle.

function IsTimerBroken: boolean;
{$ifdef CPUX86}
asm
  //Make sure RDTSC measures CPU cycles, not wall clock time.
  push ebx           //CPUID clobbers EBX, which must be preserved
  mov eax,$80000007  //Has TSC Invariant support?
  cpuid
  pop ebx
  xor eax,eax        //Assume no
  and edx,$100       //test TSC_invariant bit (bit 8)
  setnz al           //if set, return true, your PC is broken.
end;
{$endif}
{$ifdef CPUX64}
asm
  //Make sure RDTSC measures CPU cycles, not wall clock time.
  mov r8,rbx         //CPUID clobbers RBX, which must be preserved
  mov eax,$80000007  //Has TSC Invariant support?
  cpuid
  mov rbx,r8
  xor eax,eax        //Assume no
  and edx,$100       //test TSC_invariant bit (bit 8)
  setnz al           //if set, return true, your PC is broken.
end;
{$endif}

How to fix out of order execution issues

See: http://www.intel.de/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf

Use the following code:

function RDTSC: int64;
{$IFDEF CPUX64}
asm
  {$IFDEF AllowOutOfOrder}
  rdtsc
  {$ELSE}
  rdtscp        // On x64 we can use RDTSCP, which waits for earlier instructions to finish
  push rbx      // Serialize the code after, to avoid OoO sneaking in
  push rax      // later instructions before the RDTSCP runs.
  push rdx      // See: http://www.intel.de/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf
  xor eax,eax
  cpuid
  pop rdx
  pop rax
  pop rbx
  {$ENDIF}
  shl rdx,32
  or rax,rdx
  {$ELSE}
{$IFDEF CPUX86}
asm
  {$IFNDEF AllowOutOfOrder}
  xor eax,eax
  push ebx
  cpuid         // On x86 we can't assume the existence of RDTSCP
  pop ebx       // so use CPUID to serialize
  {$ENDIF}
  rdtsc
  {$ELSE}
error!
{$ENDIF}
{$ENDIF}
end;

How to run RDTSC on a broken CPU

The trick is to force the CPU to run at 100%.
This is usually done by running the sample code many, many times, so the CPU has ramped up to full clock speed while you measure.
I usually use 1,000,000 iterations to start with.
I then time those 1 million runs 10 times and take the lowest time of those attempts.

Comparisons with theoretical timings show that this gives very accurate results.
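
A minimal sketch of that lowest-of-N approach, assuming the counter is passed in as a parameterless function (for example the `RDTSC` function from this answer) and the code under test as a parameterless procedure. The names `MinTicks`, `TCounterFunc` and `TTestProc` are illustrative only and do not come from the original answer. The result includes the loop and call overhead, so for very short routines that overhead would have to be measured separately and subtracted.

```pascal
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}  // only needed when compiling with Free Pascal

type
  TCounterFunc = function: Int64;  // hypothetical: any tick source, e.g. RDTSC
  TTestProc = procedure;           // hypothetical: the code being timed

// Times `Iterations` back-to-back calls of CodeUnderTest, repeats that
// measurement `Attempts` times and returns the lowest tick count seen.
// Slower attempts (interrupts, cold caches, a CPU that has not yet reached
// full speed) are discarded by keeping only the minimum.
function MinTicks(ReadCounter: TCounterFunc; CodeUnderTest: TTestProc;
  Iterations, Attempts: Integer): Int64;
var
  Attempt, i: Integer;
  StartTicks, Elapsed: Int64;
begin
  Result := High(Int64);
  for Attempt := 1 to Attempts do
  begin
    StartTicks := ReadCounter();
    for i := 1 to Iterations do
      CodeUnderTest();
    Elapsed := ReadCounter() - StartTicks;
    if Elapsed < Result then
      Result := Elapsed;
  end;
end;
```

With the numbers mentioned above this would be called as `MinTicks(RDTSC, MyRoutine, 1000000, 10)`, where `MyRoutine` is a hypothetical procedure holding the code being profiled. Dividing the result by the iteration count gives an approximate per-call figure.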

There's also a feature bit for the TSC not stopping during `hlt` sleep states, which also makes it unusable as a timesource. Linux /proc/cpuinfo calls this `nonstop_tsc`. Using `rdtsc` for timing extremely short instruction sequences is also problematic because of out-of-order execution. `rdtscp` can help for that, but other uses may need a full serializing instruction to make sure `rdtsc` instructions don't pass other insns, and that other insns don't pass it. For profiling, use perf counters. — Peter Cordes, Mar 05 '16 at 21:33
@PeterCordes Perf counters suck. That is why we need `rdtsc`; why it was broken is a mystery to me. Would it have killed Intel to add an extra timer that runs in/out of sync with the main clock? — Johan, Mar 05 '16 at 21:40
I haven't usually had a problem putting my microbenchmark in a loop big enough to use perf counters. For really short sequences, you can use something like IACA or manual uop counting (with Agner Fog's tables and uarch guide) to estimate throughput / latency / fused-domain uop count. I guess it would be nice to have a real cycle counter, I can't disagree. IDK how expensive it would be to implement. Probably not very. If I had to choose, though, the low-overhead high-precision timesource is what I'd pick. — Peter Cordes, Mar 05 '16 at 21:49
@PeterCordes, yes, but if I want to *know* the cycles used, I just use RDTSCP, making sure the CPU is fully taxed. That way I get timings within 2 CPU cycles. — Johan, Mar 05 '16 at 21:51
Related: [How to get the CPU cycle count in x86\_64 from C++?](https://stackoverflow.com/a/51907627) has details on CPUID feature bits relevant to `rdtsc`, and other stuff about how it behaves. (Including `lfence` to serialize the pipeline between it and the timed region). — Peter Cordes, Aug 28 '21 at 00:27
Fair point that providing a new instruction for reference cycles would have been an even better choice than butchering `rdtsc` (or a new insn to read core cycles, so rdtsc can keep the shorter encoding), because perf counters are inconvenient to program and aren't always available at all in a VM. But outside a VM, you *can* set things up so `rdpmc` works a lot like `rdtsc` used to, if you have ECX = the index of a counter you programmed to be counting the `cpu_clk_unhalted.thread` event, or of the fixed counter that always counts that event (if it's enabled at all) — Peter Cordes, Aug 28 '21 at 00:29
