I've read that RDTSC can give false readings and should not be relied upon.
Is this true and if so what can be done about it?
1 Answer
Very old CPUs have a RDTSC that is accurate.
The problem
However, newer CPUs have a problem.
Engineers decided that RDTSC would be great for telling time.
However, if a CPU throttles its frequency, a TSC that counts core cycles is useless for telling time.
The aforementioned braindead engineers then decided to 'fix' this problem by having the TSC always run at the same frequency, even if the CPU slows down.
This has the 'advantage' that the TSC can be used for telling elapsed (wall clock) time. However, it makes the TSC less useful for profiling, because it no longer counts the cycles the core actually executed.
How to tell if your CPU is not broken
You can tell if your CPU is fine by reading the TSC_invariant bit in CPUID.
Set EAX to 80000007H, execute CPUID, and read bit 8 of EDX.
If it is 0 then your CPU is fine.
If it's 1 then your CPU is broken and you need to make sure you profile whilst running the CPU at full throttle.
function IsTimerBroken: boolean;
{$ifdef CPUX86}
asm
  //Make sure RDTSC measures CPU cycles, not wall clock time.
  push ebx           //CPUID overwrites EBX, which must be preserved
  mov eax,$80000007  //Has TSC Invariant support?
  cpuid
  pop ebx
  xor eax,eax        //Assume no
  and edx,$100       //test TSC_invariant bit (bit 8 of EDX)
  setnz al           //if set, return true, your PC is broken.
end;
{$endif}
//Make sure RDTSC measures CPU cycles, not wall clock time.
{$ifdef CPUX64}
asm
  mov r8,rbx         //CPUID overwrites RBX, which must be preserved
  mov eax,$80000007  //TSC Invariant support?
  cpuid
  mov rbx,r8
  xor eax,eax
  and edx,$100       //test bit 8
  setnz al
end;
{$endif}
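For readers not using Delphi, the same check can be sketched in C. This is a minimal sketch, assuming GCC or Clang, whose `<cpuid.h>` header provides `__get_cpuid`; the function names `edx_has_invariant_tsc` and `tsc_is_invariant` are made up for this example. On a non-x86 build it simply reports false.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>   /* GCC/Clang helper header (assumption: not MSVC) */
#endif

/* CPUID leaf 0x80000007 reports the invariant-TSC flag in bit 8 of EDX. */
#define TSC_INVARIANT_BIT (UINT32_C(1) << 8)   /* = 0x100 */

/* Decode the flag from an EDX value that was already read from CPUID. */
bool edx_has_invariant_tsc(uint32_t edx)
{
    return (edx & TSC_INVARIANT_BIT) != 0;
}

/* Returns true when the TSC ticks at a fixed rate (wall clock time)
   instead of counting actual core cycles. Returns false on non-x86
   builds or when the CPU does not support the leaf. */
bool tsc_is_invariant(void)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;
    /* __get_cpuid returns 0 when the requested leaf is unsupported. */
    if (__get_cpuid(0x80000007u, &eax, &ebx, &ecx, &edx))
        return edx_has_invariant_tsc((uint32_t)edx);
#endif
    return false;
}
```

A true result from `tsc_is_invariant()` corresponds to `IsTimerBroken` returning true: the counter measures wall clock time, so profile with the CPU at full throttle.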
How to fix out of order execution issues
Use the following code:
function RDTSC: int64;
{$IFDEF CPUX64}
asm
  {$IFDEF AllowOutOfOrder}
  rdtsc
  {$ELSE}
  rdtscp    // RDTSCP waits for all earlier instructions to execute before reading the TSC.
            // This assumes the CPU supports RDTSCP (CPUID.80000001H:EDX bit 27).
  push rbx  // Serialize the code after, to avoid OoO sneaking in
  push rax  // later instructions before the RDTSCP runs.
  push rdx  // See: http://www.intel.de/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf
  xor eax,eax
  cpuid
  pop rdx
  pop rax
  pop rbx
  {$ENDIF}
  shl rdx,32  // Combine EDX:EAX into a single 64-bit result in RAX
  or rax,rdx
{$ELSE}
{$IFDEF CPUX86}
asm
  {$IFNDEF AllowOutOfOrder}
  xor eax,eax
  push ebx
  cpuid     // On x86 we can't assume the existence of RDTSCP
  pop ebx   // so use CPUID to serialize
  {$ENDIF}
  rdtsc     // Result is returned in EDX:EAX
{$ELSE}
  error!
{$ENDIF}
{$ENDIF}
end;
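A rough C counterpart is sketched below, assuming GCC or Clang on x86-64 and a CPU that supports RDTSCP. It is not a translation of the Delphi code: it uses `lfence` as the barrier instead of CPUID, which is a different (cheaper) technique, and it relies on `lfence` waiting for earlier instructions to complete, which holds on Intel CPUs and on AMD CPUs where the OS has enabled that behaviour. The names `tsc_begin` and `tsc_end` are invented for this example.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, __rdtscp, _mm_lfence (GCC/Clang, x86 only) */

/* Read the TSC at the start of a timed region. */
uint64_t tsc_begin(void)
{
    _mm_lfence();                 /* wait for earlier instructions to complete */
    uint64_t t = __rdtsc();
    _mm_lfence();                 /* keep the timed code from starting early */
    return t;
}

/* Read the TSC at the end of a timed region. */
uint64_t tsc_end(void)
{
    unsigned int aux;             /* receives IA32_TSC_AUX; not used here */
    uint64_t t = __rdtscp(&aux);  /* waits for earlier instructions to execute */
    _mm_lfence();                 /* keep later instructions from starting early */
    return t;
}
```

The code under test goes between a call to `tsc_begin()` and a call to `tsc_end()`; the difference of the two return values is the elapsed tick count, including the overhead of the reads themselves.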
How to run RDTSC on a broken CPU
The trick is to force the CPU to run at 100%, so that it is not throttled while you measure.
This is usually done by running the sample code many, many times.
I usually use 1,000,000 runs to start with.
I then time those 1 million runs 10x and take the lowest time of those attempts.
Comparisons with theoretical timings show that this gives very accurate results.
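The repeat-and-take-the-minimum procedure can be sketched in C as follows. This is only an illustration under assumptions: `min_ticks`, `tick_source_fn` and `workload_fn` are names invented here, and the tick source is passed in as a function pointer so any counter (for example a wrapper around RDTSC) can be plugged in. The `scripted_ticks` and `empty_work` helpers are hypothetical stand-ins that replay fixed readings, so the behaviour is visible without real hardware timing.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t (*tick_source_fn)(void);  /* e.g. a wrapper around RDTSC */
typedef void (*workload_fn)(void);         /* the code being measured */

/* Time `reps` back-to-back calls of `work`, repeat that `attempts` times,
   and return the smallest total tick count observed. */
uint64_t min_ticks(tick_source_fn now, workload_fn work,
                   uint32_t reps, uint32_t attempts)
{
    assert(now && work && reps > 0 && attempts > 0);
    uint64_t best = UINT64_MAX;
    for (uint32_t a = 0; a < attempts; ++a) {
        uint64_t start = now();
        for (uint32_t r = 0; r < reps; ++r)
            work();
        uint64_t elapsed = now() - start;
        if (elapsed < best)
            best = elapsed;   /* slower attempts are treated as noise */
    }
    return best;
}

/* Hypothetical stand-in tick source: replays scripted counter readings. */
static const uint64_t scripted[] = {0, 130, 200, 305, 400, 520};
static unsigned next_reading;

uint64_t scripted_ticks(void)
{
    return scripted[next_reading++ % (sizeof scripted / sizeof scripted[0])];
}

/* Hypothetical stand-in workload that does nothing. */
void empty_work(void) {}
```

With the scripted readings, three attempts see 130, 105 and 120 ticks, so `min_ticks(scripted_ticks, empty_work, 1000, 3)` returns 105. With a real counter, divide the result by `reps` to get the ticks per run.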
-
There's also a feature bit for the TSC not stopping during `hlt` sleep states, which also makes it unusable as a timesource. Linux /proc/cpuinfo calls this `nonstop_tsc`. Using `rdtsc` for timing extremely short instruction sequences is also problematic because of out-of-order execution. `rdtscp` can help for that, but other uses may need a full serializing instruction to make sure `rdtsc` instructions don't pass other insns, and that other insns don't pass it. For profiling, use perf counters. – Peter Cordes Mar 05 '16 at 21:33
-
@PeterCordes Perf counters suck. That is why we need `rdtsc`; why it was broken is a mystery to me. Would it have killed Intel to add an extra timer that runs in/out of sync with the main clock? – Johan Mar 05 '16 at 21:40
-
I haven't usually had a problem putting my microbenchmark in a loop big enough to use perf counters. For really short sequences, you can use something like IACA or manual uop counting (with Agner Fog's tables and uarch guide) to estimate throughput / latency / fused-domain uop count. I guess it would be nice to have a real cycle counter, I can't disagree. IDK how expensive it would be to implement. Probably not very. If I had to choose, though, the low-overhead high-precision timesource is what I'd pick. – Peter Cordes Mar 05 '16 at 21:49
-
@PeterCordes, yes but if I want to *know* the cycles used. I just use RDTSCP making sure the cpu is fully taxed. That way I get timings within 2 CPU-cycles. – Johan Mar 05 '16 at 21:51
-
Related: [How to get the CPU cycle count in x86\_64 from C++?](https://stackoverflow.com/a/51907627) has details on CPUID feature bits relevant to `rdtsc`, and other stuff about how it behaves. (Including `lfence` to serialize the pipeline between it and the timed region). – Peter Cordes Aug 28 '21 at 00:27
-
Fair point that providing a new instruction for reference cycles would have been an even better choice than butchering `rdtsc` (or a new insn to read core cycles, so rdtsc can keep the shorter encoding), because perf counters are inconvenient to program and aren't always available at all in a VM. But outside a VM, you *can* set things up so `rdpmc` works a lot like `rdtsc` used to, if you have ECX = the index of a counter you programmed to be counting the `cpu_clk_unhalted.thread` event, or of the fixed counter that always counts that event (if it's enabled at all) – Peter Cordes Aug 28 '21 at 00:29