19

I'm doing some Linux kernel timings, specifically in the interrupt-handling path. I've been using RDTSC for the timings, but I recently learned it's not necessarily accurate, since the instructions could be executing out of order.

I then tried:

  1. RDTSC + CPUID (in reverse order, here) to flush the pipeline, and incurred up to a 60x overhead (!) on a Virtual Machine (my working environment) due to hypercalls and whatnot. This is both with and without HW Virtualization enabled.

  2. Most recently I've come across the RDTSCP* instruction, which seems to do what RDTSC+CPUID did, but more efficiently since it's a newer instruction: only a 1.5x-2x overhead, by comparison.

My question: Is RDTSCP truly accurate as a point of measurement, and is it the "correct" way of doing the timing?

Also, to be more clear, my timing essentially works like this internally:

  • Save the current cycle counter value
  • Perform one type of benchmark (i.e.: disk, network)
  • Add the delta of the current and previous cycle counter to an accumulator value and increment a counter, per individual interrupt
  • At the end, divide the accumulator by the number of interrupts to get the average cycle cost per interrupt (roughly as in the sketch below).
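In rough C, the bookkeeping looks something like this (illustrative names only, not my actual kernel code):

#include <stdint.h>

extern uint64_t read_cycle_counter(void);   /* hypothetical: rdtsc/rdtscp snapshot */
extern void handle_one_interrupt(void);     /* hypothetical: the work being timed  */

static uint64_t acc;    /* accumulated cycles over all interrupts */
static uint64_t count;  /* number of interrupts measured          */

static void timed_interrupt(void)
{
    uint64_t start = read_cycle_counter();
    handle_one_interrupt();
    acc   += read_cycle_counter() - start;
    count += 1;
}

/* At the end of the benchmark: average cycles per interrupt = acc / count. */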

*http://www.intel.de/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf page 27

Ricky Mutschlechner
  • Doesn't `CPUID` implement some kind of full memory barrier? That would be noticeably expensive. – Kerrek SB Dec 29 '14 at 17:29
  • @KerrekSB I think the overhead is from the hypercalls, simply because if I run the same test on a non-virtualized environment there's almost no overhead from CPUID. – Ricky Mutschlechner Dec 29 '14 at 17:30
  • Is it correct? Well, according to the white paper it's not perfect, but it's the best thing you have. Note that if your measured section is long enough, maybe out-of-order effects are negligible. – Leeor Dec 29 '14 at 20:18
  • CPUID unconditionally triggers a VMExit, which should incur a few thousand cycle cost when in virtualization. – ruthafjord May 28 '19 at 22:12

4 Answers

19

A full discussion of the overhead you're seeing from the cpuid instruction is available at this stackoverflow thread. When using rdtsc alone, you need cpuid (or a fence) beforehand to ensure that no earlier instructions are still in the execution pipeline when the timestamp is read. The rdtscp instruction does that waiting itself: it is not executed until all previous instructions have completed and all previous loads are globally visible (note that it does not prevent subsequent instructions from beginning execution before the read; see the comments and the other answers about adding an lfence after it). (The referenced SO thread also discusses these salient points, but I addressed them here because they're part of your question as well.)

You only "need" to use cpuid+rdtsc if your processor does not support rdtscp. Otherwise, rdtscp is what you want, and will accurately give you the information you are after.

Both instructions read a 64-bit, monotonically increasing counter; note that on modern processors this counter ticks at a constant rate rather than tracking the actual core clock (see the comments below and Z boson's answer). If this is your pattern:

uint64_t s, e;
s = rdtscp();
do_interrupt();
e = rdtscp();

atomic_add(e - s, &acc);
atomic_add(1, &counter);

You may still have an off-by-one in your average measurement depending on where your read happens. For instance:

   T1                              T2
t0 atomic_add(e - s, &acc);
t1                                 a = atomic_read(&acc);
t2                                 c = atomic_read(&counter);
t3 atomic_add(1, &counter);
t4                                 avg = a / c;

It's unclear whether "[a]t the end" references a time that could race in this fashion. If so, you may want to calculate a running average or a moving average in-line with your delta.
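For instance, a minimal sketch of folding each delta into a running mean right where it is measured (illustrative names only), so the reporting side only ever reads a single value:

#include <stdint.h>

static uint64_t n;     /* interrupts measured so far          */
static double   avg;   /* running mean, cycles per interrupt  */

/* Called with each measured delta; updates the mean incrementally. */
static void record_delta(uint64_t delta)
{
    n++;
    avg += ((double)delta - avg) / (double)n;
}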

Side-points:

  1. If you do use cpuid+rdtsc, you need to subtract out the cost of the cpuid instruction, which may be difficult to ascertain if you're in a VM (depending on how the VM implements this instruction). This is really why you should stick with rdtscp.
  2. Executing rdtscp inside a loop is usually a bad idea. I somewhat frequently see microbenchmarks that do things like


for (int i = 0; i < SOME_LARGEISH_NUMBER; i++) {
   s = rdtscp();
   loop_body();
   e = rdtscp();
   acc += e - s;
}

printf("%"PRIu64"\n", (acc / SOME_LARGEISH_NUMBER / CLOCK_SPEED));

While this will give you a decent idea of the overall performance in cycles of whatever is in loop_body(), it defeats processor optimizations such as pipelining. In microbenchmarks, the processor will do a pretty good job of branch prediction in the loop, so measuring the loop overhead is fine. Doing it the way shown above is also bad because you end up with 2 pipeline stalls per loop iteration. Thus:

s = rdtscp();
for (int i = 0; i < SOME_LARGEISH_NUMBER; i++) {
   loop_body();
}
e = rdtscp();
printf("%"PRIu64"\n", ((e-s) / SOME_LARGEISH_NUMBER / CLOCK_SPEED));

This will be more efficient and probably more accurate in terms of what you'll see in real life than what the previous benchmark would tell you.
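As a final note, the rdtscp() in the snippets above is pseudocode; one way to provide such a helper (a sketch assuming GCC or Clang and the __rdtscp intrinsic) is:

#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp */

/* Sketch of the rdtscp() helper used above. __rdtscp also reports the
   IA32_TSC_AUX value (which logical CPU the read happened on); it is
   simply discarded here. */
static inline uint64_t rdtscp(void)
{
    unsigned int aux;
    return __rdtscp(&aux);
}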

dho
  • Are you sure about "Both instructions provide you with a 64-bit, monotonically increasing counter that represents the number of cycles on the processor"? I thought the instructions were ignorant to dynamic underclocking (e.g. Intel's SpeedStep) and dynamic overclocking (e.g. Intel's Turbo Boost) and therefore they don't necessarily report the number of cycles. They keep counting at the same rate even though the frequency dynamically changes. – Z boson Dec 30 '14 at 12:27
  • @Zboson that was my understanding, too. – Ricky Mutschlechner Dec 30 '14 at 15:32
  • You're right, and that's poorly worded anyway since the instructions provide you with a snapshot of the counter value, not the counter itself. – dho Dec 30 '14 at 16:30
  • Also, I am definitely doing it the 2nd way after reading the whitepaper and whatnot. Thanks for the insight! – Ricky Mutschlechner Jan 05 '15 at 01:16
  • @Zboson If both rdtsc and rdtscp are ignorant to dynamic frequency scaling, what's the frequency for increasing the counter? In presence of frequency scaling, is there anyway to get the actual cycle count? – Rakib Jun 12 '15 at 14:44
  • @Rakib, I have not looked into this for a while but if you look at my answer below I think I think you will find your answer. I infer the frequency using method 3 in my answer. To directly get the cycle count can be a pain. – Z boson Jun 15 '15 at 17:46
  • @Zboson Thanks for your reply. Yes. Method 3 is very useful for finding the correct frequency or close to the accurate one. Can you suggest anything that could help me get the actual cycle counts? – Rakib Jun 17 '15 at 15:30
  • @Rakib, `frequency = cycles/second` so if you know the frequency and the time you can get the cycles. You can use method three to estimate/infer the cycles. If you want to know the exact number then you have to read performance counters I think. That can be a bit of pain I think. – Z boson Jun 18 '15 at 07:43
  • @Zboson Yes. I actually need the exact cycle count in presence of frequency scaling. If rdtsc is ignorant to scaling, shouldn't performance counters be ignorant to it too? (I am not sure how rdtsc works, but I thought it was internally reading such counters.) – Rakib Jun 18 '15 at 13:56
  • You can use http://www.intel.com/software/pcm, but it requires BIOS support to get at the counters. – dho Jun 18 '15 at 16:54
  • @Rakib, if you need the exact count then the simple solution is method 1 I described. Go into your BIOS or UEFI and disable frequency scaling. Not every BIOS or UEFI allows you to do this. But then you lose turbo and your processor runs at a nominal frequency all the time. It's sorta like disabling the throttle in a car. Your engine would never idle at stop lights and would not have turbo either. It would always run at the same RPM. You don't want to run your computer like this all the time. Method 2 fixes this but then you have to install a device driver. – Z boson Jun 19 '15 at 07:40
  • @Rakib, I'm also not sure, even if you disable frequency scaling in the BIOS/UEFI, that it never scales anymore. There may be safety features in your processor that kick in, which you can't disable, which scale the frequency down if the processor gets too hot. So I'm not sure how reliable method 1 is. So far for me it's worked fine but I did not test it on an overclocked system or on a system that gets really hot like [Gigabyte Brix](http://www.anandtech.com/show/7648/gigabyte-brix-pro/4) which does throttle when it gets too hot. – Z boson Jun 19 '15 at 07:44
  • @dho Thank you for your reply. I actually need solutions for all archs x86 (even from AMD), ARM etc. that can be done without accessing the BIOS. – Rakib Jun 19 '15 at 18:24
  • @Zboson Yes. So far, disabling it from BIOS worked for me, too (which is actually of no use, since you can get a rough count from the time spent and the known frequency that you set from BIOS). But I just wanted to know if there is any reliable way to get cycle count in presence of scaling. – Rakib Jun 19 '15 at 18:24
  • I'm not sure this is correct. The Intel whitepaper explicitly states `rdtscp` lets instructions afterwards start executing before it completes which is why they pair it with a `cpuid` afterwards. With this approach instructions after the second `rdtscp` can be moved in front of it, which I then assume can contend for CPU resources with the instructions you are trying to measure? – Joseph Garvin Aug 09 '21 at 19:05
  • I think you're pretty sure it's incorrect :). I agree. Will edit. – dho Aug 10 '21 at 20:54
8

The 2010 Intel paper How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures can be considered outdated when it comes to its recommendation to combine RDTSC/RDTSCP with CPUID.

Current Intel reference documentation recommends fencing instructions as more efficient alternatives to CPUID:

Note that the SFENCE, LFENCE, and MFENCE instructions provide a more efficient method of controlling memory ordering than the CPUID instruction.

(Intel® 64 and IA-32 Architectures Software Developer’s Manual: Volume 3, Section 8.2.5, September 2016)

If software requires RDTSC to be executed only after all previous instructions have executed and all previous loads and stores are globally visible, it can execute the sequence MFENCE;LFENCE immediately before RDTSC.

(Intel RDTSC)

Thus, to get the TSC start value you execute this instruction sequence:

mfence
lfence
rdtsc
shl     rdx, 0x20
or      rax, rdx

At the end of your benchmark, to get the TSC stop value:

rdtscp
lfence
shl     rdx, 0x20
or      rax, rdx

Note that, in contrast to CPUID, the lfence instruction doesn't clobber any registers, so it isn't necessary to save the EDX:EAX registers before executing the fence.

Relevant documentation snippet:

If software requires RDTSCP to be executed prior to execution of any subsequent instruction (including any memory accesses), it can execute LFENCE immediately after RDTSCP (Intel RDTSCP)

As an example of how to integrate this into a C program, see also my GCC inline assembler implementations of the above operations.
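Purely as an illustration (a sketch, not the linked implementation), such wrappers might look roughly like this in GNU C inline assembly:

#include <stdint.h>

/* Start timestamp: mfence+lfence order earlier loads and stores before
   the TSC is read. */
static inline uint64_t fenced_rdtsc(void)
{
    uint64_t lo, hi;
    __asm__ __volatile__ ("mfence\n\t"
                          "lfence\n\t"
                          "rdtsc"
                          : "=a" (lo), "=d" (hi) : : "memory");
    return (hi << 32) | lo;
}

/* Stop timestamp: rdtscp waits for earlier instructions; the trailing
   lfence keeps later instructions out of the timed region. rdtscp also
   writes ECX, hence the extra clobber. */
static inline uint64_t fenced_rdtscp(void)
{
    uint64_t lo, hi;
    __asm__ __volatile__ ("rdtscp\n\t"
                          "lfence"
                          : "=a" (lo), "=d" (hi) : : "rcx", "memory");
    return (hi << 32) | lo;
}

Usage is then along the lines of start = fenced_rdtsc(); code_to_benchmark(); stop = fenced_rdtscp(); duration = stop - start;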

maxschlepzig
  • If you're using `rdtsc` / `rdtscp` twice, you do need to save the first result outside of RDX:RAX. e.g. instead of `or`, use `lea rcx, [rax + rdx]` to merge and write the result to a different register. – Peter Cordes Sep 28 '19 at 12:55
  • @PeterCordes the above code is meant to be put into 2 separate functions which are used like - say - this: `start = fenced_rdtsc(); to_be_benchmarked_code(); stop = fenced_rdtscp(); duration = stop-start;` – maxschlepzig Sep 28 '19 at 13:05
  • Oh. IDK why you'd want call/ret overhead inside your timed region if you're going to the trouble of using `rdtsc` directly for high precision instead of just using a repeat-loop to amortize measurement error into negligibility. But ok. I would have defined those as CPP macros using a GNU C inline asm statement, using LEA to let the first one use an `"=r"` output operand. Or since I know the delta is going to be far less than 2^32, just ignore the high half of the output and declare a clobber on `rdx` and an `"=a"` output. High bits of input don't affect the low 32 of the output for sub. – Peter Cordes Sep 28 '19 at 22:27
  • @PeterCordes No need to use a CPP macro - of course you 'convince' the compiler to inline those functions and then you don't have call/ret overhead and it's a safer/better maintainable alternative to a CPP macro. ([see also how I implemented that approach in a project of mine](https://github.com/gsauthof/osjitter/blob/4545f212f99c0b232b64c175f4d45a51ea23f954/osjitter.c#L556-L612)) – maxschlepzig Sep 29 '19 at 08:34
  • I assumed you meant your functions were written in pure asm, not inline asm, probably because your answer is written with hard-coded regs etc. If you're not doing that, you might as well use intrinsics all the way for `_mm_lfence()`, `_mm_mfence()`, and `_mm_rdtsc()`. If you were using inline asm, then the compiler is going to want to choose non-conflicting registers for the two `asm` statements. Or insert an extra `mov` into the timed region after inlining both statements into their caller, if they both use `"=a"` and clobber RDX. – Peter Cordes Sep 29 '19 at 08:42
  • @PeterCordes the disadvantage when using `__rdtsc()` etc. intrinsics is that you then don't control the exact instruction ordering/selection anymore. For example where the `EDX:EAX` result of `RDTSC*` is merged. When computing the TSC stop value with `_rdtscp()` you'll get something like ` rdtscp; shl rdx, 0x20;or rax, rdx; lfence;` instead of `rdtscp; lfence; shl rdx, 0x20; or rax, rdx`. Or worse. And with the `__rdtscp()` intrinsic you also have to specify an output variable for `ECX` which isn't eliminated by the compiler and thus you get an extra `mov` of `ECX` to the stack. – maxschlepzig Sep 29 '19 at 09:18
  • The spill of ECX is after the timed region; you only use RDTSCP at the bottom. But yeah, that's a weird gcc missed optimization; clang doesn't do that. https://godbolt.org/z/jhrjI0. clang also avoids merging RDX into RAX if you only return a 32-bit integer. Anyway, I don't see the problem with having the merge before the `lfence`; those instructions are dependent on RDTSCP so they can't execute before it anyway. The lfence *after* rdtscp is just to stop later work from sneaking into the timed region, I think. RDTSCP itself is a one-way barrier which is why you don't need an lfence first. – Peter Cordes Sep 29 '19 at 09:28
  • `lfence` only makes sense to me if you use `rdtscp` at the beginning of the timed region, to make sure no _subsequent_ instruction that you want to time, starts execution prior to `rdtscp` getting a timestamp: "If software requires RDTSCP to be executed prior to execution of any subsequent instruction". – kavadias Mar 20 '23 at 08:01
  • I don't see why `rdtscp` at the end of the timed region would need an `lfence` instruction. `rdtscp` will execute after to any _previous_ instructions in program order (if placed in `asm volatile` you can instruct that order): "[rdtscp] does wait until all previous instructions have executed and all previous loads are globally visible." An instruction after the timed region will only execute before an `rdtscp` at the end, if `rdtscp` cannot (waits prior instructions). – kavadias Mar 20 '23 at 08:25
6

Is RDTSCP truly accurate as a point of measurement, and is it the "correct" way of doing the timing?

Modern x86 CPUs can dynamically adjust the frequency to save power by underclocking (e.g. Intel's SpeedStep) and to boost performance under heavy load by overclocking (e.g. Intel's Turbo Boost). The time stamp counter on these modern processors, however, counts at a constant rate (look for the "constant_tsc" flag in Linux's /proc/cpuinfo).

So the answer to your question depends on what you really want to know. Unless dynamic frequency scaling is disabled (e.g. in the BIOS), the time stamp counter can no longer be relied on to determine the number of cycles that have elapsed. However, the time stamp counter can still be relied on to determine the time that has elapsed (with some care; I use clock_gettime in C, see the end of my answer).
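For example (a rough sketch, assuming a constant_tsc processor and the __rdtsc intrinsic), the TSC rate can be calibrated once against clock_gettime and then used to convert TSC deltas into seconds:

#include <stdint.h>
#include <time.h>
#include <x86intrin.h>   /* __rdtsc */

/* Estimate the (constant) TSC rate by comparing it against CLOCK_MONOTONIC
   over a ~100 ms window. Note this is the TSC rate, not the current core
   frequency. Afterwards: seconds = tsc_delta / tsc_hz(). */
static double tsc_hz(void)
{
    struct timespec a, b;
    struct timespec delay = { .tv_sec = 0, .tv_nsec = 100 * 1000 * 1000 };

    clock_gettime(CLOCK_MONOTONIC, &a);
    uint64_t t0 = __rdtsc();
    nanosleep(&delay, NULL);
    uint64_t t1 = __rdtsc();
    clock_gettime(CLOCK_MONOTONIC, &b);

    double secs = (double)(b.tv_sec - a.tv_sec)
                + (double)(b.tv_nsec - a.tv_nsec) * 1e-9;
    return (double)(t1 - t0) / secs;
}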

To benchmark my matrix multiplication code and compare it to the theoretical best I need to know both the time elapsed and the cycles elapsed (or rather the effective frequency during the test).

Let me present three different methods to determine the number of cycles elapsed.

  1. Disable dynamic frequency scaling in the BIOS and use the time stamp counter.
  2. For Intel processors request the core clock cycles from the performance monitor counter.
  3. Measure the frequency under load.

The first method is the most reliable but it requires access to BIOS and affects the performance of everything else you run (when I disable dynamic frequency scaling on my i5-4250U it runs at a constant 1.3 GHz instead of a base of 2.6 GHz). It's also inconvenient to change the BIOS only for benchmarking.

The second method is useful when you don't want to disable dynamic frequency scaling and/or for systems you don't have physical access to. However, the performance monitor counters require privileged instructions, which only the kernel or device drivers have access to.

The third method is useful on systems where you don't have physical access and do not have privileged access. This is the method I use most in practice. It's in principle the least reliable but in practice it's been as reliable as the second method.
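On Linux, one way to read the core-cycle counter for the second method without writing your own device driver is the perf_event_open system call; a rough, minimal sketch (error handling mostly omitted, and foo() is just a stand-in for the code under test):

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static void foo(void)                        /* stand-in for the code under test */
{
    volatile double x = 0.0;
    for (int i = 0; i < 1000000; i++) x += i;
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type           = PERF_TYPE_HARDWARE;
    attr.size           = sizeof attr;
    attr.config         = PERF_COUNT_HW_CPU_CYCLES;  /* actual core cycles */
    attr.disabled       = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv     = 1;

    /* Count cycles for this thread on whatever CPU it runs on. */
    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    foo();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t cycles = 0;
    read(fd, &cycles, sizeof cycles);
    printf("core cycles: %llu\n", (unsigned long long)cycles);
    close(fd);
    return 0;
}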

Here is how I determine the time elapsed (in seconds) with C.

#include <time.h>   /* clock_gettime, struct timespec */

#define TIMER_TYPE CLOCK_REALTIME

double time_diff(struct timespec start, struct timespec end);

struct timespec time1, time2;
clock_gettime(TIMER_TYPE, &time1);
foo();                                   /* code under test */
clock_gettime(TIMER_TYPE, &time2);
double dtime = time_diff(time1, time2);

double time_diff(struct timespec start, struct timespec end)
{
    struct timespec temp;
    if ((end.tv_nsec - start.tv_nsec) < 0) {
        temp.tv_sec  = end.tv_sec - start.tv_sec - 1;
        temp.tv_nsec = 1000000000 + end.tv_nsec - start.tv_nsec;
    } else {
        temp.tv_sec  = end.tv_sec - start.tv_sec;
        temp.tv_nsec = end.tv_nsec - start.tv_nsec;
    }
    return (double)temp.tv_sec + (double)temp.tv_nsec * 1E-9;
}
Z boson
  • This looks solid also. The thing is, I actually need the time passed in CPU cycles as I'm doing a microbenchmark on something that's supposed to run in <100cyc once we're done optimizing. I'm also running it in a virtual machine that ( I believe ) has a set frequency for the processor. I'll double check and edit this accordingly. – Ricky Mutschlechner Dec 30 '14 at 15:33
  • @RickyMutschlechner, yeah, by "timing" I suspected that you might have meant cycles and not time. I don't know about doing this on a virtual machine. I highly doubt you could use the third method I suggested. I'm even skeptical of the other methods because you have to assume the virtual machine implements the hardware timing in the same way (same latency and throughput) and this is not even fully documented. But I have little experience with virtual machines. My gut says nothing beats testing on real hardware. Doing this on a virtual machine is an interesting question. – Z boson Jan 02 '15 at 10:00
4

The following code will ensure that rdtscp kicks in at exactly the right time. RDTSCP cannot execute too early, but it can execute too late, because the CPU can start executing the instructions that follow rdtscp before it.

In order to prevent this, we create a false dependency chain based on the fact that rdtscp puts its output in edx:eax:

rdtscp       ;rdtscp is read-serialized, it will not execute too early.
;also ensure it does not execute too late
mov r8,rdx   ;rdtscp changes rdx and rax, force dependency chain on rdx
xor r8,rbx   ;save rbx via xor (instead of push) so the save depends on rdtscp's output
xor rbx,rdx  ;rbx=r8
xor rbx,r8   ;rbx = 0
push rdx
push rax
mov rax,rbx  ;rax = 0, but in a way that excludes OoO execution.
cpuid
pop rax
pop rdx
mov rbx,r8
xor rbx,rdx  ;restore rbx

Note that even though this timing is accurate to a single cycle, you still need to run your sample many, many times and take the lowest time of those runs in order to get the actual running time.
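For instance (a sketch with a hypothetical measure_once() wrapping the sequence above):

#include <stdint.h>

/* Repeat the measurement and keep the fastest run, i.e. the one least
   disturbed by interrupts, page faults, etc. */
static uint64_t min_cycles(uint64_t (*measure_once)(void), int runs)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t t = measure_once();
        if (t < best)
            best = t;
    }
    return best;
}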

Johan