
I am new to systems programming, and I have some questions about how to use __rdtsc.

Here is a quote from Microsoft Learn:

Generates the rdtsc instruction, which returns the processor time stamp. The processor time stamp records the number of clock cycles since the last reset.

Is it good practice to use the following code to measure the CPU cycles of a given operation/function?

#include <x86intrin.h> 

void func() {
    unsigned long long start, end;
    start = __rdtsc();

    // Function call here    

    end = __rdtsc();
    unsigned long long cycles = end - start; 
}

Between start and end, is it possible that CPU switches to another process so that there are some extra CPU cycles recorded in addition to the intended function call? If so, how can I measure it precisely?

Peter Cordes
chenzhongpu
  • Since every system is a multi-tasking one, how to measure it in practice? – chenzhongpu Apr 20 '23 at 11:56
  • Not *every* system is multi-tasking... :) There are plenty of non-multitasking embedded systems. And during OS startup there could be plenty of (CPU) time before multi-tasking is enabled to use `rdtsc` in a meaningful way. – Some programmer dude Apr 20 '23 at 12:01
  • This is to estimate the real time elapsed. E.g. to recognize timeout. – i486 Apr 20 '23 at 12:01
  • The time stamp counter does not count CPU cycles. It changed many years ago. In modern processors, it serves as a wall clock timer, per *Intel 64 and IA-32 Architectures Software Developer’s Manual*, December 2017, Volume 3B, 17.17, page 17-41, “Constant TSC behavior ensures that the duration of each clock tick is uniform and supports the use of the TSC as a wall clock timer even if the processor core changes frequency. This is the architectural behavior moving forward.” – Eric Postpischil Apr 20 '23 at 12:04
  • @i486 For a simple timeout, on Linux or macOS I recommend you use something like [`clock_gettime`](https://pubs.opengroup.org/onlinepubs/9699919799/functions/clock_getres.html) with a `CLOCK_MONOTONIC` clock (which might well use `rdtsc` internally). Windows probably has something similar, though I don't know about it. I don't think that's a good use case for calling `rdtsc` directly yourself. – Some programmer dude Apr 20 '23 at 12:15
  • @Someprogrammerdude Timeout is not my main concern. "However, if you are benchmarking computational tasks where you avoid disk and network accesses and where you only access a few pages of memory, then the time elapsed is often not ideal because it can vary too much from run to run and it provides too little information." From [Counting cycles and instructions on the Apple M1 processor](https://lemire.me/blog/2021/03/24/counting-cycles-and-instructions-on-the-apple-m1-processor/) – chenzhongpu Apr 20 '23 at 12:24
  • @Someprogrammerdude I think `__rdtsc` is Microsoft (Windows) specific. No need to comment on Linux or macOS. – i486 Apr 20 '23 at 12:29
  • @i486 No, it is available on the x86 architecture in general. (I have tested it on my Manjaro desktop.) – chenzhongpu Apr 20 '23 at 12:30
  • @Someprogrammerdude `clock_gettime()` with `CLOCK_THREAD_CPUTIME_ID` should be appropriate. – dimich Apr 20 '23 at 13:22
  • @chenzhongpu: `__rdtsc()` is also just elapsed time, not core clock cycles. Look at Eric Postpischil's comment before yours. (And [How to get the CPU cycle count in x86\_64 from C++?](https://stackoverflow.com/q/13772567)) – Peter Cordes Apr 21 '23 at 08:22
  • @i486: See [How to get the CPU cycle count in x86\_64 from C++?](https://stackoverflow.com/q/13772567) for the right headers on MSVC vs. other compilers which define an `__rdtsc()` intrinsic. – Peter Cordes Apr 21 '23 at 08:23
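
The header difference mentioned in the last comment can be sketched as follows. This is a minimal sketch, assuming an x86 or x86-64 target; `read_tsc` is a hypothetical helper name used only for illustration. MSVC declares `__rdtsc()` in `<intrin.h>`, while GCC and Clang declare it in `<x86intrin.h>`. As the comments above point out, on modern processors the returned value is a count of constant-rate reference ticks, not core clock cycles.

```c
#include <stdint.h>

#ifdef _MSC_VER
#include <intrin.h>     /* MSVC declares __rdtsc() here */
#else
#include <x86intrin.h>  /* GCC and Clang declare __rdtsc() here */
#endif

/* read_tsc is a hypothetical helper name, not part of any library.
   It returns the current time stamp counter value (reference ticks). */
static inline uint64_t read_tsc(void)
{
    return __rdtsc();
}
```

Two calls to `read_tsc()` can then bracket a region of code in the same way as the snippet in the question, with the same caveats about context switches and out-of-order execution.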

1 Answer


Between start and end, is it possible that CPU switches to another process so that there are some extra CPU cycles recorded in addition to the intended function call?

Yes.

The time stamp counter is a global, monotonically increasing counter. It keeps counting no matter which process is running.

On a multitasking system, `__rdtsc` is not a reliable or accurate way to benchmark anything that might take more than a few operations.

It also doesn't give true results because `rdtsc` doesn't wait for the pipeline to clear, so some of the instructions you want to benchmark might not have finished yet when the second call executes.


If you want to benchmark a function, it's better to find a clock with high enough resolution to fit your needs that is bound to your process's execution and not global (so not a wall clock like the one used by the `clock` function on Windows).
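
As an illustration of a clock bound to your own execution, here is a minimal sketch that assumes a POSIX system which supports `CLOCK_THREAD_CPUTIME_ID` (the clock suggested in the comments on the question). The names `thread_cpu_ns`, `work`, and `time_work_ns` are hypothetical and exist only for this example. The clock advances only while the calling thread is actually running on a CPU, so time spent descheduled is not counted.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: CPU time consumed so far by the calling thread,
   in nanoseconds, or -1 if the clock is not available. */
static int64_t thread_cpu_ns(void)
{
    struct timespec ts;
    if (clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts) != 0)
        return -1;
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Hypothetical workload standing in for the function under test. */
static uint64_t work(uint64_t n)
{
    volatile uint64_t sum = 0;  /* volatile keeps the loop from being optimized away */
    for (uint64_t i = 0; i < n; i++)
        sum += i;
    return sum;
}

/* Returns the thread CPU time spent in work(n) in nanoseconds,
   or -1 if the clock could not be read. */
static int64_t time_work_ns(uint64_t n)
{
    int64_t start = thread_cpu_ns();
    (void)work(n);
    int64_t end = thread_cpu_ns();
    if (start < 0 || end < 0)
        return -1;
    return end - start;
}
```

The result is CPU time rather than a cycle count, and the call overhead of `clock_gettime` is much larger than that of `rdtsc`, so this approach suits regions that run for microseconds or longer rather than a handful of instructions.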

Some programmer dude
  • I noticed Lemire released [some code](https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2023/03/21) for benchmarking on Linux and Arm. Is it reliable? – chenzhongpu Apr 20 '23 at 12:01
  • @chenzhongpu It seems to be using kernel-specific performance counters, so it might be. – Some programmer dude Apr 20 '23 at 12:06
  • @chenzhongpu: If your timed region is so small that RDTSC timing overhead and out-of-order exec are big problems, it's normally better to put your code in a repeat loop and take the average (like Google::Benchmark does), provided you're careful to check that the compiler isn't hoisting some of your work out of the loop. Using a function call with much higher overhead isn't better! See also [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987). And re: RDTSC details, [How to get the CPU cycle count in x86\_64 from C++?](https://stackoverflow.com/q/13772567) – Peter Cordes Apr 20 '23 at 12:13
  • The normal state of affairs is that many instructions are in flight in the CPU's out-of-order execution machinery. Often what matters is the throughput cost of your code, not the end-to-end latency of a small snippet that the CPU can overlap with surrounding code. So anything like `rdtscp(&dummy)` or `_mm_lfence();` `__rdtsc()` to wait for all earlier instructions to complete will defeat that, and is mostly useful for microarchitectural experiments, not for a very short-duration piece of code you're optimizing. (Unless you're using that to time a repeat loop.) – Peter Cordes Apr 20 '23 at 12:16
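
The repeat-loop approach from the comments above might look roughly like this. It is a minimal sketch, assuming GCC or Clang on x86-64 (where `_mm_lfence` is always available); `tsc_fenced`, `work`, and `avg_ticks_per_call` are hypothetical names for this example. The fences are placed only around the two counter reads that bracket the whole loop, not around each iteration, and the result is in TSC reference ticks rather than core clock cycles.

```c
#include <stdint.h>
#include <x86intrin.h>  /* GCC/Clang; MSVC would use <intrin.h> instead */

/* Hypothetical helper: read the TSC once earlier instructions have completed. */
static inline uint64_t tsc_fenced(void)
{
    _mm_lfence();            /* wait for earlier instructions to complete */
    uint64_t t = __rdtsc();
    _mm_lfence();            /* keep later instructions from starting before the read */
    return t;
}

/* Hypothetical code under test. */
static uint64_t work(uint64_t x)
{
    return x * 2654435761u + 1;
}

/* Average TSC ticks per call of work() over `iters` calls.
   The final value is stored through result_out so the loop is not dead code. */
static double avg_ticks_per_call(uint64_t iters, uint64_t *result_out)
{
    uint64_t acc = 0;
    uint64_t start = tsc_fenced();
    for (uint64_t i = 0; i < iters; i++)
        acc = work(acc + i);  /* each call depends on the previous result */
    uint64_t end = tsc_fenced();
    *result_out = acc;
    return (double)(end - start) / (double)iters;
}
```

Because each iteration feeds its result into the next, this particular loop measures something closer to latency than throughput; independent inputs would be needed for a throughput figure. It is also worth inspecting the generated assembly to confirm that the compiler has not hoisted or folded the work, as the comment above warns.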