
Sometimes I need a proper way to measure performance at nanosecond resolution from my user-space application, in order to include syscall delays in my measurement. I've read many old (10-year-old) articles saying it isn't stable and they are gonna remove it from the user space.

  • In 2020, for Intel 8th/9th-generation x86-64 CPUs, how stable is it? Can we still use TSC assembly code safely?
  • What are the best practices for using the TSC in user space nowadays?


    Since you mentioned "Intel 8th/9th generation" I assumed you were talking about x86-64, not the dead / discontinued IA-64 Itanium and edited your question accordingly. – Peter Cordes Apr 26 '20 at 10:26

1 Answer


It's as stable as the clock crystal on your motherboard, but it's locked to a reference frequency (which depends on the CPU model), not the current CPU core clock frequency. That change was about 15 years ago (constant_tsc CPU feature) making it usable for wall-clock timing instead of cycle counting.

For example, the Linux VDSO user-space implementation of clock_gettime uses rdtsc and a scale factor to calculate an offset from the less-frequently-updated timestamp updated by the kernel's timer interrupt. (VDSO = pages of code and data owned by the kernel, mapped read-only into user-space processes.)

What the best practices to use TSC in the user space nowadays?

If you want to count core clock cycles, use rdpmc (with a HW perf counter programmed appropriately and set up so user-space is allowed to read it). Or use perf or some other way of driving HW perf counters.

But other than that, you can use rdtsc directly or indirectly via wrapper libraries.

Depending on your overhead requirements, and how much effort you're willing to put into finding out TSC frequency so you can relate TSC counts to seconds, you might just use it via std::chrono or libc clock_gettime which don't need to actually enter the kernel thanks to the VDSO.

How to get the CPU cycle count in x86_64 from C++? - my answer there has more details about the TSC, including how it worked on older CPUs, and the fact that out-of-order execution means you need lfence before/after rdtsc if you want to wait for earlier code to finish executing before it reads the internal TSC.

Measuring chunks of code shorter than a few hundred instructions introduces the complication that throughput and latency are different things; it's not meaningful to measure performance with just a single number. Out-of-order exec means that the surrounding code matters.

and they are gonna remove it from the user space.

x86 has basically never removed anything, and definitely not from user-space. Backwards compat with existing binaries is x86's main claim to fame and reason for continued existence.

rdtsc is documented in Intel and AMD's x86 manuals, e.g. Intel's vol.2 entry for it. There is a CPU feature that lets the kernel disable RDTSC for user-space (TSD = TimeStamp Disable) but it's not normally used on Linux. (Note the #GP(0) exception condition: "If the TSD flag in register CR4 is set and the CPL is greater than 0". Current Privilege Level 0 = kernel, higher = user-space.)

IDK if there are any plans to use TSD by default; I'd assume not because it's a useful and efficient timesource. Even if so, on a dev machine where you want to do profiling / microbenchmarking you'd be able to toggle that feature. (Although usually I just put stuff in a large-enough repeat loop in a static executable and run it under perf stat to get total time and HW perf counters.)

Peter Cordes
  • The time between 2 measurements doesn't matter. It'll be shorter than 100s instructions but with PCIe IO, the time should be several 100s nanoseconds. `lfence` or `cpuid` as defined in the link I provided? No problem to get the TSC freq. Thanks! – Alexis Apr 26 '20 at 10:47
  • 1
    @Alexis_FR_JP: Yes, you can reliably use `lfence` to serialize execution of `rdtsc` wrt. execution of other instructions. But a store instruction finishes *executing locally* (i.e. retires) before it leaves the store buffer. To time I/O completion you might want `mfence` + `lfence`. (Although on Skylake-derived microarchitectures, it seems `mfence` already includes some kind of lfence-like serialization of exec. But maybe *before* the part that drains the store buffer, so that could possibly reorder with a later lfence.) – Peter Cordes Apr 26 '20 at 11:18
  • 1
    See [Are loads and stores the only instructions that gets reordered?](https://stackoverflow.com/a/50496379) for a perf experiment on SKL, and [Does lock xchg have the same behavior as mfence?](https://stackoverflow.com/q/40409297) – Peter Cordes Apr 26 '20 at 11:18
  • @Alexis_FR_JP - and BTW, next time please put details like that about your use-case in the question. It would have been easier to write this answer if I'd known what kind of thing you wanted to time. – Peter Cordes Apr 26 '20 at 11:40
  • 1
    @Alexis_FR_JP: IDK how else your question could have been answered. It's fairly broad so there's a lot of ground to cover. "yes, you can always use RDTSC" wouldn't make much of an answer :P – Peter Cordes Apr 26 '20 at 11:51