
For a benchmarking application I want to write a C++20 function template like this:

template<size_t n>
void noop() { /* ??? */; }

When instantiated and executed, it:

  1. has no side effects,
  2. takes n nanoseconds to execute, and
  3. won't be optimized out.

This is of course impossible to do exactly, as on some processors there may be no sequence of instructions that deterministically takes exactly n nanoseconds to complete. But what's the best approximation we can do?

For large n (say > 100000) we can just use std::this_thread::sleep_for. Although it only guarantees a delay of at least n, in practice it's usually within 100 microseconds or so.
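Concretely, for the large-n case I mean something like this (just a sketch; the small-n branch is the part I don't know how to write):

#include <chrono>
#include <cstddef>
#include <thread>

template<std::size_t n>
void noop() {
    // Guarantees a sleep of at least n nanoseconds; in practice the
    // overshoot is typically up to ~100 microseconds on desktop OSes.
    std::this_thread::sleep_for(std::chrono::nanoseconds(n));
}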

But what about for small n?

And is there a way to achieve (3) from the function declaration/definition of noop? Is there a way to declare it so that it won't be optimized out?

I'd be curious to know how far we could get with portable, standard-compliant C++20, and also how far we could get if we restricted ourselves to x86-64 and a particular implementation (MSVC, Clang, GCC).

Any ideas? For n between 100 and 100000, what about polling rdtsc? And what about n < 100?
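To make the rdtsc idea concrete, I imagine something along these lines (x86-64 only, using the `__rdtsc` intrinsic; the TSC tick rate below is a made-up placeholder that would have to be calibrated per machine):

#include <cstddef>
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <x86intrin.h>
#endif

// Placeholder value: assumes an invariant TSC ticking at 3 GHz.
// The real rate would have to be measured/calibrated somehow.
constexpr double tsc_ticks_per_ns = 3.0;

template<std::size_t n>
void noop() {
    const std::uint64_t deadline =
        __rdtsc() + static_cast<std::uint64_t>(n * tsc_ticks_per_ns);
    // Spin until the TSC passes the deadline. Reading the TSC itself costs
    // tens of core clock cycles, so the granularity is several ns at best.
    while (__rdtsc() < deadline) {
    }
}

Even then the granularity is limited by the cost of rdtsc itself, which is why I'm asking what to do below that.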

Andrew Tomazos
  • For very small `n` (e.g. 3) it might be impossible. Be aware of the CPU cache – Basile Starynkevitch Jun 28 '22 at 06:16
  • @BasileStarynkevitch: Right, for requirement (2), as stated in the question, it's impossible to do exactly; the question is what's the best/closest approximation. – Andrew Tomazos Jun 28 '22 at 06:21
  • A tight loop reading a clock until you reach your desired duration is about the best you can do on non-real-time operating systems. You can try using the CPU's TSC instead of a clock, but you'll need to do some conversion to get to seconds. – Alan Birtles Jun 28 '22 at 06:30
  • @AlanBirtles: Yup, and it's non-trivial to find out the TSC frequency. At least on modern CPUs, the TSC *is* usable as a clock-source (especially over short intervals where drift without NTP correction isn't an issue), since it's constant frequency and doesn't halt in power-save. And is usually synced across cores on modern systems, although that's not always guaranteed. [How to calculate time for an asm delay loop on x86 linux?](https://stackoverflow.com/q/49924102) shows hand-written asm for an RDTSC deadline spin-wait, and mentions `tpause` to do the same thing on very recent CPUs. – Peter Cordes Jun 28 '22 at 07:49
  • `nop` isn't useful; on an SMT system (2 logical cores per physical core) you don't want to clog up the CPU's execution resources churning through 4 to 6 NOPs per clock, at least for cycles where this logical core has the front-end. If you were going to attempt anything that depended on core clock cycles instead of wall-clock time, you'd want `.loop: dec eax` / `jnz .loop` which all current CPUs execute at 1 iteration per clock cycle. Given a max clock frequency, that would let you sleep for *at least* `n` nanosecs. Of course, if it's short enough, it wouldn't block OoO exec across it! – Peter Cordes Jun 28 '22 at 07:52
  • `rdtsc` itself has a throughput of about 25 (Skylake) to 38 (Zen2) core clock cycles, so your delay granularity is about 6 ns with that strategy on a Skylake-derived CPU if it happens to currently be at 4 GHz. Much more if it happened to be at idle speed like 800 MHz: more like 31 ns. (And you need one RDTSC to find the current time before you can start waiting for the delay.) – Peter Cordes Jun 28 '22 at 07:58
  • Is it really meaningful to delay for that short a time on a deep out-of-order exec CPU? Going off-core for I/O takes much longer than that. It's a fun thought exercise, but anything practical would depend on exactly **why you needed the delay**, like what you were guarding against or hoping to ensure, and how much of a problem it would be if you ever waited too short a time. (e.g. potentially fatal correctness problem for some I/O uses, vs. being ok to step on something sometimes.) – Peter Cordes Jun 28 '22 at 08:00
  • @PeterCordes: The general use case is that I'm benchmarking complex generic algorithms that take inner functions (possibly inlined) as a parameter, so I want to mock those inner function parameters with this `noop` function, and then measure how the algorithms behave as the duration of the inner function varies (among other parameters that vary). These noop functions are getting called millions of times in some cases, so they need to be short. – Andrew Tomazos Jun 28 '22 at 08:09
  • @AndrewTomazos: Oh! So you want to simulate some execution cost, not wall-time at all. For short enough pieces of code (not much larger than the out-of-order exec window size, like a couple hundred instructions thus tens of nanoseconds) that comes in 3 dimensions, none of them time: front-end bandwidth (in uops), latency (clock cycles from an input being ready to an output being read), and back-end port contention (in uops for each port). See [this Q&A](https://stackoverflow.com/q/51607391/224132). Branch mispredicts can also affect the ability of OoO exec to see past/around a function. – Peter Cordes Jun 28 '22 at 08:26
  • For your purposes, probably a delay loop could work, perhaps `for (volatile int i = n; --i >= 0; ){}`. (volatile forces a store/reload, creating a latency bottleneck, thus about 5 cycles per iteration, but OoO exec can see past it if it can predict the loop exit correctly. For Skylake, with this inside another tight loop, that probably happens for n=20 or so.) Or maybe use Google Benchmark for `Benchmark::DoNotOptimize(i)` for non-volatile i, to force GCC/clang to materialize the value in a register, and for clang maybe store it, but not a loop-carried dependency through the store/reload. – Peter Cordes Jun 28 '22 at 08:30
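
A minimal sketch of the delay-loop idea from the last comment, under rough assumptions (the cycles-per-iteration and clock-frequency numbers are guesses that would need per-machine calibration, and this simulates an execution cost rather than tracking wall-clock time):

#include <cstddef>

template<std::size_t n>
void noop() {
    // Rough assumption: ~5 cycles per iteration (store/reload latency) on a
    // ~4 GHz core, i.e. ~1.25 ns per iteration, so about n * 4 / 5 iterations.
    constexpr std::size_t iterations = (n * 4) / 5;
    // The volatile store/reload keeps the loop from being optimized out.
    // (Written as i = i - 1 because ++/-- on volatile is deprecated in C++20.)
    for (volatile std::size_t i = iterations; i > 0; i = i - 1) {
    }
}

With Google Benchmark, the non-volatile variant suggested above would instead use benchmark::DoNotOptimize(i) on the counter, forcing the value to be materialized in a register without the loop-carried store/reload latency chain.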

0 Answers