Your results here are probably measurement noise and/or frequency scaling, since you start the timer for the 2nd interval right after `printf` returns from making a system call.
RDTSC counts reference cycles, not core clock cycles, so you're mostly just discovering the CPU frequency. (Lower core clock speed = more reference cycles for the same number of core clocks to run two `rdtsc` instructions.) Your RDTSC instructions are basically back-to-back; the `nop` instructions are negligible compared to the number of uops that `rdtsc` itself decodes to (on normal CPUs, including your Broadwell).
Also, RDTSC can be reordered by out-of-order execution. Not that `nop` does anything the CPU would have to wait for; it's just delaying the front-end by 0.25 or 1.75 cycles from issuing the uops of the 2nd `rdtsc`. (Actually I'm not sure whether the microcode sequencer can send uops in the same cycle as a uop from another instruction, so maybe 1 or 2 cycles.)
My answer on How to get the CPU cycle count in x86_64 from C++? has a bunch of background on how RDTSC works.
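For concreteness, here's roughly what an `lfence`-fenced version of that measurement looks like with GCC/Clang intrinsics (a sketch, not your exact code). The fences keep out-of-order exec from reordering the `rdtsc`s around the NOPs, but the delta is still dominated by the cost of `rdtsc` itself, scaled by the core-clock / reference-clock ratio:

```c
#include <stdio.h>
#include <x86intrin.h>   // __rdtsc, _mm_lfence

int main(void)
{
    // Fence so OoO exec can't reorder the rdtscs around the NOPs.
    _mm_lfence();
    unsigned long long t0 = __rdtsc();
    _mm_lfence();
    asm volatile("nop\n\tnop\n\tnop\n\tnop");   // a few NOPs: well under a cycle of front-end work
    _mm_lfence();
    unsigned long long t1 = __rdtsc();
    _mm_lfence();

    // The delta is mostly the overhead of rdtsc itself, in reference cycles.
    printf("delta = %llu reference cycles\n", t1 - t0);
}
```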
You might want the `pause` instruction. It idles for ~100 core clock cycles on Skylake and later, or ~5 cycles on earlier Intel cores. Or spin on `pause` + `rdtsc`. How to calculate time for an asm delay loop on x86 linux? shows a possibly-useful delay spinloop that sleeps for a given number of RDTSC counts. You need to know the reference clock speed to correlate that with nanoseconds, but it's typically around the rated max non-turbo clock on Intel CPUs, e.g. 4008 MHz on a 4.0 GHz Skylake.
If available, `tpause` takes a TSC timestamp as the wake-up time. (See the link.) But it's only available on low-power Tremont for now.
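If you do have it, usage via the intrinsic looks roughly like this; a sketch assuming GCC/Clang with WAITPKG support enabled (`-mwaitpkg`), and the tick count is just an example:

```c
#include <x86intrin.h>   // __rdtsc, _tpause (compile with -mwaitpkg)

// Sleep until the TSC reaches `deadline`.
// ctrl = 1 requests the C0.1 state (faster wakeup);
// ctrl = 0 requests C0.2 (more power saving, slower wakeup).
static void sleep_until_tsc(unsigned long long deadline)
{
    _tpause(1, deadline);
}

// e.g. sleep_until_tsc(__rdtsc() + 400);   // ~100 ns at a 4008 MHz TSC
```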
Inserting NOPs is never going to work reliably on modern superscalar / out-of-order x86 with huge reorder buffers! Modern x86 isn't a microcontroller where you can calculate iterations for a nested delay loop. If surrounding code doesn't bottleneck on the front-end, OoO exec is just going to hide the cost of feeding your NOPs through the pipeline.
Instructions don't have a cost you can just add up. To model the cost of an instruction, you need to know its latency, front-end uop count, and which back-end execution ports it needs, plus any special effects on the pipeline, like `lfence` waiting for all previous uops to retire before later ones can issue. See How many CPU cycles are needed for each assembly instruction?
See also What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
Note that your desired "sleep" time of ~100 ns isn't necessarily even long enough to drain the out-of-order execution buffer (the ROB) if there are cache misses in flight, or possibly even a very slow ALU dependency chain. (The latter is unlikely outside of artificial cases.) So you probably don't want to do anything like `lfence`.