
I am timing a single NOP instruction and a block of several NOP instructions in C++, using rdtsc. However, the measured cycle count doesn't increase in proportion to the number of NOPs executed, and I'm confused as to why. My CPU is an Intel Core i7-5600U @ 2.60 GHz.

Here's the code:

#include <stdio.h>
#include <x86intrin.h>   // for __rdtsc()

int main() {
    unsigned long long t;

    t = __rdtsc();
    asm volatile("nop");
    t = __rdtsc() - t;
    printf("rdtsc for one NOP: %llu\n", t);

    t = __rdtsc();
    asm volatile("nop; nop; nop; nop; nop; nop; nop;");
    t = __rdtsc() - t;
    printf("rdtsc for seven NOPs: %llu\n", t);

}

I am getting values like:

rdtsc for one NOP: 78
rdtsc for seven NOPs: 91

rdtsc for one NOP: 78
rdtsc for seven NOPs: 78

when running without setting processor affinity. When setting processor affinity, e.g. `taskset -c 0 ./nop`, the results are:

rdtsc for one NOP: 78
rdtsc for seven NOPs: 78

rdtsc for one NOP: 130
rdtsc for seven NOPs: 169

rdtsc for one NOP: 78
rdtsc for seven NOPs: 143

Why would this be the case?

fraiser
  • Given that this is x86, and given how you wrote the benchmark, this is not surprising. What is it that you are really trying to do? – old_timer Oct 15 '19 at 01:49
  • I am trying to sleep in the tenth of a microsecond range. But when I use `nanosleep`, with the sleep interval set to even a single nanosecond, the execution of `nanosleep` ends up taking >20000 cycles (according to `rdtsc`). This is why I'm trying to directly cause a very tiny delay to occur using nops. – fraiser Oct 15 '19 at 01:59
  • 2
    You might want the `pause` instruction. It idles for ~100 cycles on Skylake and later, or ~5 cycles on earlier Intel cores. Or spin on RDTSC. Inserting NOPs is never going to work reliably on modern superscalar / out-of-order x86 with huge reorder buffers! Your "sleep" time isn't necessarily even long enough to drain the out-of-order execution buffer (the ROB). Instructions don't have a cost you can just add up. [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](//stackoverflow.com/q/51607391). – Peter Cordes Oct 15 '19 at 02:03
  • 1
    @PeterCordes Spinning on RDTSC works! – fraiser Oct 15 '19 at 02:16
  • You are not going to get a delay to work by timing code like that; you need to use a timer. – old_timer Oct 15 '19 at 14:08

1 Answer


Your results here are probably measurement noise and/or frequency scaling, since you start the timer for the 2nd interval right after printf returns from making a system call.

RDTSC counts reference cycles, not core clock cycles, so you're mostly just discovering the CPU frequency. (A lower core clock speed means more reference cycles elapse during the fixed number of core clocks it takes to run two rdtsc instructions.) Your RDTSC instructions are basically back-to-back; the nop instructions are negligible compared to the number of uops that rdtsc itself decodes to (on normal CPUs, including your Broadwell).
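
For a rough sense of scale (assuming the TSC on your i7-5600U ticks at about the rated 2.6 GHz): 78 reference cycles is roughly 78 / 2.6 ≈ 30 ns of wall-clock time. If the core happens to be idling at, say, 800 MHz rather than 2.6 GHz, that 30 ns is only about 24 core clock cycles, i.e. roughly the back-to-back cost of the two rdtsc instructions themselves.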

Also RDTSC can be reordered by out-of-order execution. Not that nop does anything that the CPU would have to wait for; it's just delaying the front-end by 0.25 or 1.75 cycles from issuing the uops of the 2nd rdtsc. (Actually I'm not sure if the microcode sequencer can send uops in the same cycle as a uop from another instruction. So maybe 1 or 2 cycles).

My answer on How to get the CPU cycle count in x86_64 from C++? has a bunch of background on how RDTSC works.
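
If you do want to see a per-NOP effect, a more defensible measurement is to fence around the timed region and take the minimum over many repetitions, so one-off noise and frequency ramp-up get filtered out. Here's a minimal sketch of that idea (not the code from the linked answer; the repetition count is arbitrary):

#include <stdio.h>
#include <x86intrin.h>   // __rdtsc, _mm_lfence

// Time a short block with lfence on both sides to limit reordering,
// and keep the minimum over many trials to filter out noise.
static unsigned long long min_rdtsc_delta(void) {
    unsigned long long best = ~0ULL;
    for (int i = 0; i < 100000; i++) {
        _mm_lfence();
        unsigned long long t0 = __rdtsc();
        _mm_lfence();
        asm volatile("nop; nop; nop; nop; nop; nop; nop;");
        _mm_lfence();
        unsigned long long t1 = __rdtsc();
        _mm_lfence();
        if (t1 - t0 < best) best = t1 - t0;
    }
    return best;
}

int main() {
    printf("min rdtsc delta around seven NOPs: %llu\n", min_rdtsc_delta());
}

Even then, seven NOPs will mostly disappear into the ~20-uop cost of each rdtsc; the point stands that rdtsc is not a per-instruction stopwatch.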


You might want the pause instruction. It idles for ~100 core clock cycles on Skylake and later, or ~5 cycles on earlier Intel cores. Or spin on PAUSE + RDTSC. How to calculate time for an asm delay loop on x86 linux? shows a possibly-useful delay spinloop that sleeps for a given number of RDTSC counts. You need to know the reference clock speed to correlate that with nanoseconds, but it's typically around the rated max non-turbo clock on Intel CPUs. e.g. 4008 MHz on a 4.0GHz Skylake.
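
A minimal sketch of that kind of spin-delay (not the exact code from the linked answer; the 2.6 GHz reference-clock figure for your i7-5600U is an assumption you should calibrate rather than trust):

#include <stdint.h>
#include <x86intrin.h>   // __rdtsc, _mm_pause

// Spin until the TSC has advanced by at least 'ticks' reference cycles.
// If the TSC ticks at ~2.6 GHz, about 260 ticks is roughly 100 ns.
static inline void delay_tsc(uint64_t ticks) {
    uint64_t start = __rdtsc();
    while (__rdtsc() - start < ticks)
        _mm_pause();   // yields to the sibling hyperthread and saves a little power
}

// usage: delay_tsc(260);   // ~100 ns at a 2.6 GHz reference clock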

If available, tpause takes a TSC timestamp as the wake-up time. (See the link.) But it's only available on low-power Tremont for now.
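
For reference, a hedged sketch of what that could look like via the WAITPKG intrinsic (assuming a CPU and compiler with waitpkg support, e.g. gcc -mwaitpkg; I haven't tried this on real Tremont hardware):

#include <stdint.h>
#include <immintrin.h>   // _tpause (requires WAITPKG hardware and -mwaitpkg)

// Wait until the TSC reaches 'deadline' (an absolute TSC value, not a delta).
static inline void sleep_until_tsc(uint64_t deadline) {
    _tpause(0, deadline);   // 0 = deeper C0.2 state; 1 = faster-wakeup C0.1
}

// usage: sleep_until_tsc(__rdtsc() + 260);   // ~100 ns at a 2.6 GHz reference clock

Note that tpause can wake early (e.g. on an interrupt, or at the OS-configured maximum wait), so a loop around it is safer if you really need the full delay.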


Inserting NOPs is never going to work reliably on modern superscalar / out-of-order x86 with huge reorder buffers! Modern x86 isn't a microcontroller where you can calculate iterations for a nested delay loop. If surrounding code doesn't bottleneck on the front-end, OoO exec is just going to hide the cost of feeding your NOPs through the pipeline.

Instructions don't have a cost you can just add up. To model the cost of an instruction, you need to know its latency, front-end uop count, and which back-end execution ports it needs. And any special effects on the pipeline, like lfence waiting for all previous uops to retire before later ones can issue. How many CPU cycles are needed for each assembly instruction?
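
As a concrete example, nop on your Broadwell is a single front-end uop with no latency and no execution-port requirement, so seven of them cost at most a couple of 4-wide issue cycles and nothing in the back end; that's why they vanish next to rdtsc's microcode.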

See also What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?


Note that your desired "sleep" time of ~100 ns isn't necessarily even long enough to drain the out-of-order execution buffer (the ROB) if there are cache misses in flight, or possibly even a very slow ALU dependency chain. (The latter is unlikely outside of artificial cases.) So you probably don't want to do anything like lfence.

Peter Cordes
  • doesn't using `rdtscp` get around the reordering issue? – Gavin Portwood Oct 15 '19 at 03:42
  • @GavinPortwood: partially. It's a one-way barrier so it's useful for the end of a timed region. – Peter Cordes Oct 15 '19 at 03:46
  • 1
    @Peter - it's documented as a one way barrier but I haven't found any evidence that it behaves other than a full barrier (it seems to have lfence-like full barrier semantics when I tried it). I guess it's still useful in that it's tied to the instruction so you don't need two fences: before and after (to prevent the two different types of movement that might be a problem). – BeeOnRope Oct 15 '19 at 21:27
  • @BeeOnRope: I've seen people (e.g. Hadi) say they had better results fencing after as well at the bottom of a timed region, to stop later instructions from starting to execute while the rdtsc uops were still executing. Not sure why that would matter or if that was even a real effect. But yeah, I'd be surprised if later instructions could execute in parallel with any instructions *before* `rdtscp`; a mechanism like `lfence` would not be at all surprising. – Peter Cordes Oct 15 '19 at 21:56
  • @PeterCordes "Also RDTSC **and** can be reordered by out-of-order execution" Here either an "and" to much or a missing word. – Sep Roland Oct 16 '19 at 14:03
  • @SepRoland: thanks, I'm not sure how that happened, but I don't think I had anything profound that got left out. Just a stray "and" after editing; or fingers out of sync with brain. – Peter Cordes Oct 16 '19 at 15:39
  • @PeterCordes - yeah I remember that. Note that we are talking about slightly different scenarios here: I'm suggesting that `rdtscp` is perhaps more efficient than `lfence; rdtsc; lfence` even if "rdtscp implies lfence" because at least it only implies _one_ lfence, not two. Here "more efficient" in the sense of "the cost to run this sequence" i.e., how much your timing code costs you in the context of the calling code. IIRC Hadi found that `lfence; rdtsc; lfence` produced _more reliable_ timing results, from the PoV of timing the code inside the timed region. – BeeOnRope Oct 16 '19 at 20:20
  • That makes sense to me: `rdtsc` has one `lfence` type operation in there somewhere, but also a bunch of other uops either before, after or a bit of both - which could presumably interact with the timed code. `lfence; rdtsc; lfence` segregates everything more strongly and could in principle lead to more reproducible results. – BeeOnRope Oct 16 '19 at 20:22
  • @BeeOnRope: Oh yes, your first comment did still mention a fence after, sorry. Agreed `rdtscp ; lfence` is probably less total timing overhead than `lfence; rdtsc; lfence`, and otherwise very similar. Although it does spend at least an extra uop to get the core ID into a register. Or not: lfence+rdtsc is 2+20 uops on SKL, vs. 22 for rdtscp so it might literally be lfence+rdtsc in microcode, with a mov-immediate snuck in there. At least lfence is only 2 uops, not another microcoded instruction that needs a whole line of uop-cache to itself. – Peter Cordes Oct 16 '19 at 20:38
  • @PeterCordes well I'm also suggesting you can possibly just replace `lfence; rdtsc; lfence` with _just_ `rdtscp` because as far as I can tell it has the full lfence effect on current hardware and despite the documentation. That said, thinking about the hardware, maybe that's wrong: it has the full lfence effect from the PoV of the surrounding code, but presumably not for all the uops making up the `rdtscp` itself: assuming the logical "internal" lfence is near the start, I guess other ops could sneak into the `rdtscp` uops. – BeeOnRope Oct 16 '19 at 20:46
  • _Then again_ - later ALU-type uops shouldn't really interfere with earlier ones, should they? The schedulers are mostly oldest-uop-first, so it isn't totally clear how the interference happens. I mean some scenarios are possible: e.g., the later uops become ready earlier, so execute earlier **and** they are consuming some resource like the divider or a fill buffer that then is not available to the older-but-not-yet-executed uop when it becomes ready. Seems kinda obscure tho... – BeeOnRope Oct 16 '19 at 20:48
  • 1
    @BeeOnRope: well yeah, as we discussed it's not clear why `lfence; rdtsc; lfence` would be any better, but in practice it seemed to be. So I think we can say on real HW `rdtscp` = `lfence; rdtsc` pretty much exactly, assuming the lfence uops are (almost) the first. Whether you follow that with another `lfence` or not at the bottom of a timed region is a totally separate matter. – Peter Cordes Oct 16 '19 at 21:01
  • 1
    But yeah, oldest-ready-first = no problem was my thought too, given that execution units are fully pipelined and I don't expect rdtsc to touch the divider. Physical register allocation happens at issue/rename/allocate time, but a few are dependent on execution. (And BTW, I have no idea what all 22 of the rdtscp uops actually do, and how many of them *could* go before an lfence without sampling the clock before execution of all earlier uops. It seems like a sensible choice to do the lfence part first even if it wasn't necessarily required, though). – Peter Cordes Oct 16 '19 at 21:04