Interpreting Absurdly-Low Measured Latency in Careful Profile (Superscalarity Effects?)

Question

I've written some code for profiling small functions. At the high level it:

Sets the thread affinity to only one core and the thread priority to maximum.
Computes statistics from doing the following 100 times:
1. Estimate the latency of a function that does nothing.
2. Estimate the latency of the test function.
3. Subtract the first from the second to remove the cost of doing function-call overhead, thereby roughly getting the cost of the test function's contents.

To estimate the latency of a function, it:

Invalidates caches (this is difficult to actually do in user-mode, but I allocate and write a buffer the size of the L3 to memory, which should maybe help).
Yields the thread, so that the profile loop has as-few-as-possible context switches.
Gets the current time from a std::chrono::high_resolution_clock (which seems to compile to system_clock, but).
Runs the profile loop 100,000,000 times, calling the tested function within.
Gets the current time from a std::chrono::high_resolution_clock and subtracts to get latency.

Because at this level, individual instructions matter, at all points we have to write very careful code to ensure that the compiler doesn't elide, inline, cache, or treat-differently the functions. I have manually validated the generated assembly in various test cases, including the one which I present below.

I am getting extremely low (sub-nanosecond) latencies reported in some cases. I have tried everything I can think of to account for this, but cannot find an error.

I am looking for an explanation accounting for this behavior. Why are my profiled functions taking so little time?

Let's take the example of computing a square root for float.

The function signature is float(*)(float), and the empty function is trivial:

empty_function(float):
    ret

Let's compute the square root by using the sqrtss instruction, and by the multiplication-by-reciprocal-square-root hack. I.e., the tested functions are:

sqrt_sseinstr(float):
    sqrtss  xmm0, xmm0
    ret
sqrt_rcpsseinstr(float):
    movaps  xmm1, xmm0
    rsqrtss xmm1, xmm0
    mulss   xmm0, xmm1
    ret

Here's the profile loop. Again, this same code is called with the empty function and with the test functions:

double profile(float):
    ...

    mov    rbp,rdi

    push   rbx
    mov    ebx, 0x5f5e100

    call   1c20 <invalidate_caches()>
    call   1110 <sched_yield()>

    call   1050 <std::chrono::high_resolution_clock::now()>

    mov    r12, rax
    xchg   ax,  ax

    15b0:
    movss  xmm0,DWORD PTR [rip+0xba4]
    call   rbp
    sub    rbx, 0x1
    jne    15b0 <double profile(float)+0x20>

    call   1050 <std::chrono::high_resolution_clock::now()>

    ...

The timing result for sqrt_sseinstr(float) on my Intel 990X is 3.60±0.13 nanoseconds. At this processor's rated 3.46 GHz, that works out to be 12.45±0.44 cycles. This seems pretty spot-on, given that the docs say the latency of sqrtss is around 13 cycles (it's not listed for this processor's Nehalem architecture, but it seems likely to also be around 13 cycles).

the timing result for sqrt_rcpsseinstr(float) is stranger: 0.01±0.07 nanoseconds (or 0.02±0.24 cycles). This is flatly implausible unless another effect is going on.

I thought perhaps the processor is able to hide the latency of the tested function somewhat or perfectly because the tested function uses different instruction ports (i.e. superscalarity is hiding something)? I tried to analyze this by hand, but didn't get very far because I didn't really know what I was doing.

^{(Note: I cleaned up some of the assembly notation for your convenience. An unedited objdump of the whole program, which includes several other variants, is here, and I am temporarily hosting the binary here (x86-64 SSE2+, Linux).)}

The question, again: Why are some profiled functions producing implausibly small values? If it is a higher-order effect, explain?

The cost of the surrounding loop and of the return from the null function and the real function are all going to vary by many cycles depending on whether the branches are correctly predicted. In general, this sort of micro-benchmarking tends to be futile. — Patricia Shanahan, Mar 02 '19 at 00:39
Have you tried using the `RDTSC` instruction for measuring time? Don't forget to use something like `CPUID` to prevent the CPU's out-of-order execution messing you up. Other things to consider: disabling interrupts, disabling turbo (i.e. fixing the clock rate) and monitoring CPU counters to see that the expected number of instructions are actually executed (e.g. `perf` utility). — zinga, Mar 02 '19 at 01:51
You're not *using* the result of the function, so the processor can overlap instructions almost arbitrarily (out of order + no dependencies + your benchmark*ing* code doesn't use vector/floating-point registers while the benchmark*ed* code *only* uses vector/floating-point registers), so only long instruction latencies on poorly-pipelined instructions/ports (`sqrtss`) will show up at all. Effectively, you're measuring instruction *throughput* rather than *latency*. — EOF, Mar 02 '19 at 09:08
From [Agner's instruction tables](https://www.agner.org/optimize/instruction_tables.pdf), the throughput of `sqrtss` and `rsqrtss` is 7-18 and 2 cycles, respectively. Also the throughput of `mulss` is 1 cycle (which is dependent on `rsqrtss`). The throughput 12.45±0.44 of the first function might actually smaller than what it really is. I think the empty function time is an overestimate of the measurement overhead. You should inline the code and use RDTSC instead. See: https://stackoverflow.com/questions/13772567/get-cpu-cycle-count. — Hadi Brais, Mar 02 '19 at 17:37
I get tired of waiting for Peter Cordes to appear, particularly because so many of the comments are just ridiculously wrong. (No, `rdtsc` is not the answer, though `rdtscp` would be, somewhat incidentally) — EOF, Mar 02 '19 at 18:35
For what it's worth, the OP doesn't make it clear whether they want to measure the latency of the throughput of the function. They have set it up to measure throughput, I guess that is _usually_ the more interesting value for mathematical functions like this, where stuff like `sqrt(sqrt(sqrt(x)))` isn't all that common. Still, the OP could clarify whether they want to measure the time it takes for repeated function calls assuming (1) the result of one call is fed into the next (latency) OR (2) the input of the next function is independent of the prior (throughput). — BeeOnRope, Mar 03 '19 at 01:06
@BeeOnRope Except that "latency" is literally in the title of the question. — EOF, Mar 03 '19 at 05:30
@EOF - right, but people often use that world loosely when they are talking about "timing" and when you dig into their actual use case they don't necessarily care about about true output-feeds-inputs latency. In any case, the OP mentions "latency" but then sets up what we'd call a "throughput" test, so either way there is an inconsistency, either with the term they used or the way they set up their test. I'm hoping for a clarification from the OP. — BeeOnRope, Mar 03 '19 at 15:40

BeeOnRope · Answer 1 · 2019-03-03T00:24:44.970

The problem is with the basic approach of subtracting out the "latency"¹ of an empty function, as described:

Estimate the latency of a function that does nothing.

Estimate the latency of the test function.

Subtract the first from the second to remove the cost of doing function-call overhead, thereby roughly getting the cost of the test function's contents.

The built-in assumption is that the cost of calling a function is X, and if the latency of the work done in the function is Y, then the total cost will be something like X + Y.

This isn't generally true for any two blocks of work and especially isn't true when when one of them is "calling a function". A more sophisticated view would be that the total time would be somewhere between min(X, Y) and X + Y - but even this is often wrong depending on the details. Still, it's enough of a refinement to explain what is going on here: the cost of the function is not additive with the work being doing in the function: they happen in parallel.

The cost of an empty function call is something like 4 to 5 cycles on modern Intel, probably bottlenecked on the front-end throughput for the two taken branches, and possibly by branch and return predictor latency.

However, when you add additional work to an empty function, it generally won't compete for the same resources, and its execution instructions won't depend on the "output" of the call (i.e., the work will form a separate dependency chain), except perhaps in rare cases where the stack pointer is manipulated and the stack engine doesn't remove the dependency.

So essentially the function will take the greater of the time needed for the function call mechanics, or the actual work done by the function. This approximation isn't exact, because some types of work may actually add to the overhead of the function call (e.g., if there are enough instructions for the front end to get through before getting to the ret, the total time may increase on top of the 4-5 cycle empty function time, even if the total work is less than that) - but it's a good first order approximation.

Your first function takes enough time that the actual work dominates the execution time. The second function is much faster, however, enabling it to "hide under" the existing time taken by the call/ret mechanics.

The solution is simple: duplicate the work within the function N times, so that the work always dominates. N=10 or N=50 or something like that is fine. You have to decide whether you want to test latency, in which case the output of one copy of the work should feed into the next, or throughput, in which case it shouldn't.

On other hand, if you actually want to test the cost of the function call + work, e.g., because that's how you'll be using it in real life, it is likely the results you have gotten is already close to correct: stuff really can be "incrementally free" when it hides behind a function call.

¹ I'm putting "latency" in quotes here because it isn't clear whether we should be talking about the latency of call/ret or the throughput. call and ret don't have any explicit outputs (and ret has no input), so it doesn't participate in a classic register-based dependency chain - but it might make sense to think of latency if you consider other hidden architectural components like the instruction pointer. In either case latency of throughput mostly points down to the same thing because all call and ret on a thread operate on the same state, so it doesn't make sense to have say "independent" vs "dependent" call chains.

EOF · Answer 2 · 2019-03-02T18:37:40.270

Your benchmarking approach is fundamentally wrong, and your "careful code" is bogus.

First, emptying the cache is bogus. Not only will it quickly be repopulated with the required data, but also the examples you have posted have very little memory interaction (only cache access by call/ret and a load we'll get to.

Second, yielding before the benchmarking loop is bogus. You iterate 100000000 times, which even on a reasonably fast modern processor will take longer than typical scheduling clock interrupts on a stock operating system. If, on the other hand, you disable scheduling clock interrupts, then yielding before the benchmark doesn't do anything.

Now that the useless incidental complexity is out of the way, about the fundamental misunderstanding of modern CPUs:

You expect loop_time_gross/loop_count to be the time spent in each loop iteration. This is wrong. Modern CPUs do not execute instructions one after the other, sequentially. Modern CPUs pipeline, predict branches, execute multiple instructions in parallel, and (reasonably fast CPUs) out of order.

So after the first handful of iterations of the benchmarking loop, all branches are perfectly predicted for the next almost 100000000 iterations. This enables the CPU to speculate. Effectively, the conditional branch in the benchmarking loop goes away, as does most of the cost of the indirect call. Effectively, the CPU can unroll the loop:

movss  xmm0, number
movaps  xmm1, xmm0
rsqrtss xmm1, xmm0
mulss   xmm0, xmm1
movss  xmm0, number
movaps  xmm1, xmm0
rsqrtss xmm1, xmm0
mulss   xmm0, xmm1
movss  xmm0, number
movaps  xmm1, xmm0
rsqrtss xmm1, xmm0
mulss   xmm0, xmm1
...

or, for the other loop

movss  xmm0, number
sqrtss  xmm0, xmm0
movss  xmm0, number
sqrtss  xmm0, xmm0
movss  xmm0, number
sqrtss  xmm0, xmm0
...

Notable, the load of number is always the same (thus quickly cached), and it overwrites the just computed value, breaking the dependency chain.

To be fair, the

call   rbp
sub    rbx, 0x1
jne    15b0 <double profile(float)+0x20>

are still executed, but the only resources they take from the floating-point code are decode/micro-op cache and execution ports. Notably, while the integer loop code has a dependency chain (ensuring a minimum execution time), the floating-point code does not carry a dependency on it. Furthermore, the floating-point code consists of many mutually totally independent short dependency chains.

Where you expect the CPU to execute instructions sequentially, the CPU can instead execute them in parallel.

A small look at https://agner.org/optimize/instruction_tables.pdf reveals why this parallel execution doesn't work for sqrtss on Nehalem:

instruction: SQRTSS/PS
latency: 7-18
reciprocal throughput: 7-18

i.e., the instruction cannot be pipelined and only runs on one execution port. In contrast, for movaps, rsqrtss, mulss:

instruction: MOVAPS/D
latency: 1
reciprocal throughput: 1

instruction: RSQRTSS
latency: 3
reciprocal throughput: 2

instruction: MULSS
latency: 4
reciprocal throughput: 1

the maximum reciprocal throughput of the dependency chain is 2, so you can expect the code to finish executing one dependency chain every 2 cycles in the steady state. At this point, the execution time of the floating-point part of the benchmarking loop is less than or equal to the loop overhead and overlapped with it, so your naive approach to subtract the loop overhead leads to nonsensical results.

If you wanted to do this properly, you would ensure that separate loop iterations are dependent on each other, for example by changing your benchmarking loop to

float x = INITIAL_VALUE;
for (i = 0; i < 100000000; i++)
    x = benchmarked_function(x);

Obviously you will not benchmark the same input this way, unless INITIAL_VALUE is a fixed point of benchmarked_function(). However, you can arrange for it to be a fixed point of an expanded function by computing float diff = INITIAL_VALUE - benchmarked_function(INITIAL_VALUE); and then making the loop

float x = INITIAL_VALUE;
for (i = 0; i < 100000000; i++)
    x = diff + benchmarked_function(x);

with relatively minor overhead, though you should then ensure that floating-point errors do not accumulate to significantly change the value passed to benchmarked_function().

@PeterCordes Honestly, I haven't had a Nehalem in ages, and I'm not particularly familiar with the uArch, so there are a few points in this answer where I'm probably off by a fair bit. In particular the use of execution ports by `rsqrtss` and `mulss` may conflict and cause throughput to be a lot lower than what I wrote. I would not mind if you wrote an answer or edited this one. — EOF, Mar 02 '19 at 19:14
(1) The OP has only provided an *example* piece of code whose execution time need to be measured. So flushing the caches might be needed for other pieces of code *in general*. (2) Similarly, yielding the time slice of the thread might be useful in general, for example, when the execution time of the timed region, is small enough. (3) I'm pretty sure someone who can write such a question has some idea about how modern CPUs work. (4) The `call` instruction is not completely for free, it does gets decoded and it does consume execution resources... — Hadi Brais, Mar 02 '19 at 19:19
...(5) So what if that load breaks the dep chain? That's what the OP wants apparently. (6) I think the loop throughput may be more than 2 cycles because `rsqrtss` and `mulss` contends on ports 0 and 1, but we have to measure it. (7) I think subtracting the call overhead is not leading to *nonsensical results*, but to results that are a couple of cycles smaller that what the actual throughput really is. — Hadi Brais, Mar 02 '19 at 19:19
My comment under the question is not *ridiculously wrong*. Using RDTSC allows the OP to use a smaller a number of iterations (about 100 vs. 100 million). There may be no need for RDTSCP in this case, but sure, RDTSCP can be used. — Hadi Brais, Mar 02 '19 at 19:21
@HadiBrais I really don't care to argue with you about what should be obvious, but here we go: If the OP is confused, then making a [MCVE] is clearly a good idea. If the problem appears in an [MCVE] without cache and with large loop counts, then yes, the cache cleaning and yielding is meaningless fluff. Since a lower loop count would not matter, neither does `rdtsc`. If you can explain to me why I said that `rdtscp` *does* help, then I might start to believe you are at all qualified to discuss this with me. — EOF, Mar 02 '19 at 19:29
In general simply changing a benchmarking method from latency to throughput doesn't solve this problem. It would probably work here because latency is slow enough compared to inv throughput that the call cost will be exceeded, but that certainly isn't always the case. — BeeOnRope, Mar 03 '19 at 01:07
Sorry I meant "change from throughput to latency" above, not the other way around. — BeeOnRope, Mar 03 '19 at 15:41

Interpreting Absurdly-Low Measured Latency in Careful Profile (Superscalarity Effects?)

2 Answers2