I've written some code for profiling small functions. At the high level it:
- Sets the thread affinity to only one core and the thread priority to maximum.
Computes statistics from doing the following 100 times:
- Estimate the latency of a function that does nothing.
- Estimate the latency of the test function.
- Subtract the first from the second to remove the cost of doing function-call overhead, thereby roughly getting the cost of the test function's contents.
To estimate the latency of a function, it:
- Invalidates caches (this is difficult to actually do in user-mode, but I allocate and write a buffer the size of the L3 to memory, which should maybe help).
- Yields the thread, so that the profile loop has as-few-as-possible context switches.
- Gets the current time from a
std::chrono::high_resolution_clock
(which seems to compile tosystem_clock
, but). - Runs the profile loop 100,000,000 times, calling the tested function within.
- Gets the current time from a
std::chrono::high_resolution_clock
and subtracts to get latency.
Because at this level, individual instructions matter, at all points we have to write very careful code to ensure that the compiler doesn't elide, inline, cache, or treat-differently the functions. I have manually validated the generated assembly in various test cases, including the one which I present below.
I am getting extremely low (sub-nanosecond) latencies reported in some cases. I have tried everything I can think of to account for this, but cannot find an error.
I am looking for an explanation accounting for this behavior. Why are my profiled functions taking so little time?
Let's take the example of computing a square root for float
.
The function signature is float(*)(float)
, and the empty function is trivial:
empty_function(float):
ret
Let's compute the square root by using the sqrtss
instruction, and by the multiplication-by-reciprocal-square-root hack. I.e., the tested functions are:
sqrt_sseinstr(float):
sqrtss xmm0, xmm0
ret
sqrt_rcpsseinstr(float):
movaps xmm1, xmm0
rsqrtss xmm1, xmm0
mulss xmm0, xmm1
ret
Here's the profile loop. Again, this same code is called with the empty function and with the test functions:
double profile(float):
...
mov rbp,rdi
push rbx
mov ebx, 0x5f5e100
call 1c20 <invalidate_caches()>
call 1110 <sched_yield()>
call 1050 <std::chrono::high_resolution_clock::now()>
mov r12, rax
xchg ax, ax
15b0:
movss xmm0,DWORD PTR [rip+0xba4]
call rbp
sub rbx, 0x1
jne 15b0 <double profile(float)+0x20>
call 1050 <std::chrono::high_resolution_clock::now()>
...
The timing result for sqrt_sseinstr(float)
on my Intel 990X is 3.60±0.13 nanoseconds. At this processor's rated 3.46 GHz, that works out to be 12.45±0.44 cycles. This seems pretty spot-on, given that the docs say the latency of sqrtss
is around 13 cycles (it's not listed for this processor's Nehalem architecture, but it seems likely to also be around 13 cycles).
the timing result for sqrt_rcpsseinstr(float)
is stranger: 0.01±0.07 nanoseconds (or 0.02±0.24 cycles). This is flatly implausible unless another effect is going on.
I thought perhaps the processor is able to hide the latency of the tested function somewhat or perfectly because the tested function uses different instruction ports (i.e. superscalarity is hiding something)? I tried to analyze this by hand, but didn't get very far because I didn't really know what I was doing.
(Note: I cleaned up some of the assembly notation for your convenience. An unedited objdump
of the whole program, which includes several other variants, is here, and I am temporarily hosting the binary here (x86-64 SSE2+, Linux).)
The question, again: Why are some profiled functions producing implausibly small values? If it is a higher-order effect, explain?