Yes, Agner Fog's throughput/latency numbers are fully consistent with reducing x87 precision speeding up the worst case.
It also makes sense given the way modern div/sqrt hardware works: a radix-16 or radix-1024 divider iteratively computes more bits of the result, so needing fewer correct bits means it can stop sooner. (See How sqrt() of GCC works after compiled? Which method of root is used? Newton-Raphson? and The integer division algorithm of Intel's x86 processors.)
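To make the "stop sooner" point concrete, here's a toy radix-2 digit-recurrence divider in C. It's only a sketch (real hardware retires several result bits per step at radix-16 or higher, and `mantissa_div` is a made-up name), but the loop structure shows why asking for 24 correct bits costs far fewer iterations than asking for 64:

```c
#include <stdint.h>
#include <stdio.h>

/* Toy radix-2 digit recurrence: one quotient bit per iteration.
 * Real dividers are radix-16 / radix-1024 (several bits per step),
 * but the cost still scales with the number of result bits wanted. */
static uint64_t mantissa_div(uint64_t a, uint64_t b, int bits)
{
    /* a and b are normalized mantissas with the leading 1 at bit 52. */
    uint64_t q = 0, rem = a;
    for (int i = 0; i < bits; i++) {
        q <<= 1;
        if (rem >= b) { rem -= b; q |= 1; }   /* next quotient bit */
        rem <<= 1;
    }
    return q;    /* 'bits' result bits; exponent and rounding left out */
}

int main(void)
{
    uint64_t a = 3ULL << 51;   /* 1.5  * 2^52 */
    uint64_t b = 5ULL << 50;   /* 1.25 * 2^52 */
    /* 24 iterations for float precision vs. 53 for double vs. 64 for x87: */
    printf("%llx\n", (unsigned long long)mantissa_div(a, b, 24));
    printf("%llx\n", (unsigned long long)mantissa_div(a, b, 53));
    return 0;
}
```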
It also makes sense given that x87 `fdiv` and SSE1 `divss` run on the same hardware, with `divss` having the same best case (round divisors) but a better worst case. The x87 precision bits presumably control the HW divider exactly the same way that `divss` or `divsd` do.
Details below.
Yes, x87 can be limited to 64-bit or 32-bit total width (`double` or `float`), down from the standard 80-bit. And yes, this does slightly speed up the `fsqrt` and `fdiv` worst cases, to about the same speed as scalar SSE/SSE2 of the same precision (`sqrtss` = scalar single / `sqrtsd` = scalar double). Nothing else runs faster or slower.
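If you want to flip the precision-control field yourself, this is roughly what it looks like in GNU C with `fnstcw`/`fldcw` inline asm. A sketch only: `set_x87_precision` is a made-up helper, and 0x000 / 0x200 / 0x300 are the PC-field encodings (bits 8-9 of the control word) for 24-bit / 53-bit / 64-bit mantissa.

```c
#include <stdio.h>

/* Set the x87 precision-control field (bits 8-9 of the FPU control word).
 * 0x000 = 24-bit (float), 0x200 = 53-bit (double), 0x300 = 64-bit (default). */
static void set_x87_precision(unsigned short pc_bits)
{
    unsigned short cw;
    __asm__ __volatile__ ("fnstcw %0" : "=m"(cw));   /* read control word */
    cw = (unsigned short)((cw & ~0x0300) | (pc_bits & 0x0300));
    __asm__ __volatile__ ("fldcw %0" : : "m"(cw));   /* write it back */
}

int main(void)
{
    volatile long double x = 1.0L, y = 3.0L;   /* long double keeps GCC on x87 */

    set_x87_precision(0x000);        /* 24-bit mantissa: fdiv can stop sooner */
    long double q = x / y;
    printf("%.20Lf\n", q);

    set_x87_precision(0x300);        /* restore the 64-bit default */
    return 0;
}
```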
It does not make x87 faster than SSE, so at this point it's mostly a CPU-history curiosity.
Apparently DirectX does (used to?) actually set x87 precision to 24-bit mantissa (`float`), and MSVC's CRT startup used to set it to 53-bit mantissa (`double`). See Bruce Dawson's https://randomascii.wordpress.com/2012/03/21/intermediate-floating-point-precision/. But Microsoft's historical weirdness is the exception; other toolchains / OSes don't mess around with x87.
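On the MSVC side, the documented way to do this is `_controlfp_s` from `<float.h>`; the sketch below (the helper name is mine) is roughly what the old CRT-startup / DirectX behaviour amounted to. Note that 64-bit MSVC doesn't support changing the precision-control field, since x64 code uses SSE2 anyway.

```c
#include <float.h>   /* MSVC-specific: _controlfp_s, _PC_24 / _PC_53 / _PC_64, _MCW_PC */
#include <stdio.h>

/* Drop x87 to 53-bit mantissa, like MSVC's CRT startup used to do.
 * _PC_24 would be the DirectX-style 24-bit setting. */
static void drop_x87_to_double_precision(void)
{
    unsigned int current;
    _controlfp_s(&current, _PC_53, _MCW_PC);   /* change only the precision field */
}

int main(void)
{
    drop_x87_to_double_precision();
    printf("x87 precision set to 53-bit mantissa\n");
    return 0;
}
```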
Agner Fog's instruction tables don't mention x87 precision for Sandybridge or newer CPUs. This might mean it no longer helps, or (I think) just that Agner decided it wasn't worth mentioning: his SnB-and-newer tables don't have any footnotes at all, so I think that's the explanation. SnB's divider isn't very different from NHM's, as far as I know.
For Nehalem:

- `fdiv`: 7-27 cycles latency = throughput (not pipelined at all), with a footnote that says "Round divisors or low precision give low values."
- `divsd`/`divpd`: 7-22 cycles latency = throughput.
- `divss`/`divps`: 7-14 cycles latency = throughput.
So the best-case performance (divider occupied for 7 cycles) is the same for all forms, with the worst case getting worse the more mantissa bits the result can have.
We know that divider HW is iterative and has to keep going longer to compute more bits, so it's 100% plausible that setting x87 precision to 24-bit or 53-bit helps performance in exactly the same way that using `divss` does. They share the same hardware execution unit anyway.
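If you want to see the "round divisors give low values" effect for yourself, a crude timing loop like the one below should do it on GCC/Clang, where `long double` math still goes through x87 `fdiv`. It's a sketch, not a rigorous benchmark, and `time_fdiv_chain` is a made-up name; the point is just that a loop-carried chain of divisions by 2.0 should finish noticeably sooner than one dividing by an "ugly" value.

```c
#include <stdio.h>
#include <time.h>

static double time_fdiv_chain(long double divisor)
{
    volatile long double d = divisor;   /* stop x/2.0 being turned into x*0.5 */
    volatile long double x = 1e6L;
    clock_t t0 = clock();
    for (int i = 0; i < 100 * 1000 * 1000; i++)
        x = x / d + 1e6L;               /* loop-carried dependency through fdiv */
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    printf("divide by 2.0:          %.3f s\n", time_fdiv_chain(2.0L));
    printf("divide by 1.0123456789: %.3f s\n", time_fdiv_chain(1.0123456789L));
    return 0;
}
```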
IvyBridge did finally pipeline the FP divider. Haswell didn't make any major changes vs. IvB to div numbers. These are the HSW numbers:
- `fdiv`: 10-24c latency, 8-18c throughput
- `divsd` / `divpd xmm`: 10-20c latency, 8-14c throughput
- `divss` / `divps xmm`: 10-13c latency, 7c throughput (fixed latency is nice for the scheduler)
See also Floating point division vs floating point multiplication, where I collected Agner Fog's data for recent Intel CPUs, including 256-bit YMM vectors. I left out x87 there because it's basically irrelevant for high performance.
Normally you'd just use SSE1 because it's generally faster (no front-end bandwidth spent on `fxch` and `fld` register copies, thanks to a flat register set and 2-operand instructions instead of a stack). And the opportunity to use SIMD for some cases (typically 4x `float` sqrt results in the same time as 1) makes it a huge win vs. setting the x87 FPU to 32-bit.
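As a minimal illustration of that SIMD win, using the standard SSE1 intrinsics from `<xmmintrin.h>` (one `sqrtps` computes four `float` square roots for roughly the cost of one scalar sqrt):

```c
#include <xmmintrin.h>   /* SSE1 intrinsics */
#include <stdio.h>

int main(void)
{
    float in[4]  = { 2.0f, 3.0f, 5.0f, 7.0f };
    float out[4];

    __m128 v = _mm_loadu_ps(in);   /* load 4 floats */
    v = _mm_sqrt_ps(v);            /* 4 square roots in one instruction */
    _mm_storeu_ps(out, v);

    for (int i = 0; i < 4; i++)
        printf("sqrt(%g) = %g\n", in[i], out[i]);
    return 0;
}
```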
Most SSE math instructions have similar throughput and latency to their x87 counterparts, but x87 has more overhead.
If you need to make a 32-bit binary that's compatible with ancient CPUs without even SSE1, then yes, you could reduce the x87 precision to 24-bit if `fdiv` and `fsqrt` performance is important for your code. (It might possibly also speed up some of the microcoded x87 instructions like `fsin` and `fyl2x`, IDK.)
Or if reducing precision to `float` is too drastic, then you're looking at SSE2 for `double` math in XMM regs. It's baseline for x86-64, so again it's only worth thinking about if for some reason you have to make a 32-bit binary. The newest CPU without it is Athlon XP (if you don't count stuff like current Geode).
> Same question for the FPU rounding mode, and the system side: can an API also spoil my setting of it?
AFAIK, nothing will ever leave the rounding mode changed. That would be a big difference, and doesn't help performance.
If anyone had ever been able to justify doing so, someone would have done it for the performance of C that uses `(int)float` without SSE convert-with-truncation instructions (or SSE3 `fisttp` for an x87 version), to avoid having to set the x87 rounding mode to truncation (toward 0) and then restore it every time an FP value is converted to an integer.
Most compilers assume round-to-nearest when optimizing.
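To show which conversions actually care about the rounding mode, here's a tiny sketch: the C cast must truncate toward zero (which is what forced the x87 mode-flipping dance before SSE / SSE3 `fisttp`), while C99's `lrintf` uses the current rounding mode and so never needs to change it.

```c
#include <math.h>    /* lrintf; link with -lm on glibc */
#include <stdio.h>

int main(void)
{
    float x = 2.7f;
    int  truncated = (int)x;     /* must be 2: truncation toward zero is required,
                                    regardless of the current rounding mode */
    long nearest   = lrintf(x);  /* 3 under the default round-to-nearest mode */
    printf("(int)%g = %d, lrintf(%g) = %ld\n", x, truncated, x, nearest);
    return 0;
}
```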