Yes, Agner Fog's throughput/latency numbers are fully consistent with reducing x87 precision speeding up the worst case.
It also makes sense given the way modern div/sqrt hardware works: a radix-16 or radix-1024 divider iteratively computes more bits of the result, so needing fewer correct bits means it can stop sooner. (See How sqrt() of GCC works after compiled? Which method of root is used? Newton-Raphson? and The integer division algorithm of Intel's x86 processors.)
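To make the "stop sooner" point concrete, here's a toy radix-2 digit-recurrence divider in C. It's only a sketch (real hardware retires several result bits per step at radix-16 or higher, and `mantissa_div` is a made-up name), but the loop structure shows why asking for 24 correct bits costs far fewer iterations than asking for 64:

```c
#include <stdint.h>
#include <stdio.h>

/* Toy radix-2 digit recurrence: one quotient bit per iteration.
 * Real dividers are radix-16 / radix-1024 (several bits per step),
 * but the cost still scales with the number of result bits wanted. */
static uint64_t mantissa_div(uint64_t a, uint64_t b, int bits)
{
    /* a and b are normalized mantissas with the leading 1 at bit 52. */
    uint64_t q = 0, rem = a;
    for (int i = 0; i < bits; i++) {
        q <<= 1;
        if (rem >= b) { rem -= b; q |= 1; }   /* next quotient bit */
        rem <<= 1;
    }
    return q;    /* 'bits' result bits; exponent and rounding left out */
}

int main(void)
{
    uint64_t a = 3ULL << 51;   /* 1.5  * 2^52 */
    uint64_t b = 5ULL << 50;   /* 1.25 * 2^52 */
    /* 24 iterations for float precision vs. 53 for double vs. 64 for x87: */
    printf("%llx\n", (unsigned long long)mantissa_div(a, b, 24));
    printf("%llx\n", (unsigned long long)mantissa_div(a, b, 53));
    return 0;
}
```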
It also makes sense given that x87 `fdiv` and SSE1 `divss` run on the same hardware, with `divss` having the same best case (round divisors) but a better worst case. The x87 precision bits presumably control the HW divider exactly the same way that `divss` or `divsd` do.
Details below.
Yes, x87 can be limited to 64-bit or 32-bit total width (`double` or `float`), down from the standard 80-bit. And yes, this does slightly speed up the `fsqrt` and `fdiv` worst cases, to about the same speed as scalar SSE/SSE2 of the same precision (`sqrtss` = scalar single / `sqrtsd` = scalar double). Nothing else runs faster or slower.
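If you want to flip the precision-control field yourself, this is roughly what it looks like in GNU C with `fnstcw`/`fldcw` inline asm. A sketch only: `set_x87_precision` is a made-up helper, and 0x000 / 0x200 / 0x300 are the PC-field encodings (bits 8-9 of the control word) for 24-bit / 53-bit / 64-bit mantissa.

```c
#include <stdio.h>

/* Set the x87 precision-control field (bits 8-9 of the FPU control word).
 * 0x000 = 24-bit (float), 0x200 = 53-bit (double), 0x300 = 64-bit (default). */
static void set_x87_precision(unsigned short pc_bits)
{
    unsigned short cw;
    __asm__ __volatile__ ("fnstcw %0" : "=m"(cw));   /* read control word */
    cw = (unsigned short)((cw & ~0x0300) | (pc_bits & 0x0300));
    __asm__ __volatile__ ("fldcw %0" : : "m"(cw));   /* write it back */
}

int main(void)
{
    volatile long double x = 1.0L, y = 3.0L;   /* long double keeps GCC on x87 */

    set_x87_precision(0x000);        /* 24-bit mantissa: fdiv can stop sooner */
    long double q = x / y;
    printf("%.20Lf\n", q);

    set_x87_precision(0x300);        /* restore the 64-bit default */
    return 0;
}
```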
It does not make x87 faster than SSE, so at this point it's mostly a CPU-history curiosity.
Apparently DirectX does (used to?) actually set x87 precision to 24-bit mantissa (`float`), and MSVC's CRT startup used to set it to 53-bit mantissa (`double`). See Bruce Dawson's https://randomascii.wordpress.com/2012/03/21/intermediate-floating-point-precision/. But Microsoft's historical weirdness is the exception; other toolchains / OSes don't mess around with x87.
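On the MSVC side, the documented way to do this is `_controlfp_s` from `<float.h>`; the sketch below (the helper name is mine) is roughly what the old CRT-startup / DirectX behaviour amounted to. Note that 64-bit MSVC doesn't support changing the precision-control field, since x64 code uses SSE2 anyway.

```c
#include <float.h>   /* MSVC-specific: _controlfp_s, _PC_24 / _PC_53 / _PC_64, _MCW_PC */
#include <stdio.h>

/* Drop x87 to 53-bit mantissa, like MSVC's CRT startup used to do.
 * _PC_24 would be the DirectX-style 24-bit setting. */
static void drop_x87_to_double_precision(void)
{
    unsigned int current;
    _controlfp_s(&current, _PC_53, _MCW_PC);   /* change only the precision field */
}

int main(void)
{
    drop_x87_to_double_precision();
    printf("x87 precision set to 53-bit mantissa\n");
    return 0;
}
```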
Agner Fog's instruction tables don't mention x87 precision for Sandybridge or newer CPUs. This might mean it no longer helps, or (I think) just that Agner decided it wasn't worth mentioning: his SnB-and-newer tables don't have any footnotes at all, so I think that's the explanation. SnB's divider isn't very different from NHM's, as far as I know.
For Nehalem:

- `fdiv`: 7-27 cycles latency = throughput (not pipelined at all), with a footnote that says "Round divisors or low precision give low values."
- `divsd`/`divpd`: 7-22 cycles latency = throughput.
- `divss`/`divps`: 7-14 cycles latency = throughput.
So the best-case performance (divider occupied for 7 cycles) is the same for all forms, with the worst case getting worse the more mantissa bits the result can have.
We know that divider HW is iterative and has to keep going longer to compute more bits, so it's 100% plausible that setting x87 precision to 24-bit or 53-bit helps performance in exactly the same way that using `divss` does. They share the same hardware execution unit anyway.
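If you want to see the "round divisors give low values" effect for yourself, a crude timing loop like the one below should do it on GCC/Clang, where `long double` math still goes through x87 `fdiv`. It's a sketch, not a rigorous benchmark, and `time_fdiv_chain` is a made-up name; the point is just that a loop-carried chain of divisions by 2.0 should finish noticeably sooner than one dividing by an "ugly" value.

```c
#include <stdio.h>
#include <time.h>

static double time_fdiv_chain(long double divisor)
{
    volatile long double d = divisor;   /* stop x/2.0 being turned into x*0.5 */
    volatile long double x = 1e6L;
    clock_t t0 = clock();
    for (int i = 0; i < 100 * 1000 * 1000; i++)
        x = x / d + 1e6L;               /* loop-carried dependency through fdiv */
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    printf("divide by 2.0:          %.3f s\n", time_fdiv_chain(2.0L));
    printf("divide by 1.0123456789: %.3f s\n", time_fdiv_chain(1.0123456789L));
    return 0;
}
```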
IvyBridge did finally pipeline the FP divider. Haswell didn't make any major changes vs. IvB to div numbers. These are the HSW numbers:
- `fdiv`: 10-24c latency, 8-18c throughput
- `divsd` / `divpd xmm`: 10-20c latency, 8-14c throughput
- `divss` / `divps xmm`: 10-13c latency, 7c throughput (fixed latency is nice for the scheduler)
See also Floating point division vs floating point multiplication, where I collected Agner Fog's data for recent Intel CPUs, including 256-bit YMM vectors. I left out x87 there because it's basically irrelevant for high performance.
Normally you'd just use SSE1 because it's generally faster (no front-end bandwidth spent on `fxch` and `fld` register copies, thanks to a flat register set and 2-operand instructions instead of a stack). And the opportunity to use SIMD for some cases (typically 4x `float` sqrt results in the same time as 1) makes it a huge win vs. setting the x87 FPU to 32-bit.
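As a minimal illustration of that SIMD win, using the standard SSE1 intrinsics from `<xmmintrin.h>` (one `sqrtps` computes four `float` square roots for roughly the cost of one scalar sqrt):

```c
#include <xmmintrin.h>   /* SSE1 intrinsics */
#include <stdio.h>

int main(void)
{
    float in[4]  = { 2.0f, 3.0f, 5.0f, 7.0f };
    float out[4];

    __m128 v = _mm_loadu_ps(in);   /* load 4 floats */
    v = _mm_sqrt_ps(v);            /* 4 square roots in one instruction */
    _mm_storeu_ps(out, v);

    for (int i = 0; i < 4; i++)
        printf("sqrt(%g) = %g\n", in[i], out[i]);
    return 0;
}
```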
Most SSE math instructions have similar throughput and latency to their x87 counterparts, but x87 has more overhead.
If you need to make a 32-bit binary that's compatible with ancient CPUs without even SSE1, then yes, you could reduce the x87 precision to 24-bit if `fdiv` and `fsqrt` performance is important for your code. (It might possibly also speed up some of the microcoded x87 instructions like `fsin` and `fyl2x`, IDK.)
Or if reducing precision to `float` is too drastic, then you're looking at SSE2 for `double` math in XMM regs. It's baseline for x86-64, so again it's only worth thinking about if for some reason you have to make a 32-bit binary. The newest CPU without it is Athlon XP (if you don't count stuff like current Geode).
> Same question for the FPU rounding mode, and the system side: can an API also spoil my setting of it?
AFAIK, nothing will ever leave the rounding mode changed. That would be a big difference, and doesn't help performance.
If anyone had ever been able to justify doing so, someone would have done it for the performance of C that uses `(int)float` without SSE convert-with-truncation instructions (or SSE3 `fisttp` for an x87 version), to avoid having to set the x87 rounding mode to truncation (toward 0) and then restore it every time an FP value is converted to an integer.
Most compilers assume round-to-nearest when optimizing.
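To show which conversions actually care about the rounding mode, here's a tiny sketch: the C cast must truncate toward zero (which is what forced the x87 mode-flipping dance before SSE / SSE3 `fisttp`), while C99's `lrintf` uses the current rounding mode and so never needs to change it.

```c
#include <math.h>    /* lrintf; link with -lm on glibc */
#include <stdio.h>

int main(void)
{
    float x = 2.7f;
    int  truncated = (int)x;     /* must be 2: truncation toward zero is required,
                                    regardless of the current rounding mode */
    long nearest   = lrintf(x);  /* 3 under the default round-to-nearest mode */
    printf("(int)%g = %d, lrintf(%g) = %ld\n", x, truncated, x, nearest);
    return 0;
}
```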