
Is there any way to get correct rounding with the i387 fsqrt instruction?...

...aside from changing the precision mode in the x87 control word - I know that's possible, but it's not a reasonable solution because it has nasty reentrancy-type issues where the precision mode will be wrong if the sqrt operation is interrupted.

The issue I'm dealing with is as follows: the x87 fsqrt opcode performs a correctly-rounded (per IEEE 754) square root operation in the precision of the fpu registers, which I'll assume is extended (80-bit) precision. However, I want to use it to implement efficient single and double precision square root functions with the results correctly rounded (per the current rounding mode). Since the result has excess precision, the second step of converting the result to single or double precision rounds again, possibly leaving a not-correctly-rounded result.

With some operations it's possible to work around this with biases. For instance, I can avoid excess precision in the results of addition by adding a bias in the form of a power of two that forces the 52 significant bits of a double precision value into the last 52 bits of the 63-bit extended-precision mantissa. But I don't see any obvious way to do such a trick with square root.
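The bias trick described above can be sketched in C. This is a minimal, hypothetical illustration using the classic power-of-two trick in its double-precision form (rounding to an integer by adding and subtracting 2^52); the function name is mine, and `volatile` is used only to keep a compiler from folding the arithmetic away:

```c
#include <assert.h>

/* Round x (|x| small relative to 2^52) to an integer-valued double
 * by adding and subtracting a power-of-two bias: once x is added to
 * 2^52, the double format has no significand bits left below the
 * units place, so the addition itself performs the rounding in the
 * current rounding mode; the subtraction is then exact. */
static double round_via_bias(double x)
{
    volatile double bias = 0x1.0p52; /* 2^52 */
    volatile double t = x + bias;    /* rounding happens here */
    return t - bias;                 /* exact */
}
```

The question is that no analogous bias exists for square root: there is no pre- or post-scaling that turns one extended-precision `fsqrt` into a correctly rounded double-precision result.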

Any clever ideas?

(Also tagged C because the intended application is implementation of the C sqrt and sqrtf functions.)

R.. GitHub STOP HELPING ICE
  • Out of curiosity: Is there any reason you can't use SSE2 math here? –  Mar 13 '12 at 04:20
  • Because the target is all x86 machines, not post-Pentium-2 or whatever. – R.. GitHub STOP HELPING ICE Mar 13 '12 at 04:24
  • Doesn't storing it to 4- or 8-byte memory do the rounding? Or is that too much overhead? – Mysticial Mar 13 '12 at 08:31
  • That performs a *second* rounding step. Suppose I ask you to round 1.49 to an integer. Rounding it as one step yields 1. First rounding it to one place after the decimal point yields 1.5, then rounding it to an integer yields 2. Similarly, `fsqrt` performs one rounding (since the exact value of square root is almost never representable) and converting it from 80-bit extended precision to the right type performs another rounding. – R.. GitHub STOP HELPING ICE Mar 13 '12 at 15:11
  • Oh, I see what you mean. I'm tempted to think that the mathematical properties of the square root will prohibit such edge cases from ever occurring. But that's a little bit outside my area of expertise. – Mysticial Mar 13 '12 at 15:43
  • It's not often that it happens, but there are sufficiently many doubles that it happens now and then. The exact criterion comes out to be something like bits 52-64 of the exact result looking like 101000...00, followed by a nonzero tail somewhere past the end of the 64-bit extended precision mantissa. One could perhaps work backwards and enumerate the cases, but I think it's way too many to special-case them. – R.. GitHub STOP HELPING ICE Mar 13 '12 at 15:54

3 Answers


First, let's get the obvious out of the way: you should be using SSE instead of x87. The SSE sqrtss and sqrtsd instructions do exactly what you want, are supported on all modern x86 systems, and are significantly faster as well.

Now, if you insist on using x87, I'll start with the good news: you don't need to do anything for float. You need 2p + 2 bits to compute a correctly rounded square root in a p-bit floating-point format. The extended format carries a 64-bit significand, and because 64 > 2*24 + 2, the additional rounding to single precision will always round correctly, and you have a correctly rounded square root.

Now the bad news: 64 < 2*53 + 2, so no such luck for double precision. I can suggest several workarounds; here's a nice easy one off the top of my head.

  1. let y = round_to_double(x87_square_root(x));
  2. use a Dekker (head-tail) product to compute a and b such that y*y = a + b exactly.
  3. compute the residual r = x - a - b.
  4. if (r == 0) return y
  5. if (r > 0), let y1 = y + 1 ulp, and compute a1, b1 s.t. y1*y1 = a1 + b1. Compare r1 = x - a1 - b1 to r, and return either y or y1, depending on which has the smaller residual (or the one with zero low-order bit, if the residuals are equal in magnitude).
  6. if (r < 0), do the same thing for y1 = y - 1 ulp.

This procedure only handles the default rounding mode; however, in the directed rounding modes, simply rounding to the destination format does the right thing.
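Steps 2–3 can be sketched with Dekker's algorithms. A minimal illustration (the helper names are mine, and it assumes ordinary double-precision arithmetic without an FMA):

```c
#include <assert.h>

/* Veltkamp split: x == hi + lo, with hi holding the top 26
 * significand bits, so products of the parts are exact. */
static void split(double x, double *hi, double *lo)
{
    double c = 134217729.0 * x; /* 2^27 + 1 */
    *hi = c - (c - x);
    *lo = x - *hi;
}

/* Dekker product: a + b == y*y exactly (barring over/underflow). */
static void two_prod(double y, double *a, double *b)
{
    double hi, lo;
    split(y, &hi, &lo);
    *a = y * y;
    *b = ((hi * hi - *a) + 2.0 * hi * lo) + lo * lo;
}

/* Residual x - y*y. When y is within an ulp or so of sqrt(x),
 * x - a is exact by Sterbenz's lemma, so the sign of the result
 * tells us on which side of sqrt(x) the candidate y lies. */
static double residual(double x, double y)
{
    double a, b;
    two_prod(y, &a, &b);
    return (x - a) - b;
}
```

With this in hand, step 4 is `residual(x, y) == 0`, and steps 5–6 compare `residual(x, y)` against the residual of the neighboring ulp.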

Stephen Canon
  • +1 The only time double-rounding can go bad is if the first rounding is up. Forced truncation will get rid of this problem. – Mysticial Mar 13 '12 at 18:26
  • @Mysticial: it's not true that double-rounding is only a problem if the first rounding is up. Consider a value of the form `...0 100...00 0...1` (where the spaces denote the round points). If we round directly to the first round point, we round up to `...1`. However, if we first round to the second rounding point, we round down to `...0 100...00`; rounding again at the first round point rounds down to `...0`. – Stephen Canon Mar 13 '12 at 18:33
  • @Mysticial: these things are rather subtle, in fairness. – Stephen Canon Mar 13 '12 at 18:41
  • Indeed, with half-up rounding it's a lot simpler. Unfortunately in the real world we need round-to-even to avoid ugly biases creeping in. – R.. GitHub STOP HELPING ICE Mar 13 '12 at 18:57
  • @StephenCanon: It looks like your algorithm has an obvious opportunity to stop at step 1 if bits 53-64 uniquely determine the rounding no matter what the subsequent lost bits were. Only corner cases should require the extra work. – R.. GitHub STOP HELPING ICE Mar 13 '12 at 19:07
  • And.. do you have a reference for step 2? "Dekker product" turns up no relevant hits. – R.. GitHub STOP HELPING ICE Mar 13 '12 at 19:08
  • @R..: of course, you can early out. I was trying to keep it as simple as possible =). The "Dekker product" is a trick due to T.J. Dekker for computing a+b = x*y exactly in floating point. It was published in "A floating point technique for extending the available precision" (1971), but you can find the algorithm detailed in basically any text on floating-point implementation. I know that it's in the crlibm manual, as well, which you can easily download. – Stephen Canon Mar 13 '12 at 19:13
  • (The Dekker product is called "Mul22" in crlibm, by the way) – Stephen Canon Mar 13 '12 at 19:14
  • Actually I think it's Mul12Cond, but thanks for the reference; I was able to find it with that. Having reviewed the algorithm casually, this seems to fully answer the question, so accepted! – R.. GitHub STOP HELPING ICE Mar 13 '12 at 19:34
  • Am I correct in my reasoning that that only time we need to do anything past step 1 is when the final 11 bits of the extended precision result are `10000000000`? – R.. GitHub STOP HELPING ICE Mar 14 '12 at 23:58
  • @Stephen: I think this algorithm can be improved a lot if you work with the not-rounded-to-double, extended-precision result as `y` rather than the rounded value. It seems to me you just need to determine whether `y` is less than or greater than the actual value of `sqrt(x)`, and then (respectively) increment or decrement it by 1ulp (extended precision) before rounding to double precision. (Note that this is assuming you're in the case where the extended precision value of `y` ends in `10000000000`.) – R.. GitHub STOP HELPING ICE Mar 15 '12 at 01:41
  • @Stephen: And with that in mind, the i387 status word's C1 condition is 1 if the inexact result was rounded up, and 0 if it was rounded down. – R.. GitHub STOP HELPING ICE Mar 15 '12 at 03:11
  • @R..: Yes, there are *lots* of ways to improve it (none of them nearly as good as simply using `sqrtsd`). I wanted to give one that was easy to explain numerically and didn't depend on obscure processor features or lots of bit-pattern manipulation--not what I personally would do if implementing a library. You certainly can do it exactly as you describe, though I would note that you should actually measure the performance of branching on C1; it's precisely the sort of thing that may have once been fast but now introduce some architectural hazard. – Stephen Canon Mar 15 '12 at 13:06

OK, I think I have a better solution:

  1. Compute y=sqrt(x) in extended precision (fsqrt).
  2. If last 11 bits are not 0x400, simply convert to double precision and return.
  3. Add 0x100-(fpu_status_word&0x200) to the low word of the extended precision representation.
  4. Convert to double precision and return.

Step 3 is based on the fact that the C1 bit (0x200) of the status word is 1 if and only if fsqrt's result was rounded up. This is valid because, due to the test in step 2, x was not a perfect square; if it were a perfect square, y would have no bits beyond double precision.
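The adjustment in step 3 can be sanity-checked in plain C (a hypothetical sketch; `sw` stands for the x87 status word as read by `fstsw`, and the function name is mine):

```c
#include <assert.h>

/* Given the x87 status word after fsqrt, compute the value to add to
 * the low 32 bits of the 80-bit result: +0x100 if fsqrt rounded down
 * (C1 == 0), -0x100 if it rounded up (C1 == 1). Either way the
 * trailing ...10000000000 bit pattern is broken, so the final
 * conversion to double rounds toward the exact answer. */
static int adjustment(unsigned sw)
{
    return 0x100 - (int)(sw & 0x200);
}
```

For example, starting from low bits 0x400, a rounded-up result is nudged down to 0x300 and a rounded-down result up to 0x500, matching the `sub %eax,(%esp)` in the code below.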

It may be faster to perform step 3 with a conditional floating point operation rather than working on the bit representation and reloading.

Here's the code (seems to work in all cases):

sqrt:
    fldl 4(%esp)        # load the double argument
    fsqrt               # correctly-rounded sqrt in extended precision
    fstsw %ax           # save status word; C1 (0x200) set iff fsqrt rounded up
    sub $12,%esp        # room for an 80-bit temporary
    fld %st(0)
    fstpt (%esp)        # store the extended-precision result
    mov (%esp),%ecx
    and $0x7ff,%ecx     # low 11 bits of the significand
    cmp $0x400,%ecx     # the dangerous 10000000000 pattern?
    jnz 1f              # no: a single final conversion rounds correctly
    and $0x200,%eax     # isolate C1
    sub $0x100,%eax     # eax = rounded-up ? 0x100 : -0x100
    sub %eax,(%esp)     # nudge the significand toward the exact result
    fstp %st(0)
    fldt (%esp)         # reload the adjusted value
1:  add $12,%esp
    fstpl 4(%esp)       # final (single) rounding to double precision
    fldl 4(%esp)
    ret
R.. GitHub STOP HELPING ICE
  • This approach seems sound, on cursory inspection (and is certainly closer to what I would choose in implementing a library myself). You might run it on Jerome Coonen's test vectors for an additional data point. Why `fld + fstp` instead of `fst`? – Stephen Canon Mar 15 '12 at 13:13
  • As far as I know, there is no extended-precision version of `fst`, only `fstp`. – R.. GitHub STOP HELPING ICE Mar 15 '12 at 16:54

It may not be what you want, as it doesn't take advantage of the 387 fsqrt instruction, but there's a surprisingly efficient sqrtf(float) in glibc implemented with 32-bit integer arithmetic. It even handles NaNs, infinities, and subnormals correctly; it might be possible to eliminate some of these checks with real x87 instructions / FP control word flags. See: glibc-2.14/sysdeps/ieee754/flt-32/e_sqrtf.c

The dbl-64/e_sqrt.c code is not so friendly; it's hard to tell at a glance what assumptions are being made. Curiously, the library's i386 sqrt[f|l] implementations all just call fsqrt, differing only in how they load the value: flds for single precision, fldl for double.

Brett Hale
  • I'll look at the integer code. I wonder if there's a similar approach for double precision.. – R.. GitHub STOP HELPING ICE Mar 13 '12 at 17:55
  • @R.., I suspected it might be. What are the error bounds for an IEEE-754 sqrt? Is it 1/2(ulp)? Does the rounding mode affect the internal calculation, or just returned value? – Brett Hale Mar 13 '12 at 18:10
  • The result is just supposed to be correctly rounded in the current rounding mode. Transcendental functions are less strict; they need not be correctly rounded as long as the returned result is correct within 1ulp. – R.. GitHub STOP HELPING ICE Mar 13 '12 at 18:47