
I am trying to compute the IEEE-754 32-bit floating-point square root of various inputs, but for one particular input the algorithm below, based on the Newton-Raphson method, won't converge. What can I do to fix the problem? For the platform I am designing, I have a 32-bit floating-point adder/subtracter, multiplier, and divider.

For input 0x7F7FFFFF (3.4028234663852886E38), the algorithm won't converge to the correct answer of 18446743523953729536.000000; this algorithm's answer is 18446743523953737728.000000.

I am using MATLAB to prototype my code before I implement this in hardware. I can only use single-precision floating-point values (SO NO DOUBLES).

clc; clear; close all;

% Input
R = typecast(uint32(hex2dec('7F7FFFFF')),'single')

% Initial estimate
OneOverRoot2 = single(1/sqrt(2));
Root2 = single(sqrt(2));

% Get low and high bits of input R
hexdata_high = bitand(bitshift(hex2dec(num2hex(single(R))),-16),hex2dec('ffff'));
hexdata_low = bitand(hex2dec(num2hex(single(R))),hex2dec('ffff'));

% Change exponent of input to -1 to get Mantissa
temp = bitand(hexdata_high,hex2dec('807F'));
Expo = bitshift(bitand(hexdata_high,hex2dec('7F80')),-7);
hexdata_high = bitor(temp,hex2dec('3F00'));
b = typecast(uint32(bitshift(hexdata_high,16) + hexdata_low),'single');

% If exponent is odd ...
if (bitand(Expo,1))
    % Pretend the mantissa [0.5 ... 1.0) is multiplied by 2 as Expo is odd,
    %   so it now has the value [1.0 ... 2.0)
    % Estimate the sqrt(mantissa) as [1.0 ... sqrt(2))
    % IOW: linearly map (0.5 ... 1.0) to (1.0 ... sqrt(2))
    Mantissa = (Root2 - 1.0)/(1.0 - 0.5)*(b - 0.5) + 1.0;
else
    % The mantissa is in range [0.5 ... 1.0)
    % Estimate the sqrt(mantissa) as [1/sqrt(2) ... 1.0)
    % IOW: linearly map (0.5 ... 1.0) to (1/sqrt(2) ... 1.0)
    Mantissa = (1.0 - OneOverRoot2)/(1.0 - 0.5)*(b - 0.5) + OneOverRoot2;
end

newS = Mantissa*2^(bitshift(Expo-127,-1));
S=newS

% S = (S + R/S)/2 method
for j = 1:6 
    fprintf('S  %u %f %f\n', j, S, (S-sqrt(R)));
    S = single((single(S) + single(single(R)/single(S))))/2;
    S = single(S);
end

goodaccuracy =  (abs((single(S)-single(sqrt(single(R)))))) < 2^-23
difference = (abs((single(S)-single(sqrt(single(R))))))

% Get hexadecimal output
hexdata_high = (bitand(bitshift(hex2dec(num2hex(single(S))),-16),hex2dec('ffff')));
hexdata_low = (bitand(hex2dec(num2hex(single(S))),hex2dec('ffff')));
fprintf('FLOAT: T  Input: %e\t\tCorrect: %e\t\tMy answer: %e\n', R, sqrt(R), S);
fprintf('output hex = 0x%04X%04X\n',hexdata_high,hexdata_low);
out = hex2dec(num2hex(single(S)));
  • Neither 18446743523953729536 nor 18446743523953737728 is representable in single precision. The closest representable values are 18446742974197923840 and 18446744073709551616, respectively. So you will never get the mathematically correct result, and I do not see how you got the 18446743523953737728 result unless you are printing a double-precision value. – Eric Postpischil Jul 10 '13 at 22:06
  • Okay, the correct result is 0x5F7FFFFF. This algorithm converges to 0x5F800000. Do you understand? – Veridian Jul 10 '13 at 23:59
  • Those differ by one ULP. In general, an algorithm composed of multiple floating-point operations is not expected to return an exactly correct result. Even though the mathematical algorithm converges, the computational algorithm has multiple operations in the last iteration, so there is typically more error than required to guarantee a result rounded to the representable value nearest the exact result. Do you have reason for expecting the nearest representable value in this particular case? – Eric Postpischil Jul 11 '13 at 00:13
  • Well I am using an algorithm that I want to be able to guarantee converges to the closest correct value so that the errors don't blow up in later computations. – Veridian Jul 11 '13 at 00:25
  • Can you use double precision for the intermediate work, as long as the final result is returned in single precision? – Eric Postpischil Jul 11 '13 at 00:32
  • Note that obtaining the nearest representable value merely reduces the error; it does not eliminate it, so that, by itself, does not guarantee that errors will not “blow up” in later computations. – Eric Postpischil Jul 11 '13 at 00:33
  • @EricPostpischil, I cannot use double precision, my platform only has single precision floating point hardware and other 'general' instructions you would find in a MIPS RISC type ISA, so I can only use that. I understand that it will reduce the error, I need to be able to say that my algorithm correctly produces the square root with error less than 2^-23 – Veridian Jul 11 '13 at 00:35
  • Do you have a fused multiply-add? Can you compute (r/s+s)/2 as s+(r-s*s)/s, where the numerator is computed using a fused multiply-add? – tmyklebu Jul 11 '13 at 02:05
  • The Java Math.sqrt() method does round this correctly - the double result is 0x43efffffeffffffc, which rounded down to float is 0x5f7fffff. The API documentation calls for correct rounding of the sqrt results. See the answers to http://stackoverflow.com/questions/825221/where-can-i-find-the-source-code-for-javas-square-root-function for the reference implementation. The code will need modification for 32-bit rather than 64-bit. – Patricia Shanahan Jul 11 '13 at 14:06
  • @tmyklebu, I do not have a FMA – Veridian Jul 11 '13 at 15:03
  • @PatriciaShanahan, Thank you for the recommendation, I'll take a look at it. – Veridian Jul 11 '13 at 15:04
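
For reference, a minimal sketch of the fused multiply-add variant suggested in tmyklebu's comment above (hypothetical on the asker's platform, which has no FMA; fmaf is the C99 fused multiply-add from <math.h>, and the halving of the correction from the standard Newton step is written out explicitly):

#include <math.h>
#include <stdio.h>

// One Newton step s_{n+1} = (s + r/s)/2, rewritten as s + (r - s*s)/(2*s) so
// that the residual r - s*s is formed with a single fused multiply-add.
static float newton_step_fma(float s, float r) {
  float resid = fmaf(-s, s, r);  // r - s*s, computed with one rounding
  return s + resid / s / 2;
}

int main(void) {
  float r = 1.5f, s = 1.2f;          // rough starting guess for sqrt(1.5)
  for (int i = 0; i < 3; i++) s = newton_step_fma(s, r);
  printf("sqrt(1.5) ~ %.9g\n", s);
  return 0;
}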

1 Answer


I took a whack at this. Here's what I came up with:

#include <math.h>  // frexpf, ldexpf

float mysqrtf(float f) {
  if (f < 0) return 0.0f/0.0f;
  if (f == 1.0f / 0.0f) return f;
  if (f != f) return f;

  // half-ass an initial guess of 1.0.
  int expo;
  float foo = frexpf(f, &expo);
  float s = 1.0;
  if (expo & 1) foo *= 2, expo--;

  // this is the only case for which what's below fails.
  if (foo == 0x0.ffffffp+0) return ldexpf(0x0.ffffffp+0, expo/2);

  // do four newton iterations.
  for (int i = 0; i < 4; i++) {
    float diff = s*s-foo;
    diff /= s;
    s -= diff/2;
  }

  // do one last newton iteration, computing s*s-foo exactly.
  float scal = s >= 1 ? 4096 : 2048;
  float shi = (s + scal) - scal; // high 12 bits of significand
  float slo = s - shi; // rest of significand
  float diff = shi * shi - foo; // subtraction exact by Sterbenz's theorem
  diff += 2 * shi * slo; // opposite signs; exact by Sterbenz's theorem
  diff += slo * slo;
  diff /= s; // diff == fma(s, s, -foo) / s.
  s -= diff/2;

  return ldexpf(s, expo/2);
}

The first thing to analyse is the formula (s*s-foo)/s in floating-point arithmetic. If s is a sufficiently good approximation to sqrt(foo), Sterbenz's theorem tells us that the numerator is within an ulp(foo) of the right answer --- all of that error is approximation error from computing s*s. Then we divide by s; this gives us at worst another half-ulp of approximation error. So, even without a fused multiply-add, diff is within 1.5 ulp of what it should be. And we divide it by two.

Notice that the initial guess doesn't in and of itself matter as long as you follow it up with enough Newton iterations.

Measure the error of an approximation s to sqrt(foo) by abs(s - foo/s). The error of my initial guess of 1 is at most 1. A Newton iteration in exact arithmetic squares the error and divides it by 4. A Newton iteration in floating-point arithmetic --- the kind I do four times --- squares the error, divides it by 4, and kicks in another 0.75 ulp of error. You do this four times and you find you have a relative error at most 0x0.000000C4018384, which is about 0.77 ulp. This means that four Newton iterations yield a faithfully-rounded result.
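
Tabulating that recurrence directly reproduces the bound above; a minimal sketch (assuming "ulp" here means 2^-24, the spacing of significands in [0.5, 1)):

#include <stdio.h>

int main(void) {
  // Error recurrence from the text: each floating-point Newton step squares
  // the error, divides it by 4, and adds up to 0.75 ulp of rounding error.
  double ulp = 1.0 / (1 << 24);   // assumed ulp for significands in [0.5, 1)
  double e = 1.0;                 // error bound of the initial guess s = 1
  for (int i = 1; i <= 4; i++) {
    e = e * e / 4 + 0.75 * ulp;
    printf("after step %d: error <= %.9g (%.2f ulp)\n", i, e, e / ulp);
  }
  return 0;
}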

I do a fifth Newton step to get a correctly-rounded square root. The reason why it works is a little more intricate.

shi holds the "top half" of s while slo holds the "bottom half." The last 12 bits in each significand will be zero. This means, in particular, that shi * shi and shi * slo and slo * slo are exactly representable as floats.

s*s is within two ulps of foo. shi*shi is within 2047 ulps of s*s. Thus shi * shi - foo is within 2049 ulps of zero; in particular, it's exactly representable and less than 2^-10.

You can check that you can add 2 * shi * slo and get an exactly-representable result that's within 2^-22 of zero and then add slo*slo and get an exactly representable result --- s*s-foo computed exactly.
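
As a sanity check of that exactness claim (not part of the argument above), here is a hypothetical spot check for one sample value, comparing the split-based residual against fmaf(s, s, -foo); with a conforming fmaf the two printed lines should be identical:

#include <math.h>
#include <stdio.h>

int main(void) {
  float foo = 1.5f;                // sample mantissa in [0.5, 2)
  float s = sqrtf(foo);            // a good approximation to sqrt(foo)
  float scal = s >= 1 ? 4096 : 2048;
  float shi = (s + scal) - scal;   // top ~12 bits of s's significand
  float slo = s - shi;             // remaining bits of s
  float diff = shi * shi - foo;
  diff += 2 * shi * slo;
  diff += slo * slo;
  printf("split residual: %a\n", diff);
  printf("fmaf residual:  %a\n", fmaf(s, s, -foo));
  return 0;
}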

When you divide by s, you kick in an additional half-ulp of error, which is at most 2^-48 here since our error was already so small.

Now we do a Newton step. We've computed the current error correctly to within 2^-46. Adding half of it to s gives us the square root to within 3*2^-48.

To turn this into a guarantee of correct rounding, we need to prove that there are no floats between 1/2 and 2, other than the one I special-cased, whose square roots are within 3*2^-48 of a midpoint between two consecutive floats. You can do some error analysis, get a Diophantine equation, find all of the solutions of that Diophantine equation, find which inputs they correspond to, and work out what the algorithm does on those. (If you do this, there is one "physical" solution and a bunch of "unphysical" solutions. The one real solution is the only thing I special-cased.) There may be a cleaner way, however.
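
Short of working through the Diophantine equations, there are only 2^24 significands in [0.5, 2), so the claim can also be checked exhaustively. A minimal test sketch, assuming the mysqrtf above is compiled in, and that casting the platform's correctly rounded double sqrt to float yields the correctly rounded single-precision result (it does, since double carries more than twice float's precision):

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

float mysqrtf(float f);  // the function above

int main(void) {
  uint32_t mismatches = 0;
  // all floats in [0.5, 2): bit patterns 0x3F000000 .. 0x3FFFFFFF
  for (uint32_t bits = 0x3F000000u; bits <= 0x3FFFFFFFu; bits++) {
    float x;
    memcpy(&x, &bits, sizeof x);
    float want = (float)sqrt((double)x);  // reference, assumed correctly rounded
    float got = mysqrtf(x);
    if (got != want) {
      if (mismatches < 10)
        printf("x=%a: got %a, want %a\n", x, got, want);
      mismatches++;
    }
  }
  printf("%u mismatches out of 2^24 inputs\n", mismatches);
  return 0;
}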

  • Thank you for your help, I'm examining your code now. I'm eager to hear from you later. much appreciated! – Veridian Jul 11 '13 at 20:22
  • It appears that I only needed the if (foo == 0x0.ffffffp+0) return ldexpf(0x0.ffffffp+0, expo/2); part to make my code work. It seems my algorithm is more efficient? I need only 3 Newton iterations. Also, I don't need the extra Newton iteration you perform. I would like your confirmation that you see this as well, though. – Veridian Jul 11 '13 at 21:26
  • @starbox: I don't get the same results as you with the `s = (s+foo/s)/2` update. I actually see tons and tons of problems when I run it. Using an initial guess that doesn't suck will reduce the number of necessary Newton iterations, yes. I didn't do it because I'm lazy, not because it's useless. – tmyklebu Jul 11 '13 at 22:01
  • Hmm, my original algorithm is having problems with the precision. I thought MATLAB was giving me correct results. After I split my operations out farther with the single() command, it turns out I'm off by a bit; for instance, with input 0x60b8f877 I'm getting 0x5019deee instead of 0x5019deef. Will your algorithm work for this value? – Veridian Jul 13 '13 at 00:34
  • @starbox: Newton's method is folklore. As for the Diophantine equations, you can work it out directly. Write down "`x` is a `float` that makes my algorithm fail" mathematically. Then you can reason about what `x` must be and how close to a midpoint between two floats `sqrt(x)` must be. You'll get a handful of Diophantine equations modulo 2^24 that tell you what the biggest integer less than `sqrt(x)` will be. – tmyklebu Jul 15 '13 at 05:44
  • @starbox: Is that perchance because it failed to handle 0 correctly? No other cases are mishandled. – tmyklebu Jul 15 '13 at 20:58
  • @starbox: Show a failing case --- the input and the expected output. – tmyklebu Jul 15 '13 at 21:19
  • Input: 3FE66666, correct value = 3FABBAE2, my result = 3FABBAE3. – Veridian Jul 15 '13 at 21:21
  • Works for me. I get 0x1.5775c4p+0 == 3fabbae2. Perchance you converted the code incorrectly. – tmyklebu Jul 15 '13 at 21:26
  • Never mind, I got it to work. Sorry about that. What is the formula for calculating the error based upon the number of Newton iterations? – Veridian Jul 15 '13 at 23:25