
I'm attempting to create a fixed-point square root function for a Xilinx FPGA (hence real types are out, and David Bishop's ieee_proposed library is also unsupported for XST synthesis).

I've settled on a Newton-Raphson method to calculate the reciprocal square root (as it involves fewer divisions: the iteration y = y*(1.5 - 0.5*x*y^2) is division-free, whereas iterating on the square root directly costs a divide every step).

One of the remaining dilemmas I have is how to generate the initial seed. I looked at the Fast Inverse Square Root, but it only appears to work for floating point arithmetic.

My best thought at the moment is to take the bit length of the input value (i.e. find the index of the most significant non-zero bit), halve it crudely, and use that as the power of two for the seed.

I wrote a short test script to quickly check the accuracy (it's in MATLAB, but that's just so I could plot a graph...):

x = 1:2^24;
gen_result = zeros(1,length(x));
seed_vals = zeros(1,length(x));
for i = 1:length(x)
   result = 2^-ceil(log2(x(i))/2); %effectively creates seed value from top bit index
   seed_vals(i) = 1/result; %Store seed value
   for j = 1:6
       result = result*(1.5-0.5*x(i)*result^2); %reciprocal root
   end
   gen_result(i) = 1/result; %single division at the end
end

And unsurprisingly, the seed becomes wildly inaccurate each time the input crosses into the next power of two, and the absolute error of the seed grows as the magnitude of the input increases. As a graph this can be seen as:

[Graph: computed square-root results for x = 1..2^24; the red line is the seed value, stepping up in powers of 2]

The red line is the value of the seed and, as can be seen, it increases in powers of 2.

My question is very simple: are there any other simple methods I could use to generate a seed value for fixed-point square roots in VHDL, ideally ones which don't cause ever-increasing inaccuracy (and hence require more iterations) as the input increases in size?

Any other incidental advice on how to approach finding fixed-point square roots in VHDL would be gratefully received!

davidhood2
  • You might want to add a few constraints, such as what fixed-point format you intend to use, how accurate the initial guess needs to be, and whether a lookup table (e.g. 8-bit) would be suitable. Your basic approach of first normalizing the input (via priority encoder) is fine; however, you would want to look at two successive input binades, e.g. [1,4), and determine the reciprocal square root across that. The output will then require a single binade, and you can scale (by simple shift, based on the priority encoder value) the intermediate result as appropriate to get the final result. – njuffa Mar 08 '16 at 00:25
  • Two approaches are 1) to store a lookup table indexed on high order bits and 2) to use a polynomial approximation. E.g. see http://www.matematicasvisuales.com/english/html/analysis/taylor/sqrt1plus0Taylor.html, but I think optimal approximations for the range [0..1] have been worked out. See also https://studentnet.cs.manchester.ac.uk/resources/library/thesis_abstracts/MSc12/FullText/Moise-Mircea-fulltext.pdf which advocates normalizing to range [.5..1] and using a linear approximation for initialization. – Gene Mar 08 '16 at 00:26
  • If you can read annotated x86 assembly language, [this answer](http://stackoverflow.com/a/15183580/780717) regarding a fixed-point implementation of reciprocal square root may help. Here is a [similar answer](http://stackoverflow.com/a/32337283/780717), using C code. – njuffa Mar 08 '16 at 00:27
  • take a look at [binary search SQRT without multiplication](http://stackoverflow.com/a/34657972/2521214), it might interest you – Spektre Mar 08 '16 at 13:09
  • @njuffa So, in layman's terms - are you saying: divide it down by powers of 2 so that it lies between 0 and 1, then use a LUT to get 8 bits of accuracy, run 2 rounds of Newton-Raphson, and scale back up again? (I'm relatively new to hardware) – davidhood2 Mar 08 '16 at 18:11
  • @davidhood2 You can normalize to any pair of binades you desire. One choice would be [0.25,1), which means you are dealing with a purely fractional fixed-point number after normalization (plus a scale factor later needed to shift the final result). You can use a LUT, but don't have to. You can also use a minimax polynomial approximation (as was already pointed out by Gene above). If you can't afford multiplies, look at implementing a bit-wise square root instead, as suggested by Spektre above. – njuffa Mar 08 '16 at 18:28
  • Also, if you think you would have any use for the Fast Inverse Square Root, check it out again. There are some float datatypes in it, but there is not any significant amount of actual floating-point math in it. It should be fairly straightforward to convert it to pure integer math. The most complicated thing that you would need to implement is the int<->float conversion, which is not very difficult. – Timmy Brolin Mar 08 '16 at 22:32
  • `the seed becomes wildly inaccurate each time a number increases in size` The relative error does not change. Use the length of `.7*x` for an easy improvement. – greybeard Feb 18 '18 at 14:38
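
To make the normalize-then-LUT suggestion from the comments concrete, here is a rough MATLAB sketch in the same style as the test script above (the 8-bit table width and all variable names are illustrative assumptions, not from any library): normalize x to m = x/2^e with e even so that m lies in [0.25,1), look up a seed for 1/sqrt(m) in a small table, run the same reciprocal-root iteration, and shift back by 2^(e/2) at the end.

LUT_BITS = 8;                        % assumed table size: indexed by the top 8 bits of m
idx = (0:2^LUT_BITS-1)';
m_mid = (idx + 0.5) / 2^LUT_BITS;    % midpoint of each table bin in [0,1)
lut = 1 ./ sqrt(max(m_mid, 0.25));   % seeds for 1/sqrt(m); bins below 0.25 are never addressed

x = 9e6;
n = ceil(log2(x+1));                 % bit length of x, i.e. the priority encoder output
e = 2*ceil(n/2);                     % round the exponent up to an even value
m = x / 2^e;                         % normalized mantissa, m in [0.25,1)
y = lut(floor(m * 2^LUT_BITS) + 1);  % table lookup (+1 for MATLAB's 1-based indexing)
for j = 1:2                          % two iterations now reach roughly 1e-9 relative error
    y = y*(1.5 - 0.5*m*y^2);         % same reciprocal-root iteration as in the question
end
rsqrt_x = y / 2^(e/2);               % undo the normalization
sqrt_x = x * rsqrt_x;                % sqrt(x) = x*(1/sqrt(x)): a multiply, not a division

Because the table always sees m in [0.25,1), the seed's relative error is the same in every binade, so the iteration count no longer grows with the size of the input.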

1 Answer


I realize this is an old question, but I did end up here and this was kind of useful, so I want to add my bit.

Assuming your Xilinx chip has an embedded multiplier, you could consider this approach to help get a better starting seed. The basic premise is to convert the input integer to fixed point with all fraction bits, and then use the embedded multiplier to scale half of your initial seed value by 0.X (which, in hindsight, is probably what people mean when they say "normalize to the range [0.5..1)"). It's basically piecewise linear interpolation of your existing seed method. The steps below should translate relatively easily to RTL, as they're just bit-shifts, adds, and one unsigned multiply (a short MATLAB check of the arithmetic follows the steps).

1) Begin with your existing seed value (e.g. for x=9e6, you would generate s=4096 as the seed for your first guess with your "crude halving" method)

2) Right-shift the existing seed value by 1 to get the previous seed value (s_half = s >> 1 = 2048)

3) Left-shift the input until the most significant bit is a 1. In the event you are sqrting 32-bit ints, x_scale would then be 2304000000 = 0x89544000.

4) Slice the upper e.g. 18 bits off of x_scale and multiply by an 18-bit version of s_half (I suggest 18 because I happen to know some Xilinx chips have embedded 18x18 multipliers). For this case, x_scale(31 downto 14) = 140625 = 0x22551. At least, that's what the multiplier thinks; we're going to treat it as fixed point, so it's actually 0b0.100010010101010001 = 0.53644 rather than 140625.

The result of this multiplication will be s_scale = s_half * x_scale(31 downto 14) = 2048 * 140625 = 288000000, but this output is in 18.18 format (18 integer bits, 18 fraction bits). Take the upper 18 bits, and you get s_scale(35 downto 18) = 1098.

5) Add the upper 18 bits of s_scale to s_half to get your improved seed, in this case s_improved = 1098+2048 = 3146

Now you can do a few iterations of Newton-Raphson with this seed. For x = 9e6, your crude halving approach would give an initial seed of 4096, the fixed-point scaling outlined above gives you 3146, and the actual sqrt(9e6) is 3000. The improved seed lands between your power-of-two seed steps, and my napkin math suggests it saved about 3 iterations of Newton-Raphson.
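
To sanity-check the arithmetic, here is a direct MATLAB transcription of steps 1) to 5) (a sketch of my own; plain doubles stand in for the 32-bit input register and the 18x18 multiplier):

x = 9e6;
n = ceil(log2(x+1));            % bit length of x: 24
s = 2^ceil(n/2);                % 1) crude-halving seed: 4096
s_half = bitshift(s, -1);       % 2) previous seed value: 2048
x_scale = x * 2^(32 - n);       % 3) left-justify x in 32 bits: 2304000000 = 0x89544000
top18 = floor(x_scale / 2^14);  % 4) x_scale(31 downto 14): 140625, i.e. 0.53644 in 0.18 format
s_scale = s_half * top18;       %    18x18 multiply: 288000000, in 18.18 format
s_improved = floor(s_scale / 2^18) + s_half;  % 5) s_scale(35 downto 18) + s_half = 1098 + 2048 = 3146
fprintf('seed %d -> improved seed %d (sqrt(%g) = %g)\n', s, s_improved, x, sqrt(x));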

ajs410