How to get a square root for 32 bit input in one clock cycle only?

Question

I want to design a synthesizable module in Verilog which will take only one cycle in calculating square root of given input of 32 bit.

If you really want just `1T` of clock then the only way I can think of is `2^32 or 2^31 or 2^30 x 2 Byte` precomputed LUT table in some ROM size `2-8GB` depend on signed/unsigned type and if you want to ignore the LSB. Any approach to compute this I know of needs few iterations (approximation) or up to `16T` for binary search. If different `1T` approach exist then is probably based on PCA and need different LUT tables. (sorry for the multiple edits of this comment ...) — Spektre, Jan 07 '16 at 10:04
i have to use this module many times so that will increase the overhead if i used LUT table — Mayur, Jan 07 '16 at 13:26
You need to compromise between speed and used gate numbers ... your best bet is approximation polynomial or binary search ... but both need to use multiplication as core function so you need fast multiplication `16b x 16b = 32b` too, but that can be done in just few `T` too (2-3T if I imagine it correctly). — Spektre, Jan 07 '16 at 13:30
I'm voting to close this question as off-topic because it's not related to Verilog — EML, Jan 07 '16 at 14:40
one clock cycle for the whole calculation or via a pipeline being able to average one result per cycle? — old_timer, Jan 07 '16 at 14:53
@EML most of the questions here on SO are not related to Verilog. — Henry, Jan 07 '16 at 15:24
@Henry I think that is because here are most people programmers not chip developers. For **FPGA** / chip design questions I think there would be more luck in electrical engineering, **DSP** and **MCU** oriented sites instead of here. — Spektre, Jan 07 '16 at 18:00
Maybe you can think of a parallelized/pipelined implementation of https://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Binary_numeral_system_.28base_2.29 — , Feb 13 '16 at 16:57
If you don't want to sidestep _the cycle issue_ using a humongous self-timing or combinatorial circuit, table lookup seems the way to go. `input of 32 bit` doesn't much to specify number representation (Spektre seems to assume integer). Using linear approximation, the table(s) needed shouldn't be quite as large. — greybeard, May 01 '16 at 17:23

score 2 · Answer 1 · edited Jun 20 '20 at 09:12

[Edit1] repaired code

Recently found the results where off even if tests determine all was OK so I dig deeper and found out that I had a silly bug in my equation and due to name conflicts with my pgm environment the tests got false positives so I overlooked it before. Now it work in all cases as it should.

The best thing I can think of (except approximation or large LUT) is binary search without multiplication, here C++ code:

//---------------------------------------------------------------------------
WORD u32_sqrt(DWORD xx) // 16 T
    {
    DWORD x,m,a0,a1,i;
    const DWORD lut[16]=
        {
        //     m*m
        0x40000000,
        0x10000000,
        0x04000000,
        0x01000000,
        0x00400000,
        0x00100000,
        0x00040000,
        0x00010000,
        0x00004000,
        0x00001000,
        0x00000400,
        0x00000100,
        0x00000040,
        0x00000010,
        0x00000004,
        0x00000001,
        };
    for (x=0,a0=0,m=0x8000,i=0;m;m>>=1,i++)
        {
        a1=a0+lut[i]+(x<<(16-i));
        if (a1<=xx) { a0=a1; x|=m; }
        }
    return x;
    }
//---------------------------------------------------------------------------

Standard binary search sqrt(xx) is setting bits of x from MSB to LSB so that result of x*x <= xx. Luckily we can avoid the multiplication by simply rewrite the thing as incrementing multiplicant... in each iteration the older x*x result can be used like this:

x1 = x0+m
x1*x1 = (x0+m)*(x0+m) = (x0*x0) + (2*m*x0) + (m*m)

Where x0 is value of x from last iteration and x1 is actual value. The m is weight of actual processed bit. The (2*m) and (m*m) are constant and can be used as LUT and bit-shift so no need to multiply. Only addition is needed. Sadly the iteration is bound to sequential computation forbid paralelisation so the result is 16T at best.

In the code a0 represents last x*x and a1 represents actual iterated x*x

As you can see the sqrt is done in 16 x (BitShiftLeft,BitShiftRight,OR,Plus,Compare) where the bit shift and LUT can be hardwired.

If you got super fast gates for this in comparison to the rest you can multiply the input clock by 16 and use that as internal timing for SQRT module. Something similar to the old days when there was MC clock as Division of source CPU clock in old Intel CPU/MCUs ... This way you can get 1T timing (or multiple of it depends on the multiplication ratio).

So... you changed the tags to create a legitimate question, and then answered the question in C++, when the OP wanted Verilog? Is that kosher? Perhaps you should now remove the Verilog tag? And what about the OP's request for a single-cycle solution - you're saying that 16 cycles can be one cycle if they're fast enough? — EML, Jan 07 '16 at 17:10
@EML As mentioned this does not really answer in `1T` anyway the language tag means that Answer is preferred in **Verilog (VHDL)** language but that does not mean this is strictly bound to that language. I could also draw the circuit instead C++ code but too lazy for that. I changed the tags because those I choose I feel are more important then the language itself for this task and should attract the right people for this. The circuit part is done only when you know what and how exactly it is done. But of coarse if you rewrite this to VHDL it will be more suited and probably get more votes. — Spektre, Jan 07 '16 at 17:54
Verilog and VHDL are different languages, both HDLs but VHDL does not stand for Verilog HDL. — Morgan, Jan 07 '16 at 23:06
Morgan is right, it stands for **V**HSIC **HDL** (an acronym inside an acronym, fantastic!), which stands for **V**ery **H**igh **S**peed **I**ntegrated **C**ircuit **H**ardware **D**escription **L**anguage... which may very well be the worst acronym ever. — Russell, Jan 08 '16 at 13:30
My version with variable bit count in. 2 to 32 bits even numbers — Richard Keene, Jul 18 '22 at 23:59

greybeard · Answer 2 · 2018-02-13T17:07:59.173

There is conversion to a logarithm, halving, and converting back.
For an idea how to implement "combinatorial" log and antilog, see Michael Dunn's EDN article showing priority encoder, barrel shifter & lookup table, with three log variants in System Verilog for down-load.
(Priority encoder, barrel shifter & lookup table look promising for "one-step-Babylonian/Heron/Newton/-Raphson. But that would probably still need a 128K by 9 bits lookup table.)

While not featuring "verilog",
Tole Sutikno: "An Optimized Square Root Algorithm for Implementation in FPGA Hardware" shows a combinatorial implementation of a modified (binary) digit-by-digit algorithm.

`log` approach sounds promising (+1) but hard to say if it is faster than mine or not as there are still iterations + LUTs would be nice to try it out on HW and compare... — Spektre, Nov 24 '19 at 08:42

score 1 · Answer 3 · answered Nov 24 '19 at 10:39

1

In 2018, T. Bagala, A. Fibich, M. Hagara, P. Kubinec, O. Ondráček, V. Štofanik and R. Stojanović authored Single Clock Square Root Algorithm Based on Binomial Series and its FPGA Implementation.

Local oscillator runs at 50MHz [… For 16 bit input mantissa,] Values from [the hardware] experiment were the same as values from simulation […] Obtained delay averages were 892ps and 906ps respectively.

(No explanation about the discrepancy between 50MHz and .9ns or the quoted ps resolution and the use of a 10Gsps scope. If it was about 18 cycles (due to pipelining rather than looping?)/~900*ns*, interpretation of Single Clock Square Root… remains open - may be one result per cycle.)
The paper discloses next no details about the evaluation of the binomial series.
While the equations are presented in a general form, too, my guess is that the amount of hardware needed for a greater number of bits gets prohibitive quickly.

answered Nov 24 '19 at 10:39

greybeard

2,249
8
30
66

Just found your new post +1 maybe the `0.9ps` is just the propagation delay between input and output ... and 50MHz is just the clock they used even if the limit is much higher (or limited by noise and HF problems instead of gate speed) do you have any link to this ? maybe even in original language from the names I assume Czech ... would love to read it ... – Spektre Sep 30 '20 at 07:37
I find it at [semanticscholar](https://www.semanticscholar.org/paper/Single-clock-square-root-algorithm-based-on-series-Bagala-Fibich/aa01955a551e744cba2c89a22c0688b5a63c6f69#paper-header); no idea currently where I saw the contents - I may have paid a visit to my alma mater. – greybeard Sep 30 '20 at 11:24
I found out this earlier: https://ieeexplore.ieee.org/document/8406022 but have no access however my coleage has and he sended the paper back to me just now it looks much better than yours (from understandability point) looks like they are exploiting: `sqrt(1+x) = 1 + (x^1)/2 - (x^2)/8 + (x^3)/16 - 5*(x^4)/128 + ....` and the fact that already computed subresults digits are not changing ... – Spektre Sep 30 '20 at 11:24

score 0 · Answer 4 · answered Jan 07 '16 at 19:12

0

I got the code here it is

    module sqrt(
input[31:0]a,
output[15:0]out
    );
reg [31:0]temp;
reg[14:0]x;

always@(a)
begin
if(a<257)x=4;
if(a>256 && a<65537)x=80;
if(a>65536 && a<16777217)x=1000;
if(a>16777216 && a<=4294967295)x=20000;
temp=(x+(a/x))/2;
temp=(temp+(a/temp))/2;
temp=(temp+(a/temp))/2;
temp=(temp+(a/temp))/2;
temp=(temp+(a/temp))/2;
temp=(temp+(a/temp))/2;
temp=(temp+(a/temp))/2;
end

assign out=temp;
endmodule

answered Jan 07 '16 at 19:12

Mayur

27
1
3

task sqrt; input[31:0]a; output[15:0]out; reg [31:0]temp,temp1,temp2,temp3,temp4,temp5,temp6; reg [14:0]x; begin if(a<257)x=4; if(a>256 && a<65537)x=80; if(a>65536 && a<16777217)x=1000; if(a>16777216 && a<=4294967295)x=20000; temp=(x+(a/x))/2; temp1=(temp+(a/temp))/2; temp2=(temp1+(a/temp1))/2; temp3=(temp2+(a/temp2))/2; temp4=(temp3+(a/temp3))/2; temp5=(temp4+(a/temp4))/2; temp6=(temp5+(a/temp5))/2; out=temp6; – Mayur Feb 13 '16 at 16:39
3

This comment is not an explanation. If you want to post an updated solution, just click on **edit** below the _this_ answer. – Martin Zabel Feb 13 '16 at 17:50
Doesn't use of divide operation seven times (`a/temp`) make this implementation virtually unusable? My understanding is that true hardware implementations for integer sqrt and integer div are implemented using similar algorithms (testing bits in a result from highest one to lowest one) and should have similar latency. – gudok May 01 '16 at 08:00
1

Basically Newtons method: https://en.wikipedia.org/wiki/Methods_of_computing_square_roots – Ray May 01 '16 at 15:25
indent your code properly, and edit it to add explanation. Code in comments should be put between backticks `\`likethis\`` to make it readable – phuclv Jul 01 '16 at 06:13

Brian Magnuson · Answer 5 · 2018-02-13T14:31:19.397

0

The usual means of doing this in hardware is using a CORDIC. A general implementation allows the calculation of a variety of transcendental functions (cos/sin/tan) and... square roots depending on how you initialize and operate the CORDIC.

It's an iterative algorithm so to do it in a single cycle you'd unroll the loop into as many iterations as you require for your desired precision and chain the instances together.

Specifically if you operated the CORDIC in vectoring mode, initialize it with [x, 0] and rotate to 45 degrees the [x', y'] final output will be a multiplicative constant away. i.e. sqrt(x) = x' * sqrt(2) * K

edited Feb 13 '18 at 14:31

answered Feb 13 '18 at 14:11

Brian Magnuson

1,487
1
9
15

What about `one clock cycle only?` – greybeard Feb 13 '18 at 14:53
If you unroll the loop it is one cycle as I stated. So if you had a block that did one iteration of the algorithm you'd simple chain inputs to outputs (stage->stage->stage->....). Each stage gets closer to the answer in a predictable way so you simple have as many stages as your desired precision requires. – Brian Magnuson Feb 13 '18 at 14:57
1

Making for a horribly long delay and slow clock, as every other iterative algorithm converted to a combinatorial circuit. (With kind of an exception for *exactly **one** iteration*.) – greybeard Feb 13 '18 at 15:00
Uh, ok. The question was how to calculate sqrt in a single cycle. All the answers here are either a) iterative or b) Use dividers. Square root is a non-trivial math operation in hardware. You get to pick small (options presented above) or fast (big ass precomputed table). You can't have both. – Brian Magnuson Feb 13 '18 at 15:08

score 0 · Answer 6 · answered Jul 19 '22 at 00:00

My version of Spektre with variable bits count in so it can be faster on short squares.

    const unsigned int isqrt_lut[16] =
{
    //     m*m
    0x40000000,
    0x10000000,
    0x04000000,
    0x01000000,
    0x00400000,
    0x00100000,
    0x00040000,
    0x00010000,
    0x00004000,
    0x00001000,
    0x00000400,
    0x00000100,
    0x00000040,
    0x00000010,
    0x00000004,
    0x00000001,
};
/// Our largest golf ball image is about 74 pixels, so lets round up to power of 2 and we get 128.
/// 128 squared is 16384 so out largest sqrt has to handle 16383 or 14 bits. Only positive values.
/// ** maxBits in is 2 to 32 always an even number **
/// Input value mist always be less than (2^maxBits) - 1
unsigned int isqrt(unsigned int xx, int maxBitsIn) {
    DWORD x, m, a0, a1, i;

    for (x = 0, a0 = 0, m = 0x01 << (maxBitsIn / 2 - 1), i = 16 - maxBitsIn / 2; m; m >>= 1, i++)
    {
        a1 = a0 + isqrt_lut[i] + (x << (16 - i));
        if (a1 <= xx) {
            a0 = a1;
            x |= m;
        }
    }
    return x;
}

[Did you yourself write the code presented](https://stackoverflow.com/help/referencing)? What about `one clock cycle only`? — greybeard, Jul 19 '22 at 04:42
It is Spekres' code. I adapted it to take a bit precision input. On can not get the sqrt in a single clock cycle at 3GHz. One "single clock" way is to just have a huge lookup table but that has memory access issues that slow it down. If the ALU in the CPU has a sqrt instruction it can be very fast, but still not one clock cycle. — Richard Keene, Jul 21 '22 at 16:01

How to get a square root for 32 bit input in one clock cycle only?

6 Answers6

Linked