Algorithm for square root calculation

Question

I am implementing control software in C, and one of the control algorithms requires a square root calculation. I am looking for a square root algorithm whose execution time is constant irrespective of the radicand value. This requirement rules out the sqrt function from the standard library.

As for my platform, I am working with a 32-bit ARM Cortex-A9 based machine with floating-point support. As for the radicand range, the algorithms in my application are calculated in physical units, so I expect the range <0, 400>. As for the required accuracy, I think an error of about 1 % would be sufficient. Can anybody recommend a square root algorithm suitable for my purposes?

I suggest just using a 400-element lookup table to evaluate sqrt(floor(x)) in constant time, followed by 1 or 2 iterations of Newton's method -- however many suffice to give you the required accuracy in the worst case. — j_random_hacker, May 13 '21 at 06:55
So: 1. Do you want a 32/64-bit integer, floating-point or fixed-point sqrt? 2. What operations do you have at your disposal? I assume you do not have an FPU. 3. What are the constraints (RAM/ROM memory, number of iterations, ...)? I would start with a binary search without multiplication like this [integer sqrt](https://stackoverflow.com/a/34657972/2521214) and convert it to float/fixed if needed (by precomputing the exponent, and making the mantissa a bit bigger for easy adjustment in the final result normalization step). Also see [Power by squaring for negative exponents](https://stackoverflow.com/a/30962495/2521214) for inspiration. — Spektre, May 13 '21 at 07:22
What exactly do you mean by "constant time"? Essentially nothing is constant time on a modern processor. For example, the speed of a lookup table will depend on whether the relevant line is cached or not. Branch instructions run faster or slower based on the cpu predicting the branch, which depends on execution history. "Constant time" is used in cryptography to mean something that doesn't have timing attacks -- is that what you want? — Paul Hankin, May 13 '21 at 08:02
(An entirely similar question came up on [codereview@SE](https://codereview.stackexchange.com/q/260559).) Please provide more context - e.g., if the square roots are used for comparison with other values, only, consider comparing the original value to the square of the other. — greybeard, May 13 '21 at 09:54

Support Ukraine · Answer 1 · 2021-05-13T08:51:25.803

My initial approach would be to use the Taylor series for the square root with precalculated coefficients at a number of fixed points. This reduces the calculation to one subtraction and a number of multiplications and additions.

The look-up table would be a 2D array like:

point | C0  | C1  | C2  | C3  | C4  | ...
-----------------------------------------
 0.5  | f00 | f01 | f02 | f03 | f04 |
-----------------------------------------
 1.0  | f10 | f11 | f12 | f13 | f14 |
-----------------------------------------
 1.5  | f20 | f21 | f22 | f23 | f24 |
-----------------------------------------
....

So when calculating sqrt(x) use the table row with the point closest to x.

Example:

sqrt(1.1) (i.e. use the coefficients of point 1.0)

f10 + 
f11 * (1.1 - 1.0) + 
f12 * (1.1 - 1.0) ^ 2 + 
f13 * (1.1 - 1.0) ^ 3 + 
f14 * (1.1 - 1.0) ^ 4
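
For reference, the entries of one table row are the standard Taylor coefficients of the square root about that row's point. The symbols a (the expansion point) and c_k (the row entries f_k0 ... f_k4) are my notation, not the answer's:

```latex
\sqrt{x} \approx \sum_{k=0}^{4} c_k\,(x-a)^k,
\qquad
c_0 = \sqrt{a},\quad
c_1 = \frac{1}{2\sqrt{a}},\quad
c_2 = -\frac{1}{8\,a^{3/2}},\quad
c_3 = \frac{1}{16\,a^{5/2}},\quad
c_4 = -\frac{5}{128\,a^{7/2}}

% Worked example for a = 1, x = 1.1:
% 1 + 0.05 - 0.00125 + 0.0000625 - 0.0000039 \approx 1.0488086
% compared with \sqrt{1.1} \approx 1.0488088
```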

The table above suggests a fixed distance between the points at which you precalculate the coefficients (i.e. 0.5 between each point). However, due to the nature of the square root, you may find that the distance between points should differ for different ranges of x. For instance, x in [0 - 1] -> distance 0.1, x in [1 - 2] -> distance 0.25, x in [2 - 10] -> distance 0.5, and so on.

Another thing is the number of terms needed to get the desired precision. Here you may also find that different ranges of x may require a different number of coefficients.

All of this is easy to precalculate on a normal computer (e.g. using Excel).

Note: For values very close to zero this method isn't good, because the derivatives of the square root grow without bound there. Newton's method may be a better choice in that region.

Taylor series: https://en.wikipedia.org/wiki/Taylor_series

Newtons method: https://en.wikipedia.org/wiki/Newton%27s_method

Also relevant: https://math.stackexchange.com/questions/291168/algorithms-for-approximating-sqrt2

Aki Suihkonen · Answer 2 · 2021-05-13T09:49:58.297

The Arm v7 instruction set provides a fast instruction for estimating the reciprocal square root, available through the intrinsics vrsqrte_f32 for two simultaneous approximations and vrsqrteq_f32 for four approximations. (The scalar variant vrsqrtes_f32 is only available on Arm64 v8.2).

Then the result can be simply calculated by x * vrsqrte_f32(x);, which has better than 0.33% relative accuracy over the whole range of positive values x. See https://www.mdpi.com/2079-3197/9/2/21/pdf

ARM NEON instruction FRSQRTE gives 8.25 correct bits of the result.

At x==0 vrsqrtes_f32(x) == Inf, so x*vrsqrtes_f32(x) would be NaN.

If the value of x==0 is unavoidable, the optimal two instruction sequence needs a bit more adjustment:

#include <arm_neon.h>

float sqrtest(float a) {
    // need to "transfer" or "convert" the scalar input 
    // to a vector of two
    // - optimally we would not need an instruction for that
    // but we would just let the processor calculate the instruction
    // for all the lanes in the register
    float32x2_t a2 = vdup_n_f32(a);

    // next we create a mask that is all ones for the legal
    // domain of 1/sqrt(x); the comparison yields an integer mask
    uint32x2_t is_legal = vcgt_f32(a2, vdup_n_f32(0.0f));

    // calculate two reciprocal square root estimates in parallel
    float32x2_t a2est = vrsqrte_f32(a2);

    // we need to mask the result, so that effectively
    // all non-legal values of a2est are zeroed; the bitwise AND
    // operates on the integer view of the estimate
    a2est = vreinterpret_f32_u32(vand_u32(is_legal, vreinterpret_u32_f32(a2est)));

    // x * 1/sqrt(x) == sqrt(x)
    a2 = vmul_f32(a2, a2est);

    // finally we get only the zero lane of the result
    // discarding the other half
    return vget_lane_f32(a2, 0);
}

This method should have almost twice the throughput when two values are processed at once:

void sqrtest2(float &a, float &b) {
    float32x2_t a2 = vset_lane_f32(b, vdup_n_f32(a), 1);
    uint32x2_t is_legal = vcgt_f32(a2, vdup_n_f32(0.0f));
    float32x2_t a2est = vrsqrte_f32(a2);
    a2est = vreinterpret_f32_u32(vand_u32(is_legal, vreinterpret_u32_f32(a2est)));
    a2 = vmul_f32(a2, a2est);
    a = vget_lane_f32(a2,0); 
    b = vget_lane_f32(a2,1); 
}

It is even better if you can work directly with float32x2_t or float32x4_t inputs and outputs:

float32x2_t sqrtest2(float32x2_t a2) {
    uint32x2_t is_legal = vcgt_f32(a2, vdup_n_f32(0.0f));
    float32x2_t a2est = vrsqrte_f32(a2);
    a2est = vreinterpret_f32_u32(vand_u32(is_legal, vreinterpret_u32_f32(a2est)));
    return vmul_f32(a2, a2est);
}

This implementation gives sqrtest2(1) == 0.998 and sqrtest2(400) == 19.97 (tested on a MacBook M1 with arm64). Being branchless and LUT-free, it likely has a constant execution time, assuming that all the instructions execute in a constant number of cycles.

score 0 · Answer 3 · answered May 14 '21 at 07:12

I have decided to use the following approach. I chose Newton's method and then experimentally determined a fixed number of iterations such that the error over the whole range of the radicand, i.e. <0, 400>, does not exceed the prescribed value. I ended up with six iterations. For a radicand of 0, I decided to return 0 without any calculation.
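
A minimal sketch of this approach in C, assuming the usual Newton (Heron) update y = (y + x/y) / 2 and an initial guess of 1.0. The answer does not state its starting value, so the guess, the treatment of negative inputs and the name newton_sqrt are my assumptions rather than the author's code.

```c
#define NEWTON_ITERATIONS 6   /* fixed count, as reported in the answer */

/* Square root by a fixed number of Newton (Heron) iterations. */
float newton_sqrt(float x)
{
    /* Radicand 0 returns 0 without any calculation (as in the answer);
     * treating negative inputs the same way is an assumption. */
    if (x <= 0.0f) {
        return 0.0f;
    }

    float y = 1.0f;           /* assumed initial guess */
    for (int i = 0; i < NEWTON_ITERATIONS; ++i) {
        y = 0.5f * (y + x / y);
    }
    return y;
}
```

With this starting value, newton_sqrt(400.0f) comes out at about 20.07, which is within the 1 % target. Because the iteration count is fixed, radicands very close to zero (but not zero) converge less far and would need more iterations or a better initial guess.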
