10

I am currently searching for a very fast integer square root approximation, where floor(sqrt(x)) <= veryFastIntegerSquareRoot(x) <= x

The square root routine is used for calculating prime numbers, which get considerably faster if only values below or equals sqrt(x) are checked for being a divisor of x.

What I am currently having is this function from Wikipedia, adjusted a small bit to work with 64-bit integers.

Because I have no other function to compare against (or more precise, the function is too precise for my purposes, and it probably takes more time, than being higher than the actual result.)

chqrlie
  • 131,814
  • 10
  • 121
  • 189
Morten
  • 143
  • 1
  • 1
  • 10
  • 2
    Did you profile your code? If you're doing trial division, the sqrt operation is quite unlikely to be the bottleneck. – user2357112 Dec 09 '15 at 19:19
  • The jump-free Newton-Raphson converges in just four steps, IIRC. (for 32 bit ints) – wildplasser Dec 09 '15 at 19:24
  • @AntoineMathys: sqrt only has to be computed once, though. If you do `i * i <= x`, that's an extra multiplication every iteration. – user2357112 Dec 09 '15 at 19:29
  • the complexity still looks like O(n log n) or O(log n). Finding the prime numbers up to 10000000 takes roughly 40 seconds on a pi 2, while finding them up to 1000000 takes 2. – Morten Dec 09 '15 at 19:29
  • @AntoineMathys I just tested it out. I am assigning `sqrt(x)` to a variable (of course) and the loop for up to 10000000 (when the innermost loop is empty) takes 4 seconds as opposed to the constant re-evaluation of `i*i`, which takes over two minutes. – Morten Dec 09 '15 at 19:39
  • @AntoineMathys The `i * i <= x` approach has trouble when `x` is near `(U)INT_MAX` and `i*i` overflows. – chux - Reinstate Monica Dec 09 '15 at 20:49
  • This might be helpful if performance is the only concern: calculate isqrt first and calculate sqrt(x)=x*isqrt(x) https://en.wikipedia.org/wiki/Fast_inverse_square_root This method is used by some DSP's math library. – user3528438 Dec 09 '15 at 23:20
  • @user3528438 This, however, doesn't fit my requirements (bigger or equal than the actual value). The prime number calculation can loose accuracy with that, because it could consider 9 as prime number, because `i*isqrt(i)` returns 2. – Morten Dec 10 '15 at 06:02
  • I understand this is bit off-topic, but x64_64 has sqrt instruction for double. May be will be faster to use it and then cast to int? – Nick Aug 17 '20 at 15:47
  • [for the record] Several comments above refer to (I presume) a since-deleted comment by AntoineMathys pointing out that when trying trial divisors `i` while factoring `x`, you can replace the test `i < sqrt(x)` with `i*i < x`, thus eliminating the square-root computation completely — albeit at other costs. – Steve Summit Dec 20 '21 at 23:41

7 Answers7

7

Loopfree/jumpfree (well: almost ;-) Newton-Raphson:

/* static will allow inlining */
static unsigned usqrt4(unsigned val) {
    unsigned a, b;

    if (val < 2) return val; /* avoid div/0 */

    a = 1255;       /* starting point is relatively unimportant */

    b = val / a; a = (a+b) /2;
    b = val / a; a = (a+b) /2;
    b = val / a; a = (a+b) /2;
    b = val / a; a = (a+b) /2;

    return a;
}

For 64-bit ints you will need a few more steps (my guess: 6)

chqrlie
  • 131,814
  • 10
  • 121
  • 189
wildplasser
  • 43,142
  • 8
  • 66
  • 109
  • I remember a good way is to calculate inverse square root first. This solution may not perform well because of the division. I believe most machines still need 10-50 cycles to calculate a 32bit integer division. – user3528438 Dec 09 '15 at 22:58
  • Division speed may be a factor on some architectures. It is fast on x86, (for x >=3, it used to cost 20...60 clocks on 8086, IIRC). Avoiding loops and jumps, (keeping the insn pipeline full) should work on modern intel. – wildplasser Dec 09 '15 at 23:05
  • I just benchmarked, and the `a = sqrt (0.0 +val);` version is even a bit faster. Still needs an FPU or mmx, though. GCC emits `sqrtsd %xmm0, %xmm1` – wildplasser Dec 09 '15 at 23:36
  • is the inaccuracy using floating point numbers **guaranteed** to be larger? I don't think so. – Morten Dec 10 '15 at 05:42
  • Your code has about double speed. Now testing x*isqrt(); The Inverse square root thingy has four times the speed. – Morten Dec 10 '15 at 05:51
  • I would call this implementation relatively useless. Even a simple test case fails: `usqrt(2 * 2) != 2`. If I wanted limited precision, I would use floating point numbers, not integers. – Jan Schultke Aug 17 '20 at 13:33
  • 2
    To be precise `usqrt4(4) == 78`, so it fails the requirement that the result is a lower bound. – Jan Schultke Aug 17 '20 at 14:35
  • @JanSchultke: The OP actually wants the result to be at least as large as `floor(sqrt(n))`: *I am currently searching for a very fast integer square root approximation, where floor(sqrt(x)) <= veryFastIntegerSquareRoot(x) <= x*. The requirement that fails is `veryFastIntegerSquareRoot(x) <= x` because `usqrt(4) > 4`. As a matter of fact the requirement should be `veryFastIntegerSquareRoot(x) < x` otherwise the prime test running up to and including `veryFastIntegerSquareRoot(x)` will attempt to divide `x` by itself, and incorrectly report `x` as composite. Your answer is far better. – chqrlie Sep 06 '20 at 06:31
  • @Morten: This answer has problems and cannot be used for your purpose. You should instead accept Jan Schultke's answer. – chqrlie Sep 06 '20 at 06:32
3

Computing floor(sqrt(x)) Exactly

Here is my solution, which is based on the bit-guessing approach proposed on Wikipedia. Unfortunately the pseudo-code provided on Wikipedia has some errors so I had to make some adjustments:

unsigned char bit_width(unsigned long long x) {
    return x == 0 ? 1 : 64 - __builtin_clzll(x);
}

// implementation for all unsigned integer types
unsigned sqrt(const unsigned n) {
    unsigned char shift = bit_width(n);
    shift += shift & 1; // round up to next multiple of 2

    unsigned result = 0;

    do {
        shift -= 2;
        result <<= 1; // leftshift the result to make the next guess
        result |= 1;  // guess that the next bit is 1
        result ^= result * result > (n >> shift); // revert if guess too high
    } while (shift != 0);

    return result;
}

bit_width can be evaluated in constant time and the loop will iterate at most ceil(bit_width / 2) times. So even for a 64-bit integer, this will be at worst 32 iterations of basic arithmetic and bitwise operations.

Unlike all other answers proposed so far, this actually gives you the best possible approximation, which is floor(sqrt(x)). For any x2, this will return x exactly.

Making a Guess using log2(x)

If this is still too slow for you, you can make a guess solely based on the binary logarithm. The basic idea is that we can compute the sqrt of any number 2x using 2x/2. x/2 might have a remainder, so we can't always compute this exactly, but we can compute an upper and lower bound.

For example:

  1. we are given 25
  2. compute floor(log_2(25)) = 4
  3. compute ceil(log_2(25)) = 5
  4. lower bound: pow(2, floor(4 / 2)) = 4
  5. upper bound: pow(2, ceil(5 / 2)) = 8

And indeed, the actual sqrt(25) = 5. We found sqrt(16) >= 4 and sqrt(32) <= 8. This means:

4 <= sqrt(16) <= sqrt(25) <= sqrt(32) <= 8
            4 <= sqrt(25) <= 8

This is how we can implement these guesses, which we will call sqrt_lo and sqrt_hi.

// this function computes a lower bound
unsigned sqrt_lo(const unsigned n) noexcept
{
    unsigned log2floor = bit_width(n) - 1;
    return (unsigned) (n != 0) << (log2floor >> 1);
}

// this function computes an upper bound
unsigned sqrt_hi(const unsigned n)
{
    bool isnt_pow2 = ((n - 1) & n) != 0; // check if n is a power of 2
    unsigned log2ceil = bit_width(n) - 1 + isnt_pow2;
    log2ceil += log2ceil & 1; // round up to multiple of 2
    return (unsigned) (n != 0) << (log2ceil >> 1);
}

For these two functions, the following statement is always true:

sqrt_lo(x) <= floor(sqrt(x)) <= sqrt(x) <= sqrt_hi(x) <= x

Note that if we assume that the input is never zero, then (unsigned) (n != 0) can be simplified to 1 and the statement is still true.

These functions can be evaluated in O(1) with a hardware-__builtin_clzll instruction. They only give precise results for numbers 22x, so 256, 64, 16, etc.

Jan Schultke
  • 17,446
  • 6
  • 47
  • 96
  • Your solutions are very interesting as they do not use any divisions. It would be useful to benchmark them against the classic Newton-Raphson method, but on accuracy and performance. – chqrlie Aug 17 '20 at 16:11
  • @chqrlie it performs roughly en-par with Newton's method when using `sqrt_lo` as the initial guess. [See this post for details](https://stackoverflow.com/a/63457507/5740428). – Jan Schultke Aug 17 '20 at 20:50
  • I wonder if converting to float and using the exponent bits could be also used on processors that can handle floating points but have no bit-counting instructions (if such processors exist). – biziclop May 14 '23 at 18:10
  • 1
    @biziclop such processors do exist. Afaik, ARMv5T is the first ARM processor with the needed instruction (`clz`), but earlier ARM processors had single-precision floating point support. – Jan Schultke Jun 11 '23 at 22:57
1

This version can be faster as DIV is slow and for small numbers (Val<20k) this version reduces the error to less than 5%. Tested on ARM M0 (With no DIV hardware acceleration)

static uint32_t usqrt4_6(uint32_t val) {
    uint32_t a, b;

    if (val < 2) return val; /* avoid div/0 */
    a = 1255;       /* starting point is relatively unimportant */
    b = val / a; a = (a + b)>>1;
    b = val / a; a = (a + b)>>1;
    b = val / a; a = (a + b)>>1;
    b = val / a; a = (a + b)>>1;
    if (val < 20000) {  
        b = val / a; a = (a + b)>>1;    // < 17% error Max
        b = val / a; a = (a + b)>>1;    // < 5%  error Max
    }
    return a;
}
chqrlie
  • 131,814
  • 10
  • 121
  • 189
  • 1
    What DIV are you talking about? No self-respecting compiler will translate `/ 2` into a DIV on platform where `>> 1` is faster. There's never any reason to replace unsigned `/ 2` with `>> 1` in C code. – AnT stands with Russia Mar 28 '20 at 21:20
  • @AnT: I suppose Mindy was referring to the test `if (val < 20000)` which saves 2 divisions. I wonder whether this test should be `val > 20000` instead of `val < 20000`. – chqrlie Aug 17 '20 at 15:46
1

On modern PC hardware, computing the square root of n may well be faster using floating point arithmetics than any kind of fast integer math, but as for all performance issues, careful benchmarking is required.

Note however that computing the square root may not be required at all: you can instead square the loop index and stop when the square exceeds the value of n. The dominant operation is the division in the loop body anyway:

#define PBITS32  ((1<<2) | (1<<3) | (1<<5) | (1<<7) | (1<<11) | (1<<13) | \
                  (1UL<<17) | (1UL<<19) | (1UL<<23) | (1UL<<29) | (1UL<<31))

int isprime(unsigned int n) {
    if (n < 32)
        return (PBITS32 >> n) & 1;
    if ((n & 1) == 0)
        return 0;
    for (unsigned int p = 3; p * p <= n; p += 2) {
        if (n % p == 0)
            return 0;
    }
    return 1;
}
chqrlie
  • 131,814
  • 10
  • 121
  • 189
  • Unlike exponentiation, doing this brute-force approach would take way too long for square roots. There are roughly 2^32 square roots to iterate over when brute-forcing the square root of a 64-bit integer. – Jan Schultke Aug 17 '20 at 15:03
  • @J.Schultke: what brute force approach are you referring to? The above code illustrates that square root is not needed for this loop. The goal is to determine the primeness of `n`, not to compute its square root. – chqrlie Aug 17 '20 at 15:24
  • I was not referring to the code, I was referring to what you said in the answer: *"you can instead square the candidates and stop when the square exceeds the value of n."*. And if this has nothing to do with square roots, why is it an answer to a square root question? – Jan Schultke Aug 17 '20 at 15:27
  • 1
    @J.Schultke: The OP wrote *The square root routine is used for calculating prime numbers, which get considerably faster if only values below or equals sqrt(x) are checked for being a divisor of x.* I am suggesting using `p * p <= n` as a loop test instead of computing `sqrt(x)`. I am not suggesting finding the square root by linearly testing candidates, which would indeed be completely inappropriate. – chqrlie Aug 17 '20 at 15:38
1

On modern processors with high-throughput double-precision floating-point support the fastest way to compute the integer square root ⌊√x⌋ for argument x ≤ 253 is to compute it as (uint32_t)sqrt((double)x). For a 32-bit integer square root suitable for processors with no FPU or slow double-precision support, see this answer of mine.

The square root of a 64-bit unsigned integer can be computed efficiently by first computing the reciprocal square root, 1/√x or rsqrt(x), using a low-accuracy table lookup and following this with multiple Newton-Raphson iterations in fixed-point arithmetic. The full-precision reciprocal sqaure root is then multiplied by the original argument to yield the square root. The general Newton-Raphson iteration for the reciprocal square root of a is rn+1 = rn + rn * (1 - a * rn2) / 2. This can be transformed algebraically into various convenient arrangements.

The exemplary C99 code below demonstrates the details of the algorithm outlined above. An eight-bit approximation to the reciprocal square root is retrieved from a 96-entry lookup table using the seven most significant bits of the normalized argument as the index. Normalization requires the count of leading zeros, which is a built-in hardware instruction on many processor architectures, but can also be emulated reasonably efficiently through either a single-precision floating-point computation or integer computation.

To potentially allow the use of small-operand multiplication, the initial reciprocal square root approximation r0 is refined using the following variant of the Newton-Raphson iteration: r1 = (3 * r0 - a * r03) / 2. A second iteration r2 = (r1 * (3 - r1 ( (r1 * a))) / 2 is used to refine this further. The third iteration is combined with the back multiply by a to yield the final square root approximation: s2 = a * r2, s3 = s2 + (r2 * (a - s2 * s2)) / 2.

As a last step, the final normalized square root approximation must be denormalized. The number of bit positions to shift right is half the number of bit positions shifted left during normalization. The result is an underestimate and can be up to 2 smaller than the true result. The correct result ⌊√a⌋, can be determined by examining the magnitude of the remainder.

To achieve good performance on many 32-bit platforms, the first two Newton-Raphson iterations should be performed in 32-bit arithmetic as only limited precision is required at that state. The computation can be sped up incrementally by using a larger table with 96 entries of 32 bits, where the least significant ten bits of each entry store 3 * r0 and the most significant bits store r03 rounded to 22 bits, which introduces negligible error.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
#if defined(_MSC_VER) && defined(_WIN64)
#include <intrin.h>
#endif // defined(_MSC_VER) && defined(_WIN64)

#define CLZ_BUILTIN  (1) // use compiler's built-in count-leading-zeros
#define CLZ_FPU      (2) // emulate count-leading-zeros via FPU
#define CLZ_CPU      (3) // emulate count-leading-zeros via CPU

#define LARGE_TABLE  (1)
#define CLZ_IMPL     (CLZ_CPU)
#define GEN_TAB      (1)

uint32_t umul32_hi (uint32_t a, uint32_t b);
uint64_t umul32_wide (uint32_t a, uint32_t b);
int clz64 (uint64_t a);

#if LARGE_TABLE
uint32_t rsqrt_tab [96] = 
{
    0xfa0bfafa, 0xee6b2aee, 0xe5f02ae5, 0xdaf26ed9, 0xd2f002d0, 0xc890c2c4,
    0xc1037abb, 0xb9a75ab2, 0xb4da42ac, 0xadcea2a3, 0xa6f27a9a, 0xa279c294,
    0x9beb4a8b, 0x97a5ca85, 0x9163427c, 0x8d4fca76, 0x89500270, 0x8563ba6a,
    0x818ac264, 0x7dc4ea5e, 0x7a120258, 0x7671da52, 0x72e4424c, 0x6f690a46,
    0x6db24243, 0x6a52423d, 0x67042637, 0x6563c234, 0x62302a2e, 0x609cea2b,
    0x5d836a25, 0x5bfd1a22, 0x58fd421c, 0x5783ae19, 0x560e4a16, 0x53300210,
    0x51c7120d, 0x50623a0a, 0x4da4c204, 0x4c4c1601, 0x4af769fe, 0x49a6b9fb,
    0x485a01f8, 0x471139f5, 0x45cc59f2, 0x448b5def, 0x4214fde9, 0x40df89e6,
    0x3fade1e3, 0x3e8001e0, 0x3d55e1dd, 0x3c2f79da, 0x3c2f79da, 0x3b0cc5d7,
    0x39edc1d4, 0x38d265d1, 0x37baa9ce, 0x36a689cb, 0x359601c8, 0x348909c5,
    0x348909c5, 0x337f99c2, 0x3279adbf, 0x317741bc, 0x30784db9, 0x30784db9,
    0x2f7cc9b6, 0x2e84b1b3, 0x2d9001b0, 0x2d9001b0, 0x2c9eb1ad, 0x2bb0b9aa,
    0x2bb0b9aa, 0x2ac615a7, 0x29dec1a4, 0x29dec1a4, 0x28fab5a1, 0x2819e99e,
    0x2819e99e, 0x273c599b, 0x273c599b, 0x26620198, 0x258ad995, 0x258ad995,
    0x24b6d992, 0x24b6d992, 0x23e5fd8f, 0x2318418c, 0x2318418c, 0x224d9d89,
    0x224d9d89, 0x21860986, 0x21860986, 0x20c18183, 0x20c18183, 0x20000180,
};
#else // LARGE_TABLE
uint8_t rsqrt_tab [96] = 
{
    0xfe, 0xfa, 0xf7, 0xf3, 0xf0, 0xec, 0xe9, 0xe6, 0xe4, 0xe1, 0xde, 0xdc,
    0xd9, 0xd7, 0xd4, 0xd2, 0xd0, 0xce, 0xcc, 0xca, 0xc8, 0xc6, 0xc4, 0xc2,
    0xc1, 0xbf, 0xbd, 0xbc, 0xba, 0xb9, 0xb7, 0xb6, 0xb4, 0xb3, 0xb2, 0xb0,
    0xaf, 0xae, 0xac, 0xab, 0xaa, 0xa9, 0xa8, 0xa7, 0xa6, 0xa5, 0xa3, 0xa2,
    0xa1, 0xa0, 0x9f, 0x9e, 0x9e, 0x9d, 0x9c, 0x9b, 0x9a, 0x99, 0x98, 0x97,
    0x97, 0x96, 0x95, 0x94, 0x93, 0x93, 0x92, 0x91, 0x90, 0x90, 0x8f, 0x8e,
    0x8e, 0x8d, 0x8c, 0x8c, 0x8b, 0x8a, 0x8a, 0x89, 0x89, 0x88, 0x87, 0x87,
    0x86, 0x86, 0x85, 0x84, 0x84, 0x83, 0x83, 0x82, 0x82, 0x81, 0x81, 0x80,
};
#endif //LARGE_TABLE 

uint32_t my_isqrt64 (uint64_t a)
{
    uint64_t rem, arg = a;
    uint32_t b, r, s, t, scal;

    /* Handle zero as special case */
    if (a == 0ULL) return 0u;
    /* Normalize argument */
    scal = clz64 (a) & ~1;
    a = a << scal;
    b = a >> 32;
    /* Generate initial approximation to 1/sqrt(a) = rsqrt(a) */
    t = rsqrt_tab [(b >> 25) - 32];
    /* Perform first NR iteration for rsqrt */
#if LARGE_TABLE
    r = (t << 22) - umul32_hi (b, t);
#else // LARGE_TABLE
    r = ((3 * t) << 22) - umul32_hi (b, (t * t * t) << 8);
#endif // LARGE_TABLE
    /* Perform second NR iteration for rsqrt */
    s = umul32_hi (r, b);
    s = 0x30000000 - umul32_hi (r, s);
    r = umul32_hi (r, s);
    /* Compute sqrt(a) as a * rsqrt(a); make sure it is an underestimate! */
    r = r * 16;
    s = umul32_hi (r, b);
    s = 2 * s - 10;
    /* Perform third NR iteration combined with back multiply */
    rem = a - umul32_wide (s, s);
    r = umul32_hi ((uint32_t)(rem >> 32), r);
    s = s + r;
    /* Denormalize result */
    s = s >> (scal >> 1);
    /* Make sure we get the floor correct; result underestimates by up to 2 */
    rem = arg - umul32_wide (s, s);
    if (rem >= ((uint64_t)s * 4 + 4)) s+=2;
    else if (rem >= ((uint64_t)s * 2 + 1)) s++;
    return s;
}

uint32_t umul32_hi (uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

uint64_t umul32_wide (uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}

uint32_t float_as_uint32 (float a)
{
    uint32_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

int clz32 (uint32_t a)
{
#if (CLZ_IMPL == CLZ_FPU)
    // Henry S. Warren, Jr, " Hacker's Delight 2nd ed", p. 105
    int n = 158 - (float_as_uint32 ((float)(int32_t)(a & ~(a >> 1))+.5f) >> 23);
    return (n < 0) ? 0 : n;
#elif (CLZ_IMPL == CLZ_CPU)
    static const uint8_t clz_tab[32] = {
        31, 22, 30, 21, 18, 10, 29,  2, 20, 17, 15, 13, 9,  6, 28, 1,
        23, 19, 11,  3, 16, 14,  7, 24, 12,  4,  8, 25, 5, 26, 27, 0
    };
    a |= a >> 16;
    a |= a >> 8;
    a |= a >> 4;
    a |= a >> 2;
    a |= a >> 1;
    return clz_tab [0x07c4acddu * a >> 27] + (!a);
#else // CLZ_IMPL == CLZ_BUILTIN
#if defined(_MSC_VER) && defined(_WIN64)
    return (int)__lzcnt (a);
#else // defined(_MSC_VER) && defined(_WIN64)
    return (int)__builtin_clz (a);
#endif // defined(_MSC_VER) && defined(_WIN64)
#endif // CLZ_IMPL
}

int clz64 (uint64_t a)
{
#if (CLZ_IMPL == CLZ_BUILTIN)
#if defined(_MSC_VER) && defined(_WIN64)
    return (int)__lzcnt64 (a);
#else // defined(_MSC_VER) && defined(_WIN64)
    return (int)__builtin_clzll (a);
#endif // defined(_MSC_VER) && defined(_WIN64)
#else // CLZ_IMPL
    uint32_t ah = (uint32_t)(a >> 32);
    uint32_t al = (uint32_t)(a);
    int r;
    if (ah) al = ah;
    r = clz32 (al);
    if (!ah) r += 32;
    return r;
#endif // CLZ_IMPL
}

/* Henry S. Warren, Jr., "Hacker's Delight, 2nd e.d", p. 286 */
uint32_t ref_isqrt64 (uint64_t x)
{
    uint64_t m, y, b;
    m = 0x4000000000000000ULL;
    y = 0ULL;
    while (m != 0) {
        b = y | m;
        y = y >> 1;
        if (x >= b) {
            x = x - b;
            y = y | m;
        }
        m = m >> 2;
    }
    return (uint32_t)y;
}

/*
  https://groups.google.com/forum/#!original/comp.lang.c/qFv18ql_WlU/IK8KGZZFJx4J
  From: geo <gmars...@gmail.com>
  Newsgroups: sci.math,comp.lang.c,comp.lang.fortran
  Subject: 64-bit KISS RNGs
  Date: Sat, 28 Feb 2009 04:30:48 -0800 (PST)

  This 64-bit KISS RNG has three components, each nearly
  good enough to serve alone.    The components are:
  Multiply-With-Carry (MWC), period (2^121+2^63-1)
  Xorshift (XSH), period 2^64-1
  Congruential (CNG), period 2^64
*/
static uint64_t kiss64_x = 1234567890987654321ULL;
static uint64_t kiss64_c = 123456123456123456ULL;
static uint64_t kiss64_y = 362436362436362436ULL;
static uint64_t kiss64_z = 1066149217761810ULL;
static uint64_t kiss64_t;
#define MWC64  (kiss64_t = (kiss64_x << 58) + kiss64_c, \
                kiss64_c = (kiss64_x >> 6), kiss64_x += kiss64_t, \
                kiss64_c += (kiss64_x < kiss64_t), kiss64_x)
#define XSH64  (kiss64_y ^= (kiss64_y << 13), kiss64_y ^= (kiss64_y >> 17), \
                kiss64_y ^= (kiss64_y << 43))
#define CNG64  (kiss64_z = 6906969069ULL * kiss64_z + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)

#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

int main (void)
{
#if LARGE_TABLE
    printf ("64-bit integer square root implementation w/ large table\n");
#else // LARGE_TAB
    printf ("64-bit integer square root implementation w/ small table\n");
#endif

#if GEN_TAB
    printf ("Generating table ...\n");
    for (int i = 0; i < 96; i++ ) {
        double x1 = 1.0 + i * 1.0 / 32;
        double x2 = 1.0 + (i + 1) * 1.0 / 32;
        double y = (1.0 / sqrt (x1) + 1.0 / sqrt (x2)) * 0.5;
        uint32_t val = (uint32_t)(y * 256 + 0.5);
#if LARGE_TABLE
        uint32_t cube = val * val * val;
        rsqrt_tab[i] = (((cube + 1) / 4) << 10) + (3 * val);
        printf ("0x%08x, ", rsqrt_tab[i]);
        if (i % 6 == 5) printf ("\n");
#else // LARGE_TABLE
        rsqrt_tab[i] = (uint8_t)val;
        printf ("0x%02x, ", rsqrt_tab[i]);
        if (i % 12 == 11) printf ("\n");
#endif // LARGE_TABLE
    }
#endif // GEN_TAB

    printf ("Running benchmark ...\n");

    double start, stop;
    uint32_t sum[8] = {0, 0, 0, 0, 0, 0, 0, 0};
    for (int k = 0; k < 2; k++) {
        uint32_t i = 0;
        start = second();
        do {
            sum [0] += my_isqrt64 (0x31415926ULL * i + 0);
            sum [1] += my_isqrt64 (0x31415926ULL * i + 1);
            sum [2] += my_isqrt64 (0x31415926ULL * i + 2);
            sum [3] += my_isqrt64 (0x31415926ULL * i + 3);
            sum [4] += my_isqrt64 (0x31415926ULL * i + 4);
            sum [5] += my_isqrt64 (0x31415926ULL * i + 5);
            sum [6] += my_isqrt64 (0x31415926ULL * i + 6);
            sum [7] += my_isqrt64 (0x31415926ULL * i + 7);
            i += 8;
        } while (i);
        stop = second();
    }
    printf ("%08x\relapsed=%.5f sec\n", 
            sum[0]+sum[1]+sum[2]+sum[3]+sum[4]+sum[5]+sum[6]+sum[7],
            stop - start);

    printf ("Running functional test ...\n");
    uint64_t a, count = 0;
    uint32_t res, ref;
    do {
        switch (count >> 33) {
        case 0:
            a = count;
            break;
        case 1:
            a = (count & ((1ULL << 33) - 1)) * (count & ((1ULL << 33) - 1) - 1);
            break;
        case 2:
            a = (count & ((1ULL << 33) - 1)) * (count & ((1ULL << 33) - 1));
            break;
        case 3:
            a = (count & ((1ULL << 33) - 1)) * (count & ((1ULL << 33) - 1)) + 1;
            break;
        default:
            a = KISS64;
            break;
        }
        res = my_isqrt64 (a);
        ref = ref_isqrt64 (a);
        if (res != ref) {
            printf ("\nerror: arg=%016llx  res=%08x  ref=%08x  count=%llx\n", a, res, ref, count);
            return EXIT_FAILURE;
        }
        count++;
        if (!(count & 0xffffff)) printf ("\r%016llx", count);
    } while (count);
    printf ("PASSED\n");
    return EXIT_SUCCESS;
}
njuffa
  • 23,970
  • 4
  • 78
  • 130
0

Needed a square root algorithm for another purpose and came upon this thread on search. What I finally ended up with was a notion that sqrt is nearly linear with large values.

If the needed precision is for example sqrt(x) > estimate > sqrt(x)-1 one could use values like this: 0, 16, 146, 581, 1612, 3623, 7100 ... 100 values ... 549043200, 569728768 for standard sqrt function and linearize between them.

Note: values above are estimates, could be a bit wrong. Purpose is just to show how large spans one can use for linearization if the need is several sqrt values that are of the same magnitude.

Peter Csala
  • 17,736
  • 16
  • 35
  • 75
0

For an interesting project on AVR16DD (8bit) I also need a fast sqrt. Here is my today solution:

char Bit_Width (int x)
{   char n = 0;
    while (x > 0)
    {   n++;
        x = x >> 1;
    }
    return n;
}

short unsigned int Sqrt_suInt(short unsigned int x)
{   unsigned char width;
    short unsigned int result;
    if (x < 5)
    {   result = x/2 + x%2;
    }
    else
    {   width = Bit_Width(x) - 1;   /* width > 1 */
        result = 1 << (width/2);
        result = result | ((width % 2) << (width/2 - 1));
        {   short int appendix = x & ~(1 << width);
            appendix = appendix >> (width + 3)/2;
            result += appendix;
        }
        result = (x / result + result) / 2;
    }
    return result;
}

If you would like only an approximate result, you can then omit the last line with division. It's pritty easy and fast. This is only a princip, the next step will be rewriting efficiently in ASM.