76

As we know if n is not a perfect square, then sqrt(n) would not be an integer. Since I need only the integer part, I feel that calling sqrt(n) wouldn't be that fast, as it takes time to calculate the fractional part also.

So my question is,

Can we get only the integer part of sqrt(n) without calculating the actual value of sqrt(n)? The algorithm should be faster than sqrt(n) (defined in <math.h> or <cmath>)?

If possible, you can write the code in asm block also.

Steffi Keran Rani J
  • 3,667
  • 4
  • 34
  • 56
Nawaz
  • 353,942
  • 115
  • 666
  • 851
  • 28
    Most CPUs perform sqrt in hardware, so it's unlikely that you'll be able to go faster by computing only the integer part. – Gabe Feb 08 '11 at 06:44
  • Here is an interesting link for a more deterministic algorithm: http://www.embedded-systems.com/98/9802fe2.htm – leppie Feb 08 '11 at 06:46
  • 4
    @Gabe: `sqrt()` in the C library is unlikely to be implemented directly as a hardware `sqrt` instruction on all machines, since the hardware might not handle all the corner cases required by IEEE 754. If you don't care, you might use inline asm or gcc's `-ffast-math` to get directly at the hardware. – R.. GitHub STOP HELPING ICE Feb 08 '11 at 07:04
  • May be this [link](http://stackoverflow.com/questions/295579/fastest-way-to-determine-if-an-integers-square-root-is-an-integer) can help you. – Emil Feb 08 '11 at 07:05
  • @R: http://assemblyrequired.crashworks.org/2009/10/ shows a few different ways to compute sqrt with FP math; `sqrt(x)` (which is just `FSQRT`) is the slowest at 24ns, with SIMD versions being the fastest, averaging less than 1ns for an approximation. – Gabe Feb 08 '11 at 07:32
  • What range of n are you interested in? – Dipstick Feb 08 '11 at 07:52
  • @Dipstick: Any range, as long as a built-in type can represent it. – Nawaz Feb 08 '11 at 10:19
  • 14
    did you profile your application? Are you sure you need to improve the sqrt(n) speed? – Alessandro Teruzzi Feb 08 '11 at 12:56

14 Answers14

22

I would try the Fast Inverse Square Root trick.

It's a way to get a very good approximation of 1/sqrt(n) without any branch, based on some bit-twiddling so not portable (notably between 32-bits and 64-bits platforms).

Once you get it, you just need to inverse the result, and takes the integer part.

There might be faster tricks, of course, since this one is a bit of a round about.

EDIT: let's do it!

First a little helper:

// benchmark.h
#include <sys/time.h>

template <typename Func>
double benchmark(Func f, size_t iterations)
{
  f();

  timeval a, b;
  gettimeofday(&a, 0);
  for (; iterations --> 0;)
  {
    f();
  }
  gettimeofday(&b, 0);
  return (b.tv_sec * (unsigned int)1e6 + b.tv_usec) -
         (a.tv_sec * (unsigned int)1e6 + a.tv_usec);
}

Then the main body:

#include <iostream>

#include <cmath>

#include "benchmark.h"

class Sqrt
{
public:
  Sqrt(int n): _number(n) {}

  int operator()() const
  {
    double d = _number;
    return static_cast<int>(std::sqrt(d) + 0.5);
  }

private:
  int _number;
};

// http://www.codecodex.com/wiki/Calculate_an_integer_square_root
class IntSqrt
{
public:
  IntSqrt(int n): _number(n) {}

  int operator()() const 
  {
    int remainder = _number;
    if (remainder < 0) { return 0; }

    int place = 1 <<(sizeof(int)*8 -2);

    while (place > remainder) { place /= 4; }

    int root = 0;
    while (place)
    {
      if (remainder >= root + place)
      {
        remainder -= root + place;
        root += place*2;
      }
      root /= 2;
      place /= 4;
    }
    return root;
  }

private:
  int _number;
};

// http://en.wikipedia.org/wiki/Fast_inverse_square_root
class FastSqrt
{
public:
  FastSqrt(int n): _number(n) {}

  int operator()() const
  {
    float number = _number;

    float x2 = number * 0.5F;
    float y = number;
    long i = *(long*)&y;
    //i = (long)0x5fe6ec85e7de30da - (i >> 1);
    i = 0x5f3759df - (i >> 1);
    y = *(float*)&i;

    y = y * (1.5F - (x2*y*y));
    y = y * (1.5F - (x2*y*y)); // let's be precise

    return static_cast<int>(1/y + 0.5f);
  }

private:
  int _number;
};


int main(int argc, char* argv[])
{
  if (argc != 3) {
    std::cerr << "Usage: %prog integer iterations\n";
    return 1;
  }

  int n = atoi(argv[1]);
  int it = atoi(argv[2]);

  assert(Sqrt(n)() == IntSqrt(n)() &&
          Sqrt(n)() == FastSqrt(n)() && "Different Roots!");
  std::cout << "sqrt(" << n << ") = " << Sqrt(n)() << "\n";

  double time = benchmark(Sqrt(n), it);
  double intTime = benchmark(IntSqrt(n), it);
  double fastTime = benchmark(FastSqrt(n), it);

  std::cout << "Number iterations: " << it << "\n"
               "Sqrt computation : " << time << "\n"
               "Int computation  : " << intTime << "\n"
               "Fast computation : " << fastTime << "\n";

  return 0;
}

And the results:

sqrt(82) = 9
Number iterations: 4096
Sqrt computation : 56
Int computation  : 217
Fast computation : 119

// Note had to tweak the program here as Int here returns -1 :/
sqrt(2147483647) = 46341 // real answer sqrt(2 147 483 647) = 46 340.95
Number iterations: 4096
Sqrt computation : 57
Int computation  : 313
Fast computation : 119

Where as expected the Fast computation performs much better than the Int computation.

Oh, and by the way, sqrt is faster :)

Matthieu M.
  • 287,565
  • 48
  • 449
  • 722
  • This is for floating point but nawaz just needs integer values. – Saeed Amiri Feb 08 '11 at 07:32
  • @Saeed: an integer can be trivially converted to a float (back and forth) and I am curious about the applicability of this method. It certainly is the only branchless method without a pre-computed table that I could think of. After that... I guess we could benchmark :) ? – Matthieu M. Feb 08 '11 at 07:36
  • Yes it can, but I think methods (like article I referenced) are faster (because they are just care about integer parts) but yes should benchmark this ways. – Saeed Amiri Feb 08 '11 at 07:52
  • @Saeed: done, as expected the Fast Inverse Trick performs better, being branchless pays off I guess – Matthieu M. Feb 08 '11 at 08:44
  • @Saeed: normally you've got all the includes. I wonder what `-ffast-math` would give for `sqrt`. – Matthieu M. Feb 08 '11 at 09:01
  • @Matthieu M : good attempt. +1.. but as I had expected `sqrt` would still be faster. In fact, `sqrt` is awesomely faster!! – Nawaz Feb 08 '11 at 10:25
  • @Nawaz: actually, I had a good enough precision with a single iteration of Newton's method in the Fast case which gave only `86`, so I don't find `sqrt` that fast, I would have hoped a hardware accelerated version would perform much better :/ – Matthieu M. Feb 08 '11 at 10:36
  • `double` precision isn't good enough for 64-bit square roots and `float` precision isn't good enough for 32-bit square roots. Try anything beyond 2^31 any you will see inconsistent rounding with floats. Or try anything beyond 2^63 with doubles. – Jan Schultke Aug 17 '20 at 16:57
  • `sqrt` is faster because most modern architectures have a `rsqrt` instruction to calculate the reverse square root very fast, so they only need 1-3 more iterations to refine the result – phuclv Mar 13 '23 at 00:27
17

Edit: this answer is foolish - use (int) sqrt(i)

After profiling with proper settings (-march=native -m64 -O3) the above was a lot faster.


Alright, a bit old question, but the "fastest" answer has not been given yet. The fastest (I think) is the Binary Square Root algorithm, explained fully in this Embedded.com article.

It basicly comes down to this:

unsigned short isqrt(unsigned long a) {
    unsigned long rem = 0;
    int root = 0;
    int i;

    for (i = 0; i < 16; i++) {
        root <<= 1;
        rem <<= 2;
        rem += a >> 30;
        a <<= 2;

        if (root < rem) {
            root++;
            rem -= root;
            root++;
        }
    }

    return (unsigned short) (root >> 1);
}

On my machine (Q6600, Ubuntu 10.10) I profiled by taking the square root of the numbers 1-100000000. Using iqsrt(i) took 2750 ms. Using (unsigned short) sqrt((float) i) took 3600ms. This was done using g++ -O3. Using the -ffast-math compile option the times were 2100ms and 3100ms respectively. Note this is without using even a single line of assembler so it could probably still be much faster.

The above code works for both C and C++ and with minor syntax changes also for Java.

What works even better for a limited range is a binary search. On my machine this blows the version above out of the water by a factor 4. Sadly it's very limited in range:

#include <stdint.h>

const uint16_t squares[] = {
    0, 1, 4, 9,
    16, 25, 36, 49,
    64, 81, 100, 121,
    144, 169, 196, 225,
    256, 289, 324, 361,
    400, 441, 484, 529,
    576, 625, 676, 729,
    784, 841, 900, 961,
    1024, 1089, 1156, 1225,
    1296, 1369, 1444, 1521,
    1600, 1681, 1764, 1849,
    1936, 2025, 2116, 2209,
    2304, 2401, 2500, 2601,
    2704, 2809, 2916, 3025,
    3136, 3249, 3364, 3481,
    3600, 3721, 3844, 3969,
    4096, 4225, 4356, 4489,
    4624, 4761, 4900, 5041,
    5184, 5329, 5476, 5625,
    5776, 5929, 6084, 6241,
    6400, 6561, 6724, 6889,
    7056, 7225, 7396, 7569,
    7744, 7921, 8100, 8281,
    8464, 8649, 8836, 9025,
    9216, 9409, 9604, 9801,
    10000, 10201, 10404, 10609,
    10816, 11025, 11236, 11449,
    11664, 11881, 12100, 12321,
    12544, 12769, 12996, 13225,
    13456, 13689, 13924, 14161,
    14400, 14641, 14884, 15129,
    15376, 15625, 15876, 16129,
    16384, 16641, 16900, 17161,
    17424, 17689, 17956, 18225,
    18496, 18769, 19044, 19321,
    19600, 19881, 20164, 20449,
    20736, 21025, 21316, 21609,
    21904, 22201, 22500, 22801,
    23104, 23409, 23716, 24025,
    24336, 24649, 24964, 25281,
    25600, 25921, 26244, 26569,
    26896, 27225, 27556, 27889,
    28224, 28561, 28900, 29241,
    29584, 29929, 30276, 30625,
    30976, 31329, 31684, 32041,
    32400, 32761, 33124, 33489,
    33856, 34225, 34596, 34969,
    35344, 35721, 36100, 36481,
    36864, 37249, 37636, 38025,
    38416, 38809, 39204, 39601,
    40000, 40401, 40804, 41209,
    41616, 42025, 42436, 42849,
    43264, 43681, 44100, 44521,
    44944, 45369, 45796, 46225,
    46656, 47089, 47524, 47961,
    48400, 48841, 49284, 49729,
    50176, 50625, 51076, 51529,
    51984, 52441, 52900, 53361,
    53824, 54289, 54756, 55225,
    55696, 56169, 56644, 57121,
    57600, 58081, 58564, 59049,
    59536, 60025, 60516, 61009,
    61504, 62001, 62500, 63001,
    63504, 64009, 64516, 65025
};

inline int isqrt(uint16_t x) {
    const uint16_t *p = squares;

    if (p[128] <= x) p += 128;
    if (p[ 64] <= x) p +=  64;
    if (p[ 32] <= x) p +=  32;
    if (p[ 16] <= x) p +=  16;
    if (p[  8] <= x) p +=   8;
    if (p[  4] <= x) p +=   4;
    if (p[  2] <= x) p +=   2;
    if (p[  1] <= x) p +=   1;

    return p - squares;
}

A 32 bit version can be downloaded here: https://gist.github.com/3481770

Wayne Conrad
  • 103,207
  • 26
  • 155
  • 191
orlp
  • 112,504
  • 36
  • 218
  • 315
  • Try my branchless variant and see if it's faster. – R.. GitHub STOP HELPING ICE Aug 26 '12 at 17:02
  • @R.: nope, it's slower by around a factor 3. – orlp Aug 26 '12 at 17:10
  • Is your compiler perhaps using `cmov` for your version? – R.. GitHub STOP HELPING ICE Aug 26 '12 at 17:10
  • You might also try this simplified version with (implementation-defined) signed right-shift: `for (s=squares, i=128; i; i=i>>1) s += s[i]-x-1>>31 & i; return s-squares;` – R.. GitHub STOP HELPING ICE Aug 26 '12 at 17:13
  • 1
    @R.: nope it doesn't use `cmov`. Also, hand unrolling the loop actually is faster by around 20%. Here is the asm output for both versions (note that I made the 32 bit version): https://gist.github.com/3481749 The full 32 bit version can be downloaded here: https://gist.github.com/3481770 – orlp Aug 26 '12 at 17:19
  • You should move the "this answer is bullshit" edit to the top of your answer. I wasted time reading bullshit `;-)` – rubenvb Oct 19 '13 at 11:52
  • `double` precision isn't good enough for 64-bit square roots and `float` precision isn't good enough for 32-bit square roots. Try anything beyond 2^31 any you will see inconsistent rounding with floats. Or try anything beyond 2^63 with doubles. – Jan Schultke Aug 17 '20 at 16:57
6

If you don't mind an approximation, how about this integer sqrt function I cobbled together.

int sqrti(int x)
{
    union { float f; int x; } v; 

    // convert to float
    v.f = (float)x;

    // fast aprox sqrt
    //  assumes float is in IEEE 754 single precision format 
    //  assumes int is 32 bits
    //  b = exponent bias
    //  m = number of mantissa bits
    v.x  -= 1 << 23; // subtract 2^m 
    v.x >>= 1;       // divide by 2
    v.x  += 1 << 29; // add ((b + 1) / 2) * 2^m

    // convert to int
    return (int)v.f;
}

It uses the algorithm described in this Wikipedia article. On my machine it's almost twice as fast as sqrt :)

Shmo
  • 346
  • 4
  • 5
  • 4
    Technically this breaks the strict aliasing rule. It doesn't seem to cause a problem under recent gcc (4.9), but the compliant way of doing it would be `union { float f; int32_t x } v; v.f = (float) x; v.x -= ... return (int)((float)v.x);`. – Viktor Dahl Apr 22 '15 at 19:35
  • 1
    Not bad it serves me to learn something, but how do you people measure speed? all the sqrt(int) functions I found on www are slower than std::sqrt. acording to my measuerments. your function is 27% slower and 3% less accurate btw. than std – metablaster Sep 27 '19 at 18:05
6

While I suspect you can find a plenty of options by searching for "fast integer square root", here are some potentially-new ideas that might work well (each independent, or maybe you can combine them):

  1. Make a static const array of all the perfect squares in the domain you want to support, and perform a fast branchless binary search on it. The resulting index in the array is the square root.
  2. Convert the number to floating point and break it into mantissa and exponent. Halve the exponent and multiply the mantissa by some magic factor (your job to find it). This should be able to give you a very close approximation. Include a final step to adjust it if it's not exact (or use it as a starting point for the binary search above).
R.. GitHub STOP HELPING ICE
  • 208,859
  • 35
  • 376
  • 711
  • I liked your (1) but I don't think you need to precalculate the perfect squares. Of course it depends on how many times he is going to be calling that, but precalculating it for a few calls would be a lot more expensive than the bs by itself. – Piva Feb 08 '11 at 06:55
  • 4
    If you want to square the 'index' at every step of your binary search, be my guest. It will be *slooooooow*. That's why I suggested precalculating them. Note that I said `static const`. There is no cost to computing it because it happened before your program was compiled. And even if you support the full range of 32-bit integers, your table will only be 256kb. – R.. GitHub STOP HELPING ICE Feb 08 '11 at 06:56
  • 2
    I have used strategy 1 in a high performance context and it worked beautifully. I further enhanced the performance of the search using the knowledge that the next sqrt to be taken was likely to be close to the previous one (the context was graphical) ad it made a staggering performance difference. – Elemental Feb 08 '11 at 07:11
  • 2
    @R.. : I don't think (1) will be faster than `sqrt`; binary searching on a list of 999999 integers would most likely to be slow than sqrt! – Nawaz Feb 08 '11 at 07:18
  • 2
    @Nawaz: given you apparently care enough to ask the question, how about benchmarking it before condemning it. Much will depend on your exact hardware.... – Tony Delroy Feb 08 '11 at 07:35
  • 1
    @R.. : squaring the index takes a single integer operation. There are plenty of integer pipelines. Accessing memory is _way_ more expensive because that's a shared resource. Besides, you'd first have to calculate the address `square[i]` which like `i*i` is a single integer operation. So even if accessing `square[i]` would be free, it still wouldn't be faster. – MSalters Feb 08 '11 at 08:11
  • I'd love to see this benchmarked! I'd also love to see what a "branchless binary search" would look like. – Gabe Feb 08 '11 at 13:52
  • Multiplication is much slower than addition, at least in latency. And in the binary search, latency is what will matter since you can't do the next step without the result of the previous (well, you could try to start squaring both possible next attempts, and then throw away one you don't need). – R.. GitHub STOP HELPING ICE Feb 08 '11 at 19:19
  • 1
    As for what "branchless binary search" means, it means simply advancing the `left` index of the binary search by an integer expression which evaluates to 0 or `len/2` based on the difference of the value being tested and the value being sought, for example by making a mask out of the high bit. It can also be done using `cmov`-type instructions. – R.. GitHub STOP HELPING ICE Feb 08 '11 at 19:21
  • @Nawaz: You don't search the whole list. You can come up with some obvious initial bounds. – R.. GitHub STOP HELPING ICE Feb 08 '11 at 19:22
  • 1
    @Nawaz and R.: I implemented solution 1 in C for uint16_t's, but it seems to get outperformed by my solution and even `(int) sqrt` on gcc -O3. You can look at it here: https://gist.github.com/3481295 . Maybe you can improve my implementation? – orlp Aug 26 '12 at 15:59
  • Each of those `<` operators will compile to a conditional branch (or at least conditional move) because the compiler can't assume anything about the range of values. You can, so you could (for example) use `unsigned` expressions and just use bit 31 of the difference to get a 0/1 result based on which is greater. – R.. GitHub STOP HELPING ICE Aug 26 '12 at 16:20
  • 3
    @Nawaz and R: I actually implemented it much better now. It's not branchless but it blows everything else so far out of the water: https://gist.github.com/3481607 – orlp Aug 26 '12 at 16:52
  • 1
    Here's my version: `for (s=squares, i=128; i; i=i>>1) s+=-((unsigned)(s[i]-x-1)>>31) & i; return s-squares;` Note that you can fix your table to remove the unwanted `-1` from the code. There's no need to unroll the loop by hand; `gcc -O3` will do it for you. – R.. GitHub STOP HELPING ICE Aug 26 '12 at 16:57
  • @R..GitHubSTOPHELPINGICE yours and mine are both far slower than sqrt, but orlp's is faster https://quick-bench.com/q/tSgC5Dr4iIiDnuxdmLNNB7Vd110. I'll have to check the disassembly later – Mooing Duck May 17 '23 at 23:12
  • I've confirmed that all three are accurate http://coliru.stacked-crooked.com/a/8c80e6fede506deb and that orlp's implementation is indeed branchless, despite the `if`s in the code. https://godbolt.org/z/TeKj5reev. orlp's is also 3 ops-per-bit, wheras yours is 5 ops-per-bit. – Mooing Duck May 18 '23 at 04:17
  • Curiously, if I convert his to `p += (p[128] <= x)<<7;`, then it becomes 5-ops-per-bit and is even slower than yours :O https://quick-bench.com/q/Wos2u5byYXrnmCff82T8ZO_19SY – Mooing Duck May 18 '23 at 04:27
5

The following solution computes the integer part, meaning floor(sqrt(x)) exactly, with no rounding errors.

Problems With Other Approaches

  • using float or double is neither portable nor precise enough
  • @orlp's isqrt gives insane results like isqrt(100) = 15
  • approaches based on huge lookup tables are not practical beyond 32 bits
  • using a fast inverse sqrt is very imprecise, you're better off using sqrtf
  • Newton's approach requires expensive integer division and a good initial guess

My Approach

Mine is based on the bit-guessing approach proposed on Wikipedia. Unfortunately the pseudo-code provided on Wikipedia has some errors so I had to make some adjustments:

// C++20 also provides std::bit_width in its <bit> header
unsigned char bit_width(unsigned long long x) {
    return x == 0 ? 1 : 64 - __builtin_clzll(x);
}

template <typename Int, std::enable_if_t<std::is_unsigned<Int, int = 0>>
Int sqrt(const Int n) {
    unsigned char shift = bit_width(n);
    shift += shift & 1; // round up to next multiple of 2

    Int result = 0;

    do {
        shift -= 2;
        result <<= 1; // make space for the next guessed bit
        result |= 1;  // guess that the next bit is 1
        result ^= result * result > (n >> shift); // revert if guess too high
    } while (shift != 0);

    return result;
}

bit_width can be evaluated in constant time and the loop will iterate at most ceil(bit_width / 2) times. So even for a 64-bit integer, this will be at worst 32 iterations of basic arithmetic and bitwise operations.

The compile output is only around 20 instructions.

Performance

I have benchmarked my methods against float-bases ones by generating inputs uniformly. Note that in the real world most inputs would be much closer to zero than to std::numeric_limits<...>::max().

  • for uint32_t this performs about 25x worse than using std::sqrt(float)
  • for uint64_t this performs about 30x worse than using std::sqrt(double)

Accuracy

This method is always perfectly accurate, unlike approaches using floating point math.

  • Using sqrtf can provide incorrect rounding in the [228, 232) range. For example, sqrtf(0xffffffff) = 65536, when the square root is actually 65535.99999.
  • Double precision doesn't work consistently for the [260, 264) range. For example, sqrt(0x3fff...) = 2147483648, when the square root is actually 2147483647.999999.

The only thing that covers all 64-bit integers is x86 extended precision long double, simply because it can fit an entire 64-bit integer.

Conclusion

As I said, this the only solution that handles all inputs correctly, avoids integer division and doesn't require lookup tables. In summary, if you need a method that is independent of precision and doesn't require gigantic lookup tables, this is your only option. It might be especially useful in a constexpr context where performance isn't critical and where it could be much more important to get a 100% accurate result.

Alternative Approach Using Newton's Method

Newton's method can be quite fast when starting with a good guess. For our guess, we will round down to the next power of 2 and compute the square root in constant time. For any number 2x, we can obtain the square root using 2x/2.

template <typename Int, std::enable_if_t<std::is_unsigned_v<Int>, int> = 0>
Int sqrt_guess(const Int n)
{
    Int log2floor = bit_width(n) - 1;
    // sqrt(x) is equivalent to pow(2, x / 2 = x >> 1)
    // pow(2, x) is equivalent to 1 << x
    return 1 << (log2floor >> 1);
}

Note that this is not exactly 2x/2 because we lost some precision during the rightshift. Instead it is 2floor(x/2). Also note that sqrt_guess(0) = 1 which is actually necessary to avoid division by zero in the first iteration:

template <typename Int, std::enable_if_t<std::is_unsigned_v<Int>, int> = 0>
Int sqrt_newton(const Int n)
{
    Int a = sqrt_guess(n);
    Int b = n;
    
    // compute unsigned difference
    while (std::max(a, b) - std::min(a, b) > 1) {
        b = n / a;
        a = (a + b) / 2;
    }

    // a is now either floor(sqrt(n)) or ceil(sqrt(n))
    // we decrement in the latter case
    // this is overflow-safe as long as we start with a lower bound guess
    return a - (a * a > n);
}

This alternative approach performs roughly equivalent to the first proposal, but is usually a few percentage points faster. However, it heavily relies on efficient hardware division and result can vary heavily.

The use of sqrt_guess makes a huge difference. It is roughly five times faster than using 1 as the initial guess.

Jan Schultke
  • 17,446
  • 6
  • 47
  • 96
4

To do integer sqrt you can use this specialization of newtons method:

Def isqrt(N):

    a = 1
    b = N

    while |a-b| > 1
        b = N / a
        a = (a + b) / 2

    return a

Basically for any x the sqrt lies in the range (x ... N/x), so we just bisect that interval at every loop for the new guess. Sort of like binary search but it converges must faster.

This converges in O(loglog(N)) which is very fast. It also doesn't use floating point at all, and it will also work well for arbitrary precision integers.

Andrew Tomazos
  • 66,139
  • 40
  • 186
  • 319
  • 2
    I tried that one on iPhone hardware, but it seemed to be slow because of the 'b = N / a' operation. – Fredrik Johansson Aug 01 '13 at 17:19
  • 2
    @AndrewTomazos — Unfortunately, your function fails to return the correct answer for N ∈ { 0, 3, 8, 48, 63, 120, 143, ... }. – Todd Lehman Jul 18 '15 at 07:19
  • 1
    This is only useful if integer division is much more efficient than FP sqrt. Or maybe if you need to handle integers that are too large to have exact float representations. Modern x86 hardware ([Intel Haswell](http://agner.org/optimize/)) can convert to float and back, and do a double-precision FP sqrt, in about 25 clock cycles latency. A single 32bit integer division has 22 to 29 cycles latency. Integer Newton iterations don't do well from a throughput perspective, either. – Peter Cordes Dec 26 '15 at 22:42
4

This is so short that it 99% inlines:

static inline int sqrtn(int num) {
    int i = 0;
    __asm__ (
        "pxor %%xmm0, %%xmm0\n\t"   // clean xmm0 for cvtsi2ss
        "cvtsi2ss %1, %%xmm0\n\t"   // convert num to float, put it to xmm0
        "sqrtss %%xmm0, %%xmm0\n\t" // square root xmm0
        "cvttss2si %%xmm0, %0"      // float to int
        :"=r"(i):"r"(num):"%xmm0"); // i: result, num: input, xmm0: scratch register
    return i;
}

Why clean xmm0? Documentation of cvtsi2ss

The destination operand is an XMM register. The result is stored in the low doubleword of the destination operand, and the upper three doublewords are left unchanged.

GCC Intrinsic version (runs only on GCC):

#include <xmmintrin.h>
int sqrtn2(int num) {
    register __v4sf xmm0 = {0, 0, 0, 0};
    xmm0 = __builtin_ia32_cvtsi2ss(xmm0, num);
    xmm0 = __builtin_ia32_sqrtss(xmm0);
    return __builtin_ia32_cvttss2si(xmm0);
}

Intel Intrinsic version (tested on GCC, Clang, ICC):

#include <xmmintrin.h>
int sqrtn2(int num) {
    register __m128 xmm0 = _mm_setzero_ps();
    xmm0 = _mm_cvt_si2ss(xmm0, num);
    xmm0 = _mm_sqrt_ss(xmm0);
    return _mm_cvtt_ss2si(xmm0);
}

^^^^ All of them require SSE 1 (not even SSE 2).

Note: This is exactly how GCC calculates (int) sqrt((float) num) with -Ofast. If you want higher accuracy for larger i, then we can calculate (int) sqrt((double) num) (as noted by Gumby The Green in the comments):

static inline int sqrtn(int num) {
    int i = 0;
    __asm__ (
        "pxor %%xmm0, %%xmm0\n\t"
        "cvtsi2sd %1, %%xmm0\n\t"
        "sqrtsd %%xmm0, %%xmm0\n\t"
        "cvttsd2si %%xmm0, %0"
        :"=r"(i):"r"(num):"%xmm0");
    return i;
}

or

#include <xmmintrin.h>
int sqrtn2(int num) {
    register __v2df xmm0 = {0, 0};
    xmm0 = __builtin_ia32_cvtsi2sd(xmm0, num);
    xmm0 = __builtin_ia32_sqrtsd(xmm0);
    return __builtin_ia32_cvttsd2si(xmm0);
}
MCCCS
  • 1,002
  • 3
  • 20
  • 44
  • Do you really need to zero xmm0 before you copy a value into it? Also, inlining can disable optimization for surrounding code. What about using builtins (ie `__builtin_ia32_cvtsi2ss`, `__builtin_ia32_sqrtss` `__builtin_ia32_cvtss2si`) from [here](https://gcc.gnu.org/onlinedocs/gcc/x86-Built-in-Functions.html)? When in comes to using inline asm, [less is more](https://gcc.gnu.org/wiki/DontUseInlineAsm). – David Wohlferd Jun 29 '18 at 20:03
  • 1
    The Intel Intrinsic version doesn't round down - it rounds to the nearest int. To fix this, add a `t` to its last line: `return _mm_cvtt_ss2si(xmm0);`. These are 5-6x faster than `sqrt()` on my machine) but wrong answers start appearing when `num` >= 16,785,407 due to rounding errors on the float. To fix this in the GCC Intrinsic version, change the first line to `__v2df xmm0 = {0, 0};` and replace each `ss` with `sd` (warning: cuts the speed in half). I don't see a `_mm_cvt_si2sd()` in Intel's [Intrinsics Guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide) for some reason. – Gumby The Green May 10 '19 at 02:24
  • Are you sure that `cvtsi2ss %1` is actually safe? Assuming `%1` is `edi`, then there could be some bits in upper part of `rdi`. Maybe you wouldn't notice it in your own tests, but it could happen in principle. – Jan Schultke Aug 17 '20 at 18:52
  • @J.Schultke Good catch I'll initialize `i` – MCCCS Aug 17 '20 at 19:48
3

In many cases, even exact integer sqrt value is not needed, enough having good approximation of it. (For example, it often happens in DSP optimization, when 32-bit signal should be compressed to 16-bit, or 16-bit to 8-bit, without loosing much precision around zero).

I've found this useful equation:

k = ceil(MSB(n)/2); - MSB(n) is the most significant bit of "n"


sqrt(n) ~= 2^(k-2)+(2^(k-1))*n/(2^(2*k))); - all multiplications and divisions here are very DSP-friendly, as they are only 2^k.

This equation generates smooth curve (n, sqrt(n)), its values are not very much different from real sqrt(n) and thus can be useful when approximate accuracy is enough.

Waldo Jeffers
  • 2,289
  • 1
  • 15
  • 19
Andrey
  • 31
  • 1
  • Thanks for sharing. This is very useful on μcontrollers that do not have a floating point unit. – Bram Apr 19 '22 at 05:09
2

Why nobody suggests the quickest method?

If:

  1. the range of numbers is limited
  2. memory consumption is not crucial
  3. application launch time is not critical

then create int[MAX_X] filled (on launch) with sqrt(x) (you don't need to use the function sqrt() for it).

All these conditions fit my program quite well. Particularly, an int[10000000] array is going to consume 40MB.

What's your thoughts on this?

Mihai Iorga
  • 39,330
  • 16
  • 106
  • 107
flybot1
  • 37
  • 2
1

On my computer with gcc, with -ffast-math, converting a 32-bit integer to float and using sqrtf takes 1.2 s per 10^9 ops (without -ffast-math it takes 3.54 s).

The following algorithm uses 0.87 s per 10^9 at the expense of some accuracy: errors can be as much as -7 or +1 although the RMS error is only 0.79:

uint16_t SQRTTAB[65536];

inline uint16_t approxsqrt(uint32_t x) { 
  const uint32_t m1 = 0xff000000;
  const uint32_t m2 = 0x00ff0000;
  if (x&m1) {
    return SQRTTAB[x>>16];
  } else if (x&m2) {
    return SQRTTAB[x>>8]>>4;
  } else {
    return SQRTTAB[x]>>8;
  }
}

The table is constructed using:

void maketable() {
  for (int x=0; x<65536; x++) {
    double v = x/65535.0;
    v = sqrt(v);
    int y = int(v*65535.0+0.999);
    SQRTTAB[x] = y;
  }
}

I found that refining the bisection using further if statements does improve accuracy, but it also slows things down to the point that sqrtf is faster, at least with -ffast-math.

DanielW
  • 19
  • 1
1

Or just do a binary search, cant write a simpler version imo:

uint16_t sqrti(uint32_t num)
{
    uint16_t ret = 0;
    for(int32_t i = 15; i >= 0; i--)
    {
        uint16_t temp = ret | (1 << i);
        if(temp * temp <= num)
        {
            ret = temp;
        }
    }
    return ret;
}
Jonathan S
  • 11
  • 1
1

This is an addition for those in need of a precide square root for very large integers. The trick is to leverage the fast floating point square root of modern processors and to fix round-off errors.

template <typename T>
T preciseIntegerSqrt(T n)
{
    if (sizeof(T) <= 4)
    {
        return std::sqrt((double)n);
    }
    else if (sizeof(T) <= 8)
    {
        T r = std::sqrt((double)n);
        return r - (r*r >= n+1);
    }
    else
    {
        if (n == 0) return 0;
        T r = 0;
        for (T b = (T(1)) << ((std::bit_width(n)-1) / 2); b != 0; b >>= 1)
        {
            T const k = (b + 2*r) * b;
            r |= (n >= k) * b;
            n -= (n >= k) * k;
        }
        return r;
    }
}

Explanation: Integers of up to 32 bits do not need a correction, since they can be represented precisely as double-precision floating point numbers. 64-bit integers get along with a very cheap correction. For the general case, refer to Jan Schultke's excellent answer. The code provided here is very slightly faster that that one (10% on my machine, may vary with integer type and hardware).

tglas
  • 949
  • 10
  • 19
  • This is a good approach. But when T is unsigned, the middle case also needs a check for n == 0. Otherwise r*r-1 overflows and makes the wrong correction. – Tom 7 Jun 19 '23 at 03:19
  • @Tom7: Thanks, good catch. As a fix, I simply moved the `1` to the right hand side. – tglas Jun 20 '23 at 05:55
1

I tried all of the exact solutions in here. Using the built-in sqrt on doubles and correcting is by far the fastest on my machine (Threadripper 2; about 10x faster than the integer versions). But the fastest pure integer version was actually this recursive one:

uint64_t Sqrt64(uint64_t xx) {
  if (xx <= 1) return xx;
  uint64_t z = xx >> 2;
  uint64_t r2 = 2 * Sqrt64(z);
  uint64_t r3 = r2 + 1;
  return (xx < r3 * r3) ? r2 : r3;
}

This approach is very similar to several others in here, although I think less mysterious. (The idea is that if you get the square root of the number divided by 4, twice that will be close to the square root of the target number, as 2*2 = 4. But you may need to add one to deal with the round-off error.) I was very surprised that a recursive implementation would beat a loop!

As a bonus, this one is proved correct here.

Tom 7
  • 507
  • 3
  • 10
0

If you need performance on computing square root, I guess you will compute a lot of them. Then why not caching the answer? I don't know the range for N in your case, nor if you will compute many times the square root of the same integer, but if yes, then you can cache the result each time your method is called (in an array would be the most efficient if not too large).

Benoit Thiery
  • 6,325
  • 4
  • 22
  • 28