54

I found a rather strange but working square root approximation for floats; I really don't get it. Can someone explain to me why this code works?

float sqrt(float f)
{
    const int result = 0x1fbb4000 + (*(int*)&f >> 1);
    return *(float*)&result;   
}

I've tested it a bit and it outputs values that are off from std::sqrt() by about 1 to 3%. I know of Quake III's fast inverse square root and I guess this is something similar (without the Newton iteration), but I'd really appreciate an explanation of how it works.

(Note: I've tagged it both C and C++ since it's valid-ish (see comments) as both C and C++ code.)

YSC
  • 25
    It's not valid C nor valid C++. It violates aliasing rules and it assumes a particular representation for floating-point values and for `int` values. That makes it hacker code, which is sometimes intriguing but generally not to be emulated. – Pete Becker Mar 30 '17 at 14:03
  • 7
    This is some kind of friend of the [other magic number](https://en.wikipedia.org/wiki/Fast_inverse_square_root) `0x5f3759df` – Eugene Sh. Mar 30 '17 at 14:03
  • 2
    sounds like "Approximations that depend on the floating point representation" as described in https://en.wikipedia.org/wiki/Methods_of_computing_square_roots – Jean-François Fabre Mar 30 '17 at 14:03
  • 11
    Approximately speaking, right-shifting the bitwise representation of `f` divides the exponent by two, which is equivalent to taking the square root. Everything else is presumably magic to improve accuracy across the mantissa range. – Oliver Charlesworth Mar 30 '17 at 14:06
  • 5
    *divides the exponent by two, which is equivalent to taking the square root* what – Fureeish Mar 30 '17 at 14:10
  • 2
    f = m * 2^x, sqrt(f) = sqrt(m) * 2^(x / 2) – Goswin von Brederlow Mar 30 '17 at 14:10
  • 12
    @Fureeish - sqrt(a^b) = (a^b)^0.5 = a^(b/2). – Oliver Charlesworth Mar 30 '17 at 14:11
  • 2
    @Fureeish: That's what square root _means_! – Lightness Races in Orbit Mar 30 '17 at 14:51
  • 2
    @BoundaryImposition it may be some approximation of square root but obviously simply dividing the exponent is not what square root means. Goswin von Brederlow's comment is more correct – phuclv Mar 30 '17 at 14:56
  • 3
    @LưuVĩnhPhúc: For positive numbers, it literally is. sqrt(x) is x^0.5 if x is positive. That's a strict and unbreakable equivalence, not "some approximation". Granted things get more complicated for other xs. – Lightness Races in Orbit Mar 30 '17 at 15:01
  • 3
    @BoundaryImposition but this is about an IEEE-754 value in `m * 2^x` format, not an arbitrary number x – phuclv Mar 30 '17 at 15:04
  • @EugeneSh. The article mentions that the true magic number for that method is `0x5f375a86`. – Random832 Mar 31 '17 at 04:36
  • 5
    @PeteBecker: It is absolutely valid C and C++ code. However, its behaviour is implementation-defined. Do not confuse invalid and non-portable; they are not the same thing. – Jack Aidley Mar 31 '17 at 11:35
  • 4
    @JackAidley -- its behavior is undefined because it violates the strict aliasing rules. It's the kind of code that people wrote 30 years ago, back when C coding was wild and woolly, but the language definition has changed since then. And note that "implementation defined" means that the implementation must document what it does. There is no such requirement for this kind of cast. – Pete Becker Mar 31 '17 at 11:48
  • @YSC This [question](http://stackoverflow.com/questions/32042673/optimized-low-accuracy-approximation-to-rootnx-n) shows the general construction principle behind this kind of approximation. – njuffa Apr 01 '17 at 16:25

4 Answers

75

(*(int*)&f >> 1) right-shifts the bitwise representation of f. This almost divides the exponent by two, which is approximately equivalent to taking the square root.1

Why almost? In IEEE-754, the actual exponent is e - 127.2 To divide this by two, we'd need e/2 - 64, but the above approximation only gives us e/2 - 127. So we need to add on 63 to the resulting exponent. This is contributed by bits 30-23 of that magic constant (0x1fbb4000).

I'd imagine the remaining bits of the magic constant have been chosen to minimise the maximum error across the mantissa range, or something like that. However, it's unclear whether it was determined analytically, iteratively, or heuristically.
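
For the curious, here's a minimal sketch of that anatomy (my own illustration, assuming a 32-bit uint32_t just as the hack assumes a 32-bit int):

#include <cstdint>
#include <cstdio>

int main()
{
  const std::uint32_t magic = 0x1fbb4000u;
  // Bits 30-23 land in the exponent field: 0x3f == 63, the bias correction above.
  std::printf("exponent bits: %u\n", (unsigned)((magic >> 23) & 0xffu));
  // Bits 22-0 land in the mantissa field and fine-tune the result.
  std::printf("mantissa bits: 0x%06x\n", (unsigned)(magic & 0x7fffffu));
}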


It's worth pointing out that this approach is somewhat non-portable. It makes (at least) the following assumptions:

  • The platform uses single-precision IEEE-754 for float.
  • The endianness of float representation.
  • That you will be unaffected by the undefined behaviour that arises because this approach violates C/C++'s strict-aliasing rules.

Thus it should be avoided unless you're certain that it gives predictable behaviour on your platform (and indeed, that it provides a useful speedup vs. sqrtf!).
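
As a point of reference, here's a sketch of the same hack with the aliasing violation removed via memcpy() (it still assumes a 32-bit IEEE-754 float that shares the integer's endianness):

#include <cstdint>
#include <cstring>

float sqrt_approx(float f)
{
  static_assert(sizeof(std::uint32_t) == sizeof(float), "requires a 32-bit float");
  std::uint32_t bits;
  std::memcpy(&bits, &f, sizeof bits);  // well-defined, unlike the pointer cast
  bits = 0x1fbb4000u + (bits >> 1);
  std::memcpy(&f, &bits, sizeof f);
  return f;
}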


1. sqrt(a^b) = (a^b)^0.5 = a^(b/2)

2. See e.g. https://en.wikipedia.org/wiki/Single-precision_floating-point_format#Exponent_encoding

Oliver Charlesworth
  • 2
    It can also be a consequence of [Math Goblins](http://www.smbc-comics.com/comic/monty-hall-problems) :) –  Mar 30 '17 at 14:23
  • @SembeiNorimaki Or [nasal demons](http://www.catb.org/jargon/html/N/nasal-demons.html)? – YSC Mar 30 '17 at 14:25
  • I didn't downvote, but this answer _really_ needs to point out how non-portable this "hack" is – Lightness Races in Orbit Mar 30 '17 at 14:47
  • @BoundaryImposition - Done. – Oliver Charlesworth Mar 30 '17 at 14:50
  • 6
    IMO, the probability of hitting these "non-portabilities" is close to zero. IEEE-754 is universally adopted, machines with different integer and floating-point endianness are (were) exceptional. –  Mar 30 '17 at 15:04
  • 2
    @YvesDaoust - I've worked with IBM floating-point format once before ;) Though IMO, the strict-aliasing issue would be the most likely to occur in practice. – Oliver Charlesworth Mar 30 '17 at 15:05
  • @OliverCharlesworth: so did I, but not down to the specifics of the representation. Also VAXen. –  Mar 30 '17 at 15:08
  • @YvesDaoust: It's still worth knowing that these facts are not guaranteed by C++. – Lightness Races in Orbit Mar 30 '17 at 15:18
  • 1
    @BoundaryImposition: why is this worth knowing? –  Mar 30 '17 at 15:22
  • 4
    Nice answer. Detail: "The endianness of float representation." is relative to the endianness of the `int`. If they are both big or both little, then endianness is not an issue. – chux - Reinstate Monica Mar 30 '17 at 15:29
  • @chux - I think it goes slightly further than that, in the sense that (presumably) a float could have its bits/components stored in any order at all, theoretically. Or to put it another way, I'm not sure that "big/little-endian" is a well-defined concept for floats. – Oliver Charlesworth Mar 30 '17 at 15:46
  • @YvesDaoust: Because, well, you know what they say about assumptions. Especially in computer programs. Having knowledge is better than not having knowledge, isn't it? Or would you rather deliberately withhold this information because you think the probability of it mattering in general is low? Even though we don't know anything about the domain in which individual SO readers work? That seems very irresponsible, to me. – Lightness Races in Orbit Mar 30 '17 at 15:52
  • @BoundaryImposition: you are right. Asking such a question, the OP might be working with some obscure embedded platform built around a 16 bits controller. I'd better withdraw my earlier comments. –  Mar 30 '17 at 15:55
  • @YvesDaoust: It's also future visitors we have to be mindful of. – Lightness Races in Orbit Mar 30 '17 at 15:58
  • @OliverCharlesworth: The proper way to handle aliasing is simply to recognize the existence of dialects of C with different aliasing rules ranging from gcc (ignore aliasing even when the Standard would mandate recognition, as with the Common Initial Sequence rule), C99, C89, common-sense (what the authors of C89 probably had in mind as being typical for quality general-purpose implementations), and precise (always behave exactly as though all accesses are volatile). This code would work with the last two. – supercat Mar 30 '17 at 23:28
  • 4
    This *does* violate the strict aliasing rule, unquestionably, in both C and C++. "That you won't fall foul of C/C++'s strict-aliasing rules." suggests that it may or may not. It is well known that modern compilers aggressively perform TBAA, the trail of history is littered with the carcasses of people who thought " the probability of hitting these non-portabilities is close to zero". I would like to see the answer clearly state that it does violate the rule, and OP should either modify the code or invoke the compiler with TBAA disabled (gcc and clang have a switch for this) – M.M Mar 31 '17 at 00:32
  • 4
    Another portability issue not mentioned yet is that right-shifting a negative `int` produces an implementation-defined value – M.M Mar 31 '17 at 00:39
  • 1
    @M.M Absolutely true, albeit mitigated because a negative `int` should only be generated by a negative `float` if the endianness assumption holds, and passing a negative number to a real-valued square root function is a logic error anyway. Might be an issue for `sqrt(-0.0F)`? This version does not check its input, though! Another is that the addition could potentially overflow. Yet another is that `int` and `float` might not be the same size. I address these in my own answer. – Davislor Mar 31 '17 at 06:21
18

See Oliver Charlesworth’s explanation of why this almost works. I’m addressing an issue raised in the comments.

Since several people have pointed out the non-portability of this, here are some ways you can make it more portable, or at least make the compiler tell you if it won’t work.

First, C++ lets you check std::numeric_limits<float>::is_iec559 at compile time, such as in a static_assert. You can also check that sizeof(int) == sizeof(float), which will not be true if int is 64 bits. But what you really want to use is uint32_t, which, if it exists, is always exactly 32 bits wide, has well-defined behavior with shifts and overflow, and will cause a compilation error if your weird architecture has no such integral type. Either way, you should also static_assert() that the types have the same size. Static assertions have no run-time cost, and you should always check your preconditions this way if possible.

Unfortunately, the test of whether reinterpreting the bits of a float as a uint32_t and shifting behaves in a big-endian, little-endian or other fashion cannot be computed as a compile-time constant expression. Here, I put the run-time check in the part of the code that depends on it, but you might want to put it in the initialization and do it once. In practice, both gcc and clang can optimize this test away at compile time.

You do not want to use the unsafe pointer cast; there are systems I've worked on in the real world where that could crash the program with a bus error. The maximally-portable way to convert object representations is memcpy(). In my example below, I type-pun with a union, which works on any actually-existing implementation. (Language lawyers object to it, but no successful compiler will ever break that much legacy code silently.) If you must do a pointer conversion (see below), there is alignas(). But however you do it, the result will be implementation-defined, which is why we check the result of converting and shifting a test value.

Anyway, not that you're likely to need it on a modern CPU, but here's a gussied-up C++14 version that checks those non-portable assumptions:

#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <iomanip>
#include <iostream>
#include <limits>
#include <vector>

using std::cout;
using std::endl;
using std::size_t;
using std::sqrt;
using std::uint32_t;

template <typename T, typename U>
  inline T reinterpret(const U x)
/* Reinterprets the bits of x as a T.  Cannot be constexpr
 * in C++14 because it reads an inactive union member.
 */
{
  static_assert( sizeof(T)==sizeof(U), "" );
  union tu_pun {
    U u = U();
    T t;
  };
  
  const tu_pun pun{x};
  return pun.t;
}

constexpr float source = -0.1F;
constexpr uint32_t target = 0x5ee66666UL;

const uint32_t after_rshift = reinterpret<uint32_t,float>(source) >> 1U;
const bool is_little_endian = after_rshift == target;

float est_sqrt(const float x)
/* A fast approximation of sqrt(x) that works less well for subnormal numbers.
 */
{
  static_assert( std::numeric_limits<float>::is_iec559, "" );
  assert(is_little_endian); // Could provide alternative big-endian code.
  
 /* The algorithm relies on the bit representation of normal IEEE floats, so
  * a subnormal number as input might be considered a domain error as well?
  */
  if ( std::isless(x, 0.0F) || !std::isfinite(x) )
    return std::numeric_limits<float>::signaling_NaN();
  
  constexpr uint32_t magic_number = 0x1fbb4000UL;
  const uint32_t raw_bits = reinterpret<uint32_t,float>(x);
  const uint32_t rejiggered_bits = (raw_bits >> 1U) + magic_number;
  return reinterpret<float,uint32_t>(rejiggered_bits);
}

int main(void)
{  
  static const std::vector<float> test_values{
    4.0F, 0.01F, 0.0F, 5e20F, 5e-20F, 1.262738e-38F };
  
  for ( const float& x : test_values ) {
    const double gold_standard = sqrt((double)x);
    const double estimate = est_sqrt(x);
    const double error = estimate - gold_standard;
    
    cout << "The error for (" << estimate << " - " << gold_standard << ") is "
         << error;

    if ( gold_standard != 0.0 && std::isfinite(gold_standard) ) {
      const double error_pct = error/gold_standard * 100.0;
      cout << " (" << error_pct << "%).";
    } else
      cout << '.';

    cout << endl;
  }

  return EXIT_SUCCESS;
}

Update

Here is an alternative definition of reinterpret<T,U>() that avoids type-punning. You could also implement the type-pun in modern C, where it's allowed by the standard, and call the function as extern "C". I think type-punning is more elegant, type-safe and consistent with the quasi-functional style of this program than memcpy() is. I also don't think you gain much, because you could still have undefined behavior from a hypothetical trap representation. Also, clang++ 3.9.1 -O -S is able to statically analyze the type-punning version, optimize the variable is_little_endian to the constant 0x1, and eliminate the run-time test, but it can only optimize this version down to a single-instruction stub.

But more importantly, this code isn't guaranteed to work portably on every compiler. For example, some old computers can't even address exactly 32 bits of memory. But in those cases, it should fail to compile and tell you why. No compiler is just suddenly going to break a huge amount of legacy code for no reason. Although the standard technically gives permission to do that and still call itself conforming to C++14, that will only happen on an architecture very different from what we expect. And if our assumptions are so invalid that some compiler would turn a type-pun between a float and a 32-bit unsigned integer into a dangerous bug, I really doubt the logic behind this code would hold up if we just used memcpy() instead. We want that code to fail at compile time, and to tell us why.

#include <cassert>
#include <cstdint>
#include <cstring>

using std::memcpy;
using std::uint32_t;

template <typename T, typename U> inline T reinterpret(const U &x)
/* Reinterprets the bits of x as a T.  Cannot be constexpr
 * in C++14 because it modifies a variable.
 */
{
  static_assert( sizeof(T)==sizeof(U), "" );
  T temp;
  
  memcpy( &temp, &x, sizeof(T) );
  return temp;
}

constexpr float source = -0.1F;
constexpr uint32_t target = 0x5ee66666UL;

const uint32_t after_rshift = reinterpret<uint32_t,float>(source) >> 1U;
extern const bool is_little_endian = after_rshift == target;

However, Stroustrup et al., in the C++ Core Guidelines, recommend a reinterpret_cast instead:

#include <cassert>

template <typename T, typename U> inline T reinterpret(const U x)
/* Reinterprets the bits of x as a T.  Cannot be constexpr
 * in C++14 because it uses reinterpret_cast.
 */
{
  static_assert( sizeof(T)==sizeof(U), "" );
  const U temp alignas(T) alignas(U) = x;
  return *reinterpret_cast<const T*>(&temp);
}

The compilers I tested can also optimize this away to a folded constant. Stroustrup’s reasoning is [sic]:

Accessing the result of an reinterpret_cast to a different type from the objects declared type is still undefined behavior, but at least we can see that something tricky is going on.

Update

From the comments: C++20 introduces std::bit_cast, which converts an object representation to a different type with unspecified, not undefined, behavior. This doesn’t guarantee that your implementation will use the same format of float and int that this code expects, but it doesn’t give the compiler carte blanche to break your program arbitrarily because there’s technically undefined behavior in one line of it. It can also give you a constexpr conversion.
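
For instance, here's a sketch of the same approximation using std::bit_cast (the function name is my own):

#include <bit>
#include <cstdint>

constexpr float est_sqrt_cxx20(const float x)
/* Same bit trick; std::bit_cast fails to compile if the sizes differ,
 * avoids the strict-aliasing problem, and can be used in constant expressions.
 */
{
  const auto bits = std::bit_cast<std::uint32_t>(x);
  return std::bit_cast<float>((bits >> 1) + 0x1fbb4000u);
}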

Davislor
  • 1
    In C++ it's undefined behaviour to read a different member of a union than the one last written (see [this answer](http://stackoverflow.com/a/11996970/1505939), particularly the final paragraph) – M.M Mar 31 '17 at 00:36
  • Edited that paragraph. Yes, it is officially a language extension that every major compiler just happens to support. If you really want IDB instead of UB, use `memcpy()`. You still risk getting a trap representation. I think the code I wrote is safer and more elegant than `memcpy()`, though. It’s type-safe, it’s pure functional-style code where no variable is modified, and it can be statically analyzed (even constant-folded). There is a separate check later that the results are what we expect. And if type-punning is verboten, in the future, any sane compiler will give us a compile-time error. – Davislor Mar 31 '17 at 01:45
  • @M.M Added a version that uses `memcpy()` and an explanation of the issue you raise. – Davislor Mar 31 '17 at 06:17
  • 1
    @Davislor: Regarding the union issue, I would note that Undefined Behavior leaves the implementation free to do whatever they want; the implementation guaranteeing that type-punning via unions work is well within the perimeter of "whatever they want", and therefore if you have a guarantee for your compilers of interest, you are good to go. – Matthieu M. Mar 31 '17 at 08:12
  • @MatthieuM, It’s a controversial opinion, and Stroustrup’s C++ core guidelines recommend dereferencing a `reinterpret_cast` on a pointer instead. (I’ve edited my answer to give a citation and example.) But basically, no compiler’s going to break the type-punning idiom for no reason, in a case as trivial as punning between an arithmetic type and an unsigned integer the same size. If that doesn’t do what we expect, our basic assumptions about the architecture don’t hold. And in that case, I want the compiler to flag this as a logic error at compile time. – Davislor Mar 31 '17 at 09:04
  • @MatthieuM. Every C compiler this century has had to support type-punning through unions, so every C++ compiler I’ve tested can handle it. In practice, more compilers have problems with pointer aliases of different types than inactive members of POD unions. (I tried in my example to simplify the job of a static code analyzer as much as possible: the aliased variable is a constant, set at initialization.) I prefer the union version myself, which most cleanly expresses what I want to do, but this is inherently non-portable code. What we can do is sanity-check it. – Davislor Mar 31 '17 at 09:25
  • @Davislor: Which is exactly my point; if your compiler(s) make(s) the guarantee for you, sure it's not portable... but it works for you :D And yes, most compilers do allow it "directly" (avoid passing pointers to the members to opaque functions). – Matthieu M. Mar 31 '17 at 09:28
  • @MatthieuM. You’ll notice that all variables in my program are constants (technically, with the exception of the contents of STL containers), which completely eliminates bugs caused by the optimizer not updating a different alias’ view of a variable when it changes. Not always an option, but I always try to do it when I can. – Davislor Mar 31 '17 at 09:37
  • The second half of C.183 is complete nonsense. It's perfectly valid to `reinterpret_cast` to `unsigned char`, `char`, or in C++17 `std::byte` and access the object representation; there's an exception to the strict aliasing rule for those types. It's not clear whether they intended the suggestion to extend to other types; if they did, such advice is dubious at best. – T.C. Apr 03 '17 at 07:05
  • @T.C. To be clear, Stroustrup said to use `reinterpret_cast` instead of type-punning through a `union`. Even copying the bytes of the object representation, which is what the `memcpy()` version does, only is defined if both types are trivially copyable. (I should probably have made the code check that, too.) Using the `float` we get back is still undefined behavior, because we might in theory have gotten a trap representation, although the checks we did that it’s an IEEE 32-bit float with the same endianness as a `uint32_t` *should* stop that from happening in this case. – Davislor Apr 03 '17 at 07:26
  • In C++20, the type reinterpretation can be done via [std::bit_cast](https://en.cppreference.com/w/cpp/numeric/bit_cast). – pigi5 Jan 07 '22 at 20:48
  • @pigi5 Thanks! That’s worth an update. – Davislor Jan 07 '22 at 21:43
8

Let y = sqrt(x),

it follows from the properties of logarithms that log2(y) = 0.5 * log2(x) (1)

Interpreting the bits of a normal float as an integer gives INT(x) = Ix = L * (log2(x) + B - σ) (2)

where L = 2^N, N is the number of significand bits, B is the exponent bias, and σ is a free parameter for tuning the approximation.

Combining (1) and (2) gives: Iy = 0.5 * (Ix + L * (B - σ))

which is written in the code as (*(int*)&x >> 1) + 0x1fbb4000;

Finding the σ for which the constant equals 0x1fbb4000, and determining whether it's optimal, is left as an exercise; a quick computation is sketched below.
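
Plugging in L = 2^23 and B = 127 for single precision, here's a quick sketch of that computation (my own illustration; easy to check by hand):

#include <cstdio>

int main()
{
  const double L = 1 << 23;  // 2^23, N = 23 significand bits
  const double B = 127.0;    // single-precision exponent bias
  // 0x1fbb4000 == 0.5 * L * (B - sigma)  =>  sigma = B - 0x1fbb4000 / (0.5 * L)
  const double sigma = B - 0x1fbb4000 / (0.5 * L);
  std::printf("sigma = %.8f\n", sigma);  // prints: sigma = 0.07421875
}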

Michael Foukarakis
6

Adding a community-wiki test harness to test every float.

The approximation is within 4% for most floats, but very poor for subnormal numbers. YMMV.

Worst:1.401298e-45 211749.20%
Average:0.63%
Worst:1.262738e-38 3.52%
Average:0.02%

Note that with an argument of ±0.0, the result is not zero.

printf("% e % e\n", sqrtf(+0.0), sqrt_apx(0.0));  //  0.000000e+00  7.930346e-20
printf("% e % e\n", sqrtf(-0.0), sqrt_apx(-0.0)); // -0.000000e+00 -2.698557e+19

Test code

#include <float.h>
#include <limits.h>
#include <math.h>
#include <stddef.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

float sqrt_apx(float f) {
  const int result = 0x1fbb4000 + (*(int*) &f >> 1);
  return *(float*) &result;
}

double error_value = 0.0;
double error_worst = 0.0;
double error_sum = 0.0;
unsigned long error_count = 0;

void sqrt_test(float f) {
  if (f == 0) return;
  volatile float y0 = sqrtf(f);
  volatile float y1 = sqrt_apx(f);
  double error = (1.0 * y1 - y0) / y0;
  error = fabs(error);
  if (error > error_worst) {
    error_worst = error;
    error_value = f;
  }
  error_sum += error;
  error_count++;
}

void sqrt_tests(float f0, float f1) {
  error_value = error_worst = error_sum = 0.0;
  error_count = 0;
  for (;;) {
    sqrt_test(f0);
    if (f0 == f1) break;
    f0 = nextafterf(f0, f1);
  }
  printf("Worst:%e %.2f%%\n", error_value, error_worst*100.0);
  printf("Average:%.2f%%\n", error_sum / error_count);
  fflush(stdout);
}

int main() {
  sqrt_tests(FLT_TRUE_MIN, FLT_MIN);
  sqrt_tests(FLT_MIN, FLT_MAX);
  return 0;
}
chux - Reinstate Monica
  • Real question- assuming we can work in the 'bigger numbers' section, how does it compare timing-wise to `sqrtf()`? Could it be a fast approximation? I could see worth in real-time physics simulations if we need 'okayish' approximations but a sqrt being off by 0.02% on average being fine if it's quick. – Delioth Mar 30 '17 at 21:14
    @Delioth Could it be a fast approximation? Of course it could; it could also be slower. With a processor w/o FP math, certainly `sqrt_apx()` is faster. With an advanced processor, perhaps faster given optimized code and parallel processing. One needs to set up a specific situation. Do recall that `sqrt_apx(0.0)` is not 0, and that could create real problems. It is all very case-dependent. Perhaps you could try a simulation and post your results? – chux - Reinstate Monica Mar 30 '17 at 21:45
  • 1
    You do not need to test all floats! For normalized numbers, you only need to test two consecutive binades B_1 and B_2. What happens in binade B_(n+2) is isomorphic to what happens in binade B_n (note that when `f` is moved two binades up, `*(int*)&f >> 1` is moved one binade up). – Pascal Cuoq Mar 31 '17 at 08:27
    @PascalCuoq Certainly few `float` need to be tested, yet testing them all is not too time-consuming. With `double`, your idea has even stronger merit given the 2^64 combinations. Although `B_1` and `B_2` are important, additional select FP values that cause the sum with `0x1fbb4000` to change powers of 2 deserve assessment too. – chux - Reinstate Monica Mar 31 '17 at 13:31
  • @PascalCuoq Feel open to apply your improvements to the answer. – chux - Reinstate Monica Mar 31 '17 at 13:32
  • Hm.. with a little research, it seems like many hardware implementations have an actual sqrt branch, so it may be unlikely that this would be fast. Though I'm unsure of whether GPUs have a similar branch, it might be worthwhile. Sometime I might come back to this if I get really bored and start playing with CUDA or something, but as of yet I agree that it's implementation specific. – Delioth Mar 31 '17 at 13:57