
John Carmack has a special function in the Quake III source code which calculates the inverse square root of a float, 4x faster than the regular (float)(1.0/sqrt(x)), using a strange constant 0x5f3759df. See the code below. Can someone explain line by line what exactly is going on here and why it works so much faster than the regular implementation?

float Q_rsqrt( float number )
{
  long i;
  float x2, y;
  const float threehalfs = 1.5F;

  x2 = number * 0.5F;
  y  = number;
  i  = * ( long * ) &y;
  i  = 0x5f3759df - ( i >> 1 );
  y  = * ( float * ) &i;
  y  = y * ( threehalfs - ( x2 * y * y ) );

  #ifndef Q3_VM
  #ifdef __linux__
    assert( !isnan(y) );
  #endif
  #endif
  return y;
}
David Foerster
Alex
    This has been written about zillions of times. See: http://www.google.com/search?q=0x5f3759df – Greg Hewgill Aug 28 '09 at 21:47
    Thanks, though. This was a much more interesting question than "how do you make a positive number negative in C#?" – MusiGenesis Aug 28 '09 at 21:59
    Wasn't Carmack. http://en.wikipedia.org/wiki/Fast_inverse_square_root – h4xxr Aug 28 '09 at 22:28
    [Here's an explanation](http://www.beyond3d.com/content/articles/8/) – sepp2k Aug 28 '09 at 21:47
    In this line `i = * ( long * ) &y;` why is the address of y taken as a pointer to a long then dereferenced again? – Nubcake Jul 27 '17 at 14:50
    @Nubcake: because `y` is a `float`, and this is type-punning it to an integer. (Unsafely, because it violates C's strict-aliasing rules. A `union` in C99, or `memcpy` in C89 / C++ would do the same thing following the language rules, and compile the same at least with modern optimizing compilers.) – Peter Cordes Dec 14 '17 at 03:24
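
For readers who want a version of the bit trick without the strict-aliasing issue Peter Cordes mentions, here is a minimal sketch (not the original Quake code; the name InvSqrt_punned and the use of memcpy with uint32_t are my choices) showing the same algorithm with well-defined type punning:

#include <stdint.h>
#include <string.h>

float InvSqrt_punned( float number )           /* illustrative name, same algorithm */
{
    uint32_t i;
    float y = number;

    memcpy( &i, &y, sizeof i );                /* read the float's bits without aliasing UB */
    i = 0x5f3759df - ( i >> 1 );               /* magic initial guess */
    memcpy( &y, &i, sizeof y );                /* write the bits back into a float */

    y = y * ( 1.5f - 0.5f * number * y * y );  /* one Newton step, as in the original */
    return y;
}

As the comment above notes, modern optimizing compilers turn the memcpy calls into plain register moves, so this costs nothing over the pointer-cast version.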

6 Answers


FYI. Carmack didn't write it. Terje Mathisen and Gary Tarolli both take partial (and very modest) credit for it, as well as crediting some other sources.

How the mythical constant was derived is something of a mystery.

To quote Gary Tarolli:

Which actually is doing a floating point computation in integer - it took a long time to figure out how and why this works, and I can't remember the details anymore.

A slightly better constant, developed by an expert mathematician (Chris Lomont) while trying to work out how the original algorithm worked, is:

float InvSqrt(float x)
{
    float xhalf = 0.5f * x;
    int i = *(int*)&x;              // get bits for floating value
    i = 0x5f375a86 - (i >> 1);      // gives initial guess y0
    x = *(float*)&i;                // convert bits back to float
    x = x * (1.5f - xhalf * x * x); // Newton step, repeating increases accuracy
    return x;
}

In spite of this, his initial attempt at a mathematically 'superior' version of id's sqrt (which came to almost the same constant) proved inferior to the one initially developed by Gary, despite being mathematically much 'purer'. He couldn't explain why id's was so excellent, iirc.
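
If you want to check that claim for yourself, here is a small test harness (my sketch, not from Lomont's paper; the helper names rsqrt_with_magic and max_rel_error are made up) that measures the worst relative error after one Newton step for whatever magic constant you pass in, so the two constants can be compared empirically. The outcome depends on the input range and error metric you choose:

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static float rsqrt_with_magic( float x, uint32_t magic )
{
    uint32_t i;
    float y = x;
    memcpy( &i, &y, sizeof i );              /* reinterpret the float's bits */
    i = magic - ( i >> 1 );                  /* initial guess */
    memcpy( &y, &i, sizeof y );
    return y * ( 1.5f - 0.5f * x * y * y );  /* one Newton step */
}

static double max_rel_error( uint32_t magic )
{
    double worst = 0.0;
    /* sample inputs across a few decades; a denser or wider sweep is easy to add */
    for ( float x = 0.01f; x < 100.0f; x *= 1.0009f ) {
        double exact = 1.0 / sqrt( (double)x );
        double err = fabs( rsqrt_with_magic( x, magic ) - exact ) / exact;
        if ( err > worst ) worst = err;
    }
    return worst;
}

int main( void )
{
    printf( "0x5f3759df: max relative error ~ %g\n", max_rel_error( 0x5f3759dfu ) );
    printf( "0x5f375a86: max relative error ~ %g\n", max_rel_error( 0x5f375a86u ) );
    return 0;
}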

DainDwarf
Rushyo
    What is "mathematically purer" supposed to mean? – Tara Jul 30 '15 at 01:06
    I would imagine where the first guess can be derived from justifiable constants, rather than being seemingly arbitrary. Although if you want a technical description, you can look it up. I'm not a mathematician, and a semantic discussion about mathematical terminology doesn't belong on SO. – Rushyo Jul 30 '15 at 10:59
    That is *exactly* the reason I encapsulated that word in scare quotes, to avert this sort of nonsense. That assumes the reader is familiar with colloquial English writing, I guess. You'd think common sense would be sufficient. I didn't use a vague term because I thought "you know what, I really want to be queried on this by someone who can't be bothered to look up the original source which would take two seconds on Google". – Rushyo Jul 31 '15 at 12:32
    Well, you actually haven't answered the question. – BJovke Feb 13 '17 at 09:10
    For those who wanted to know where he finds it: https://www.beyond3d.com/content/articles/8/ – mr5 Jun 27 '17 at 15:51
    A good explanation for why the optimal first guess was worse after the Newton-Raphson step is that an overestimate converges to the result more slowly than an underestimate, as shown in this thesis: https://cs.uwaterloo.ca/~m32rober/rsqrt.pdf – EdL Mar 01 '19 at 14:07
    You are misquoting the paper and creating drama where none existed. Lomont's paper clearly explains the algorithm being used and how it was improved. – johnwbyrd Jan 16 '20 at 22:38

Of course these days, it turns out to be much slower than just using an FPU's sqrt (especially on 360/PS3), because swapping between float and int registers induces a load-hit-store, while the floating point unit can do reciprocal square root in hardware.

It just shows how optimizations have to evolve as the nature of underlying hardware changes.

Crashworks
    It's still a lot faster than std::sqrt() though. – Tara Aug 02 '15 at 02:16
    Do you have a source? I want to test the runtimes but I don't have an Xbox 360 development kit. – DucRP Dec 16 '16 at 06:20
    Well, now there is rsqrt in Intel processors, i.e. the SSE intrinsic _mm_rsqrt_ss, and it's faster still there. – aselle Jul 03 '21 at 03:21
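
For reference, a minimal sketch of the SSE route aselle mentions, assuming an x86 target with SSE (the wrapper name rsqrt_sse and the extra Newton step are my additions): _mm_rsqrt_ss returns only a ~12-bit approximation, so a refinement step is commonly added when more accuracy is needed.

#include <xmmintrin.h>                      /* SSE intrinsics */

float rsqrt_sse( float x )                  /* illustrative wrapper name */
{
    /* hardware approximate reciprocal square root (about 12 bits of precision) */
    float y = _mm_cvtss_f32( _mm_rsqrt_ss( _mm_set_ss( x ) ) );
    /* optional Newton step to sharpen the estimate */
    return y * ( 1.5f - 0.5f * x * y * y );
}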

Greg Hewgill and IllidanS4 gave a link with an excellent mathematical explanation. I'll try to sum it up here for those who don't want to go too deep into the details.

Any mathematical function, with some exceptions, can be represented by a polynomial sum:

y = f(x)

can be exactly transformed into:

y = a0 + a1*x + a2*(x^2) + a3*(x^3) + a4*(x^4) + ...

where a0, a1, a2, ... are constants. The problem is that for many functions, like the square root, the exact value requires an infinite number of terms; the sum does not end at some x^n. But if we stop at some x^n, we still get a result up to some precision.

So, if we have:

y = 1/sqrt(x)

In this particular case they decided to discard all polynomial terms above the second, probably for calculation speed:

y = a0 + a1*x + [...discarded...]

And the task now comes down to calculating a0 and a1 so that y differs as little as possible from the exact value. They calculated that the most appropriate values are:

a0 = 0x5f375a86
a1 = -0.5

So when you put this into the equation you get:

y = 0x5f375a86 - 0.5*x

Which is the same as the line you see in the code:

i = 0x5f375a86 - (i >> 1);

Edit: actually, here y = 0x5f375a86 - 0.5*x is not the same as i = 0x5f375a86 - (i >> 1), since shifting a float's bits as an integer does not just divide the value by two; it also halves the exponent and causes some other artifacts. But it still comes down to calculating some coefficients a0, a1, a2, ... .

At this point they found that this result's precision was not enough for the purpose, so they additionally did just one step of Newton's iteration to improve the accuracy of the result:

x = x * (1.5f - xhalf * x * x)

They could have done more iterations in a loop, each one improving the result, until the required accuracy was met. This is essentially how a CPU/FPU works! But it seems that only one iteration was enough, which was also a blessing for speed. The CPU/FPU does as many iterations as needed to reach full accuracy for the floating point format in which the result is stored, and it has a more general algorithm which works for all cases.
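
To illustrate that point, here is a sketch (mine, not from the Quake source; the name inv_sqrt_n is made up) of the same routine with the number of Newton iterations as a parameter, so you can trade speed for accuracy; it uses memcpy for the bit reinterpretation just to keep the sketch free of aliasing issues:

#include <stdint.h>
#include <string.h>

float inv_sqrt_n( float x, int iterations )    /* hypothetical helper */
{
    uint32_t i;
    float y = x;

    memcpy( &i, &y, sizeof i );        /* reinterpret the float's bits as an integer */
    i = 0x5f375a86 - ( i >> 1 );       /* initial guess from the magic constant */
    memcpy( &y, &i, sizeof y );        /* back to a float */

    for ( int n = 0; n < iterations; ++n )
        y = y * ( 1.5f - 0.5f * x * y * y );   /* each Newton step improves accuracy */

    return y;
}

Calling it with iterations = 1 reproduces the single-step version above; each extra iteration roughly squares the remaining relative error, which is the usual quadratic convergence of Newton's method.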


So in short, what they did is:

Use (almost) the same algorithm as the CPU/FPU, exploit the improved initial guess for the special case of 1/sqrt(x), and don't calculate all the way to the precision the CPU/FPU would reach, but stop earlier, thus gaining in calculation speed.

BJovke
    Casting the pointer to long is an approximation of log_2(float). Casting it back is an approximation of 2^long. This means that you can make the ratio approximately linear. – wizzwizz4 Sep 01 '17 at 17:29
  • This is the clearest explanation I've heard yet – user3724404 Oct 17 '21 at 19:07

I was curious to see what the constant was as a float so I simply wrote this bit of code and googled the integer that popped out.

long i = 0x5F3759DF;
float* fp = (float*)&i;
printf("(2^127)^(1/2) = %f\n", *fp);
//Output
//(2^127)^(1/2) = 13211836172961054720.000000

It looks like the constant is "An integer approximation to the square root of 2^127 better known by the hexadecimal form of its floating-point representation, 0x5f3759df" https://mrob.com/pub/math/numbers-18.html

On the same site it explains the whole thing. https://mrob.com/pub/math/numbers-16.html#le009_16

Yun
    This deserves more attention. It all makes sense after realizing that it's just the square root of 2^127... – u8y7541 Apr 05 '18 at 20:11
  • Just for the sake of completeness - the hex is not exactly `sqrt(2^127)` but a close approximation (up to two MSB digits). `sqrt(2^127) = 1.3043x10^19` while `0x5F3759DF = 1.3211x10^19` – Loves Probability Jan 05 '21 at 13:47
  • Nitpicking this a bit: the code invokes undefined behaviour (aliasing) and has a good chance of not working with modern compilers, even if float has the same size as long. – Remember Monica Mar 15 '22 at 00:28
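
To address both nitpicks at once, here is a variant of the snippet (my sketch, not the answer's code) that uses a fixed-width uint32_t and memcpy, and prints sqrt(2^127) next to the constant for comparison:

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main( void )
{
    uint32_t i = 0x5F3759DF;                   /* fixed 32-bit width, unlike long */
    float f;
    memcpy( &f, &i, sizeof f );                /* well-defined reinterpretation of the bits */
    printf( "bits as float: %.4e\n", f );                          /* ~1.3212e+19 */
    printf( "sqrt(2^127):   %.4e\n", sqrt( ldexp( 1.0, 127 ) ) );  /* ~1.3044e+19 */
    return 0;
}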

According to this nice article written a while back...

The magic of the code, even if you can't follow it, stands out as the i = 0x5f3759df - (i>>1); line. Simplified, Newton-Raphson is an approximation that starts off with a guess and refines it with iteration. Taking advantage of the nature of 32-bit x86 processors, i, an integer, is initially set to the value of the floating point number you want to take the inverse square root of, using an integer cast. i is then set to 0x5f3759df, minus itself shifted one bit to the right. The right shift drops the least significant bit of i, essentially halving it.

It's a really good read. This is only a tiny piece of it.

Dillie-O
  • The Newton-Raphson method mentioned here is similar to the gradient descent used in neural networks. The main magic here is the constant: somehow, with this constant, a single iteration of Newton-Raphson was enough to reach the required accuracy. – Harsha Reddy Oct 18 '20 at 07:44

The code consists of two major parts. Part one calculates an approximation for 1/sqrt(y), and part two takes that number and runs one iteration of Newton's method to get a better approximation.

Calculating an approximation for 1/sqrt(y)

i  = * ( long * ) &y;
i  = 0x5f3759df - ( i >> 1 );
y  = * ( float * ) &i;

Line 1 takes the floating point representation of y and treats it as an integer i. Line 2 shifts i right by one bit and subtracts the result from a mysterious constant. Line 3 takes the resulting number and converts it back to a standard float32. Now why does this work?

Let g be a function that maps a floating point number to its floating point representation, read as an integer. Line 1 above is setting i = g(y).

The following good approximation of g exists(*): g(y) ≈ C*log_2(y) + D for some constants C and D. An intuition for why such a good approximation exists is that the floating point representation of y is roughly linear in the exponent.

The purpose of line 2 is to map from g(y) to g(1/sqrt(y)), after which line 3 can use g^-1 to map that number to 1/sqrt(y). Using the approximation above, we have g(1/sqrt(y)) ≈ C*log_2(1/sqrt(y)) + D = -(C/2)*log_2(y) + D. We can use these formulas to calculate the map from g(y) to g(1/sqrt(y)), which is g(1/sqrt(y)) ≈ 3D/2 - (1/2)*g(y). In line 2, we have 0x5f3759df ≈ 3D/2, and i >> 1 ≈ (1/2)*g(y).

The constant 0x5f3759df is slightly smaller than the constant that gives the best possible approximation for g(1/sqrt(y)). That is because this step is not done in isolation. Due to the direction that Newton's method tends to miss in, using a slightly smaller constant tends to yield better results. The exact optimal constant to use in this setting depends on your input distribution of y, but 0x5f3759df is one such constant that gives good results over a fairly broad range.

A more detailed description of this process can be found on Wikipedia: https://en.wikipedia.org/wiki/Fast_inverse_square_root#Algorithm

(*) More explicitly, let y = 2^e*(1+f). Taking the log of both sides, we get log_2 y = e + log_2(1+f), which can be approximated as log_2 y ≈ e + f + σ for a small constant sigma. Separately, the float32 encoding of y expressed as an integer is g(y) ≈ 2^23 * (e+127) + f * 2^23. Combining the two equations, we get g(y) ≈ 2^23 * log_2 y + 2^23 * (127 - σ).
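
As a quick sanity check on that footnote, plugging D = 2^23 * (127 - σ) into the 3D/2 expression from above reproduces the magic constant almost exactly. A tiny sketch (mine; the value σ ≈ 0.0450466 is the commonly quoted tuning choice, not something derived here):

#include <stdint.h>
#include <stdio.h>

int main( void )
{
    double sigma = 0.0450466;                   /* assumed tuning constant */
    double D = 8388608.0 * ( 127.0 - sigma );   /* D = 2^23 * (127 - sigma) */
    uint32_t magic = (uint32_t)( 1.5 * D );     /* 3D/2 */
    printf( "computed magic = 0x%08x\n", (unsigned)magic );  /* lands within 1 of 0x5f3759df */
    return 0;
}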

Using Newton's method

y  = y * ( threehalfs - ( x2 * y * y ) );

Consider the function f(y) = 1/y^2 - num. The positive zero of f is y = 1/sqrt(num), which is what we are interested in calculating.

Newton's method is an iterative algorithm for taking an approximation y_n for the zero of a function f, and calculating a better approximation y_n+1, using the following equation: y_n+1 = y_n - f(y_n)/f'(y_n).

Calculating what that looks like for our function f gives the following equation: y_n+1 = y_n - (-y_n+y_n^3*num)/2 = y_n * (3/2 - num/2 * y_n * y_n). This is exactly what the line of code above is doing.

You can learn more about the details of Newton's method here: https://en.wikipedia.org/wiki/Newton%27s_method

user35734