6

If you were reading news about developments in graphics in the 1990s, you might have followed Jim Blinn's column in IEEE Computer Graphics & Applications, "Jim Blinn's Corner." In the summer of 1997 you would have opened your issue to a column called "Floating-Point Tricks". All of the tricks are based on the following chestnut of knowledge that Blinn passes along from Steve Gabriel and Gideon Yuval, two old hands at Microsoft:

"If you only deal with positive numbers, the bit pattern of a floating point number, interpreted as an integer, gives a piecewise linear approximation to the logarithm function"

My question is, what is it about floating point numbers which gives rise to this characteristic?

Demonstration


// These will result in undefined behavior
// and are not the 'right' way to do this
// but I am following Blinn 1997 here
int AsInteger(float f) {
    return * ( int * ) &f;
}

float AsFloat(int i) {
    return * ( float * ) &i;
}

// FISR demonstrating the logarithmic nature of floating-point numbers
// The constant chosen is not "magic" but simply the one that restores the
// exponent bits lost when the number is shifted right as an integer.
float BlinnISR(float x) {
    int i;
    float y;
    // 0x5F400000 = 1598029824
    // which is (AsInteger(1.0f) + (AsInteger(1.0f) >> 1))
    i = 0x5F400000 - ( AsInteger(x) >> 1);
    y = AsFloat(i);
    // Blinn uses 1.47 and 0.47 not 1.5 and 0.5 
    // which are what you would get from differentiating
    y = y * (1.47f - 0.47f * x * y * y);
    return y;
}

The above, a recreation of Blinn's example code, shows how it works practically for the inverse square root.

Remember that 1 / sqrt(x) is the same as x^(-1/2). So if we have access to a table of logarithms, one way to find the inverse square root is to take the natural log of x^(-1/2), which gives you -1/2 * ln(x). If you exponentiate this you get 1 / sqrt(x) = e^((-1/2) * ln(x)).

Right shifting the integer representation of the float bits gives us an approximation of the logarithm of the input floating point number divided by two (this is our 1/2 * ln(x)). Subtracting that from the restoring constant, here just 0x5F400000 and not yet 'magic', negates it, so we essentially have -1/2 * ln(x). Reinterpreting the result as a float serves as our exponentiation step. We then have a number which serves as a good first input into Newton's method.
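
Tracing BlinnISR for the input 4.0f makes the steps concrete:

AsInteger(4.0f)                               // == 0x40800000
0x40800000 >> 1                               // == 0x20400000, halving the "logarithm"
0x5F400000 - 0x20400000                       // == 0x3F000000, negating it and restoring the lost exponent bias
AsFloat(0x3F000000)                           // == 0.5f, the "exponentiation" step
0.5f * (1.47f - 0.47f * 4.0f * 0.5f * 0.5f)   // == 0.5f, the Newton step leaves it unchanged

Since 1 / sqrt(4) is exactly 0.5, the initial guess happens to be exact here; for most inputs it is merely close and the Newton step tightens it.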

What it looks like

Visually, you can see the relationship here (Blinn 1997, p. 81):

relationship of integer and floating point logarithm

A similar, idealized graph appears in Mitchell 1961 (p. 513, linked below):

logarithm and straight line approximation

This same relationship is shown in Jean-Michel Muller, "Elementary Functions and Approximate Computing," Proceedings of the IEEE, vol. 108, no. 12, p. 2137, Dec. 2020:

logarithm and straight line approximation from Muller

Where this applies

This characteristic is not unique to IEEE 754 floating point numbers, though that is where it is most famously exploited, in Quake III Arena's fast inverse square root, which uses a standard float. It will also work in an abstract format like Knuth's didactic MIX format, and in others, including the examples below.

To the best of my understanding, the relationship was first recorded in the academic press in John Mitchell's 1961 paper "Computer Multiplication and Division Using Binary Logarithms", which demonstrates the method with an abstract binary logarithm system. William Kahan implemented a similar scheme around 1961 or 1962 on an IBM 7090. We know as well, in part due to Noah Hellman's master's thesis, Mitchell-Based Approximate Operations on Floating-Point Numbers, that it was also implemented in UNIX as the system square root from 1974 until the creation of a libm for BSD 4.3 in the mid 1980s. Around that same time, Kahan and his graduate student K-C Ng produced but did not publish a 'magic' square root using this method, designed to operate not on floats but on doubles, which were split and operated on as 32-bit halves (see a related stackoverflow question).

It will not, however, work for all floating point formats. DEC's VAX floating point format interleaves the sign, exponent and fraction portions (splitting the fraction in two), so it won't do.

What I am not asking

I don't care (for this question) about how engineers came to learn about this relationship, or where it has been exploited (cf. the FISR). I want to know why it exists. Historical questions I've asked which sort of relate to the topic can be found on the Retrocomputing Stack Exchange: early examples of indexing float bits in C, type punning and C standards/compilers.

I'm also not asking how to do it, though a good answer might include an illustrative example with some code. However, you should first take a look at this amazing FISR question on codegolf to see some really interesting ones.

An excellent answer

A bad answer might be "any working engineer with a slide rule would understand this relationship," though I could imagine that a great answer might resort to a slide rule. Another bad answer might be "because Knuth says so," while a very good answer might include that sentiment amidst an explanation which depends on section 4.2.4 of Volume 2 of TAOCP (I don't have page numbers, I have O'Reilly's pagination abomination, it is pp. 240-247 in the print first edition).

An excellent answer should provide a clear understanding of the nature of the relationship and why it arises. A proof that the relationship must be so would be intriguing, though I can foresee some challenges.

Why I am asking

I am writing a history of the Fast Inverse Square Root, information about which is collected at 0x5f37642f.com. Part of investigating this as a historian means asking questions like this from the community. I do not have the expertise to distill, by myself, much of this because I am not a mathematician or a computer scientist.

The FISR has basically three interesting parts: the logarithmic approximation, the "magic" constant, and Newton's method. Of the three, when folks write about the FISR they write about the magic constant, then Newton's method, and spend very little time talking about the approximation. I get the feeling from the broader 'literature' that the approximation is widely known but not as clearly well understood. What I learn from answers will help me understand it.

An excellent answer will be noted in the acknowledgements of papers or talks which come out of this work.

I'm also asking because c'mon, no one has asked this yet and we've been here HOW long?

Related questions and answers on stackoverflow

Adam Hyland
  • Well, let's say a floating point number comprises an exponent and a mantissa, with the exponent being a base-2 logarithm and the mantissa bits concatenated to it. If you imagine a decimal point after the exponent, i.e., divide by 2 to the number of mantissa bits, you have something that equals (floor(log2(x))).(stuff) where stuff goes from 0 to 1 -- I think that's your piecewise linear interpretation. That's not dependent on the IEEE definition as such, just that a number is exponent + mantissa. – Robert Dodier Mar 17 '23 at 22:05
  • Also, as much as IEEE floats are programming related, this is still 100% a question for (also) asking over on https://math.stackexchange.com – Mike 'Pomax' Kamermans Mar 17 '23 at 23:24
  • I'll potentially ask over there but there are people who post here whose input I am interested in. – Adam Hyland Mar 18 '23 at 01:59
  • Be that as it may, @njuffa, I’ve heard from a few people that a slide rule serves as a good metaphor. Never having used one I don’t know one way or another. – Adam Hyland Mar 18 '23 at 17:15

3 Answers

8

I'm going to limit my answer to the question, "Why does the integer representation of a floating point number offer a piecewise linear approximation to the logarithm?". The fast inverse square root is a fascinating topic, but I don't know enough about it to talk about it.

But it's actually pretty easy to answer the question in the title. First I'll start with the "TL;DR" summary:

This approximation works because when a number is an exact power of the base, its logarithm is (trivially) exactly its exponent when expressed in exponential notation. Floating-point notations like IEEE-754 are base-2 exponential notations, and the exponent occupies the highest-order bits of the format, excepting only the sign bit, which is 0 for positive numbers and so can be ignored. And all of the lower-order bits (that is, beneath the exponent) form a perfectly linear progression, notionally from 000 to 999, due in part to the fact that IEEE-754 formats use normalized significands with a "hidden" leading 1 bit.

The longer explanation starts with a review: What do we mean by a logarithm? Well, it's the inverse of the exponential or power function. Exponentials say things like, ten to the second power (10^2) is 100. So the (base 10) logarithm function asks things like, What power do we have to raise 10 to to get the number 100? The answer, of course, is 2.

You can also ask things like, "What power do we have to raise 10 to to get the number 50?" That's obviously harder, because 50 isn't an even power of 10. There's some very cool math behind this which I am not going to try to explain, but raising things to fractional powers is defined in such a way that 10^1.69897 is 50. So the log (base 10) of 50 is 1.69897.

And then you don't have to limit yourself to powers of 10, either. Mathematicians like to use powers of the special number e (2.71828), and of course computer programmers love (just love) to use powers of 2. What power do we have to raise 2 to in order to get the number 32? Every computer geek knows it's 5. So the log (base 2) of 32 is 5.

And then there's scientific notation. Let's take the number 12345. One way of representing that in scientific notation is 1.2345 × 10^4. And it turns out that gives us a hint as to what the base-10 logarithm of 12345 is. It's going to be greater than or equal to 4, and less than 5. (In fact it's about 4.09149.) This works as long as we pick scientific representations that are normalized in a certain way, with the significand part (popularly called the "mantissa") chosen to be between 1 and 10. (That is, we wouldn't get a good estimate of the logarithm in this way if we'd looked at the otherwise-equivalent notations 0.12345 × 10^5 or 12.345 × 10^3.)

So now let's turn to computer floating-point representations. These typically use — and IEEE-754 definitely uses — base-2 (binary) exponential representations, that is, with normalized significand values multiplied by some power of 2. For example, the number 1.125 is represented as 1.125 × 2^0, the number 25.25 is represented as 1.578125 × 2^4, and the number 0.1875 is represented as 1.5 × 2^-3.

And, just as for base 10, those exponents give us a rough, integer-only approximation of the base-2 logarithm — again, as long as we use normalized significands, in this case between 1 and 2. So log2(25.25) is going to be somewhere between 4 and 5.

It's also time for me to foreshadow a super-special secret fact implied by the fact that we're restricting ourselves to normalized significand values. I said they were going to be "between 1 and 2", but more precisely, they're going to range from 1.00000 to 1.99999. But notice that no matter what the significand value is, for base 2 the first digit is always going to be 1. In a minute we'll see what we can do with that.

The IEEE-754 binary floating-point representations all consist of a sign bit, followed by some number of exponent bits, followed by some number of significand bits. For example, for our number 25.25, we're going to have a 0 bit representing a positive sign, followed by some representation of the exponent 4, followed by some representation of the significand 1.578125.

But now we're getting a glimmer of how the raw-bits interpretation of a floating-point number might have something to do with a logarithm. For positive numbers we don't have to worry about the sign bit (since it's 0), and the next thing we see is going to be the exponent, and we've already seen that the exponent is a rough (integer-only) approximation of the logarithm. So all that's left to ask is, how are we going to interpolate between the power-of-two values that have exact base-2 logarithms (log2(8) = 3, log2(16) = 4, log2(32) = 5, etc.) to approximate the fractional logarithms of the numbers in between?

If I gave you evenly-spaced numbers like 100, 200, 300 and asked you to fill in the numbers in between, you'd have little trouble: you'd replace the 00 part by 01, 02, 03, ..., up to 99. And if you knew anything about binary numbers, and I asked you to do the same for evenly-spaced binary numbers like

00100000
01000000
01100000
10000000

you could do more or less the same thing, replacing the 00000 part by consecutive binary numbers 00001, 00010, 00011, etc.

Remarkably enough, we're about to see that something exactly like this goes on when we encode significand values in IEEE-754 floating-point numbers.

Let's start looking at an actual IEEE-754 floating-point format. We'll use single-precision float32, to keep the numbers halfway reasonable. This format uses 1 bit for the sign (S), 8 bits for the exponent (e), and 23 bits for the significand (s):

S e e e e e e e e s s s s s s s s s s s s s s s s s s s s s s s

Let's start figuring out what the representations of the numbers 16.000 = 2^4 and 32.000 = 2^5 would look like.

16 is 1.0 × 2^4. So the significand is 1.0, and the exponent is 4. How will these be stored? I'm going to concentrate on the significand first. Remember the observation that the first digit (the one to the left of the decimal point) is always 1? IEEE-754 takes the position that this always-1 leading bit does not need to be stored at all. The actual bits stored for a significand of 1.0 will be 00000. The exponent will be some representation of 4. So I'm going to represent the number 16.000 like this:

0 4 4 4 4 4 4 4 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Those "4"s are approximate; I still haven't gotten around to explaining how the exponent is encoded.

The number 32.000 is similar — actually, it's identical, except for the exponent:

0 5 5 5 5 5 5 5 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

But the key point I'm building up to is that the numbers in between — those greater than 16.0, but less than 32.0, will all be variations on

0 4 4 4 4 4 4 4 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

except with the 0's replaced by all the consecutive 23-bit binary numbers — all 2^23 or 8388608 of them — from 000…001 up to 111…111.

So now we've got just about all the pieces in place to see how "the integer representation of a floating point number offers a piecewise linear approximation to the logarithm". We've shown that the binary representations of the numbers between 16 and 32 form a nice, linear ramp from a number that starts with something like 4, to a number that starts with something like 5.
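
As a minimal sketch of that ramp (using memcpy rather than a pointer cast so the reinterpretation is well defined), printing the raw bit patterns of a few evenly spaced values between 16.0 and 32.0 shows the integers are evenly spaced too:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    // Raw bit patterns of evenly spaced floats in [16.0, 32.0].
    for (float x = 16.0f; x <= 32.0f; x += 4.0f) {
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);
        printf("%5.1f  0x%08" PRIX32 "  %" PRIu32 "\n", x, bits, bits);
    }
    return 0;
}

Each step of 4.0 in the input bumps the integer by the same amount, 0x00200000, which is exactly the linear ramp described above.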

I've been ducking the question of how the exponents are encoded. That's done using a bias scheme. For binary32, the bit pattern stored is the actual exponent value, plus 127. So our exponent value of 4 will be stored as a bit pattern of 4+127=131, and 5 will be stored as 132. So the actual encodings of 16.0 and 32.0 will be

0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

where I have filled in the binary values of the encoded exponents 131 and 132, namely 10000011 and 10000100. And now we can see the actual, raw values

01000001100000000000000000000000 and 01000010000000000000000000000000, or 0x41800000 and 0x42000000, or 1098907648 and 1107296256.

Those numbers don't have anything obvious to do with the base-2 logarithms of 16 and 32 (namely, 4 and 5), but they're actually very close. They're offset by the effect of the bias, and they're completely wrong in magnitude. But that's not surprising: by taking 32-bit quantities and treating them as integers, we're getting big numbers (which are indeed integers), but the actual values for these logarithms should be 4.0 and 5.0, and we'd like the possibility of a fractional part. So if we're going to treat these big integers as some rendition of the logarithms of some other, smaller numbers, it's going to involve something very much like a fixed-point representation.

Deriving the details of that fixed-point representation is straightforward: the scaling factor is clearly 2^23 = 8388608, and there's also an offset, corresponding to the bias built in to the binary32 format. That offset will be 1065353216 (127 × 2^23) if applied before scaling, or 127 afterwards.

Let's see if this works. We have come up with integer values of 1098907648 and 1107296256 corresponding to 16.0 and 32.0. So:

(1098907648 - 1065353216) / 8388608 = 4
(1107296256 - 1065353216) / 8388608 = 5

or the other way:

1098907648 / 8388608 - 127 = 4
1107296256 / 8388608 - 127 = 5

So it's working!
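
As a minimal sketch of that arithmetic (the helper name bits_as_log2 is just illustrative, and memcpy is used for the reinterpretation to avoid the pointer-cast undefined behavior noted in the question), we can apply the same scale and offset to arbitrary floats and compare against the true base-2 logarithm:

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Reinterpret the bits of a float and apply the scale (2^23) and offset (127) derived above.
static double bits_as_log2(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits / 8388608.0 - 127.0;
}

int main(void) {
    const float tests[] = { 16.0f, 32.0f, 25.25f, 0.1875f };
    for (int i = 0; i < 4; i++) {
        printf("x = %8.4f   bits-as-log2 = %9.6f   log2 = %9.6f\n",
               tests[i], bits_as_log2(tests[i]), log2(tests[i]));
    }
    return 0;
}

For the exact powers of two this reproduces 4 and 5 exactly; for 25.25 it gives 4.578125 against a true value of about 4.658, which is the piecewise-linear error visible in the graph above.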

It's said that a picture is worth 1,000 words, and it's long past time to do some of this explaining that way. Here's a graph which — not coincidentally — looks an awful lot like the one you posted from Mitchell 1961:

graph of log2(x), and piecewise linear approximation

The red curve is the actual base-2 log function. The black dots are the places where log2(X) is an integer. Those are the points where the raw bits of a binary32 float value, interpreted as an integer and suitably scaled, are exactly equal to log2. And then in blue we have straight line segments directly connecting those points.

To recap: If we take the raw bits of binary32 float values corresponding to exact powers of 2, we get:

    X     binary32 (hex)   binary32 (dec)   log2
  0.25      3e800000         1048576000      -2
  0.5       3f000000         1056964608      -1
  1         3f800000         1065353216       0
  2         40000000         1073741824       1
  4         40800000         1082130432       2
  8         41000000         1090519040       3
 16         41800000         1098907648       4
 32         42000000         1107296256       5

An exact power of two has a significand of 1.0 (that is, these numbers are all 1.0 × 2^N), meaning that they are encoded in IEEE-754 as all 0's (due to the "hidden 1 bit" property). And since they're positive, their sign bit is 0 also, and the only part that isn't 0 is the exponent. And since for exact powers the exponent is the logarithm, for these numbers, the integer interpretation of the raw bit pattern is always exactly the logarithm, after shifting and scaling. (I presume that some aspect of that "shifting and scaling" is where the magic number 0x5f37642f comes in.)

And then, for all the numbers that aren't exact powers of 2, and for which the logarithm isn't an integer, the way the significand values are laid out means that the raw values (as shown by the blue lines in the plot above) form a perfect linear interpolation between the exact powers. Therefore, in (almost) all cases, the raw bit patterns do indeed end up being a "piecewise linear approximation to the logarithm". (I said "almost" because this result does not hold for the subnormals, and for the special values Inf and NaN.)

The other interpretation of the question "Why" is, "Why did the designers of IEEE-754 set things up this way?" I don't believe it was deliberate, that is, I don't believe that it was a requirement that the raw bit pattern, interpreted as an integer, corresponded to a piecewise linear approximation of the base-2 logarithm. However, given these three related facts:

  • the definition of an exponential notation basically is a piecewise approximation of a logarithm
  • IEEE-754 did have a goal of keeping positive floating-point values monotonically increasing per their raw bit values
  • IEEE-754 hides the high-order "1" bit of the significand; all significands are of the form 1.xxx but only the xxx part is stored as such

the "piecewise liner approximation" result falls out rather neatly.

Steve Summit
  • Thank you for adding the picture. I didn't want to clutter the question by adding one but looking at yours I think I will add one (probably from an old source for fun). Reading through your answer now. – Adam Hyland Mar 18 '23 at 18:32
  • Added two images from references which were in the question and one from J.M. Muller's chapter, which I have just now added. – Adam Hyland Mar 18 '23 at 20:53
  • This is a great answer and the work you have put into it is evident. I have started a bounty but I hope you see from the text that you are very much the answer to beat. I hope this page can become the reference I wish I had when I ran into the FISR in 2009. – Adam Hyland Mar 20 '23 at 02:20
  • @AdamHyland Thanks, and no worries about the bounty. I'm not perfectly satisfied with this answer, either, and I wouldn't mind seeing a better (hopefully more concise!) one. – Steve Summit Mar 20 '23 at 12:19
6

The quoted text in the question is factually incorrect. A more accurate version would read: The bit pattern of a positive normal IEEE-754 floating point number, interpreted as a fixed-point number and adjusted for exponent bias, gives a piecewise linear approximation to the binary logarithm function.

In order for the technique to work, the floating-point format must be binary, and the most significant bit of the significand must be implicit and not actually stored. Furthermore it must store the significand in a contiguous block of least significant bits, and the exponent in the immediately abutting contiguous block of more significant bits. Some floating-point formats preceding IEEE-754 fulfill these requirements, but many do not.

The technique works because IEEE-754 floating-point formats use a semi-logarithmic representation, with the exponent providing a logarithmic scale but the significand providing linear scale subdivisions between those. This is different from a slide rule, which is fully logarithmic, i.e. both major divisions and sub-divisions follow a logarithmic scale.

If we have a positive normal IEEE-754 binary32 number y, its stored exponent E is the true exponent e plus the exponent bias of 127. Of the significand s, which is in [1,2), only the fractional part f in [0, 1) is stored, that is s = 1 + f, as the most significant bit of the significand is implicit. The value of the number y is 2^e * (1 + f). Its binary logarithm is log2 (2^e * (1 + f)) = log2 (2^e) + log2 (1 + f). Given the domain of f, a very simple and rough approximation is log2 (1 + f) ≈ f (note that this matches the mathematical result at the end points), so log2 (2^e * (1 + f)) ≈ e + f.

In other words: with the exponent bias subtracted out, the bit pattern of y, reinterpreted as an integer, is an s8.23 fixed-point representation of a rough approximation of the binary logarithm of y. To convert this into a floating-point number divide by 2^23, or alternatively multiply by 2^-23. To make this process more efficient on most modern processors, one can subtract out the exponent bias at the very end of the process instead of at the very start. This results in the following exemplary ISO-C99 implementation with a maximum absolute error of 8.6e-2:

#include <stdint.h>
#include <string.h>

uint32_t float_as_uint32 (float x) 
{ 
    uint32_t r; 
    memcpy (&r, &x, sizeof r); 
    return r; 
} 

float fastest_log2 (float y)
{
    const int FP32_EXPO_BIAS = 127;
    const float two_to_m23 = 1.192092896e-7f; // 0x1.0p-23
    float x = (float)(int) float_as_uint32 (y);
    return x * two_to_m23 - (float) FP32_EXPO_BIAS;
}

On hardware that supports fused multiply-add operations and with appropriate compiler settings, the last line of fastest_log2() should map to some flavor of FMA.
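
Assuming fastest_log2() and float_as_uint32() from the listing above are available in the same file, a minimal sketch of how one might exercise it against the library log2f():

#include <math.h>
#include <stdio.h>

int main(void) {
    // A few positive normal values spanning several binades.
    const float samples[] = { 0.1875f, 1.0f, 2.0f, 25.25f, 1000.0f };
    for (int i = 0; i < 5; i++) {
        printf("y = %9.4f   fastest_log2 = %9.5f   log2f = %9.5f\n",
               samples[i], fastest_log2(samples[i]), log2f(samples[i]));
    }
    return 0;
}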

njuffa
  • Good answer, yet "In order for the technique to work, the floating-point format must be binary" and " most significant bit of the significand must be implicit and not actually stored" is not true, in general. Yes having those 2 limitations does help sample code. – chux - Reinstate Monica Mar 25 '23 at 15:52
  • One limitation not yet mentioned: endian of the `float` and `uint32_t` needs to be the same. – chux - Reinstate Monica Mar 25 '23 at 15:52
  • @chux-ReinstateMonica In order to **simply re-interpret** (e.g. the `float_to_uint32()` in my code) the bit pattern of a floating-point number as a fixed-point number that approximates the binary logarithm (**the claim from the question**), the floating point format must be binary and with implicit leading bit and the components ordered as stated. If those conditions are not met, one would first need to use bit manipulations to re-arrange bits, and in case of IBM hexadecimal floating-point format do base-conversion of the exponent. That is beyond what is claimed. – njuffa Mar 25 '23 at 19:26
4

This answer discusses positive, finite normal numbers in binary floating-point formats. (The premise does not hold for zeros, negative numbers, infinities, subnormal numbers, and NaNs.)

A floating-point number y is represented as ±b^e·s for a fixed base b, an integer exponent e, and a significand s, 0 ≤ s < b. For normal forms, 1 ≤ s < b. For binary formats, b is 2. For positive numbers, the sign is of course +. For any particular floating-point format, there are bounds on e and bounds on both the magnitude and the precision of s (the number of base-b digits in s). For general discussion of this answer, the particular characteristics do not greatly concern us, although they may be introduced at points.

Given y = 2^e·s, log2 y = log2 (2^e·s) = e + log2 s.

In the IEEE 754 “single precision” format, binary32, commonly used for float, a normal number 2^e·s is encoded as a 0 bit for the sign, 8 bits that are the binary for e+127, and 23 bits that are the binary for (s−1)·2^23. Reinterpreting these 32 bits as a binary numeral (most significant to least significant in the order listed above) yields the value (e+127)·2^23 + (s−1)·2^23 = (e+126+s)·2^23.

My question is, what is it about floating point numbers which gives rise to this characteristic?

As we can see above, reinterpreting the encoding of a floating-point number, particularly the IEEE-754 binary32, does not give an approximation of the logarithm, as it is completely out of scale and biased by an offset.

However, if we consider the adjustment f(x) = x/2^23 − 127, then f(y reinterpreted as binary) = f((e+126+s)·2^23) = (e+126+s)·2^23/2^23 − 127 = e+s−1. Now we see we have the exponent e of y; this value e+s−1 is at least near log2 y = e + log2 s. Thus, it is an approximation of log2 y. (Also note that s−1 is 0 and 1 at s = 1 and 2, respectively, the same as log2 s. So it serves as an approximation of log2 s over the interval the significand covers.)

If we take each value of e as one piece for the piecewise linear approximation, the approximation e+s−1 is linear because it fits the linear model e+s−1 = ay + b where a = 2^-e and b = e−1, as we can see since ay + b = 2^-e·y + e − 1 = 2^-e·(2^e·s) + e − 1 = e+s−1.

Regarding the adjustment function f(x) = x/2^23 − 127, you will often see aspects of this in code that manipulates the encoding of floating-point numbers—it will in various ways add or subtract to account for the bias of the exponent and the leading 1 bit of the significand, and it will in various ways shift to move the exponent field where desired.
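
As an illustrative sketch for the binade e = 4 (that is, 16 ≤ y < 32), where the model above predicts that the adjusted reinterpretation equals 2^-4·y + (4 − 1) exactly:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    for (float y = 16.0f; y < 32.0f; y += 3.5f) {
        uint32_t bits;
        memcpy(&bits, &y, sizeof bits);                // reinterpret the encoding as an integer
        double adjusted = bits / 8388608.0 - 127.0;    // f(x) = x/2^23 - 127
        double model    = 0.0625 * y + 3.0;            // a*y + b with a = 2^-4, b = 4 - 1
        printf("y = %5.2f   f(bits) = %8.6f   a*y + b = %8.6f\n", y, adjusted, model);
    }
    return 0;
}

The two printed columns agree exactly throughout the binade, because the stored field (s−1)·2^23 is itself a linear function of y there.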

Eric Postpischil