
The typical reason given for using a biased exponent (also known as offset binary) in floating-point numbers is that it makes comparisons easier.

By arranging the fields so that the sign bit occupies the most significant bit position, the biased exponent the middle bits, and the significand the least significant bits, the resulting value is ordered properly whether it is interpreted as a floating-point or as an integer value. The purpose of this is to enable high-speed comparisons between floating-point numbers using fixed-point hardware.
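
As a rough sketch of what this layout buys you (my own illustration, not from the standard; the helper names `bits_of` and `less_than_nonnegative` are made up here): for two non-negative, non-NaN single-precision values, comparing the raw bit patterns as unsigned integers gives the same result as a floating-point comparison.

#include <stdint.h>
#include <string.h>

/* Reinterpret a float's bits as a 32-bit unsigned integer without
   violating aliasing rules. */
static uint32_t bits_of(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Valid only for a, b >= +0.0 and not NaN: the sign bit is 0, the biased
   exponent occupies the high bits, and the significand the low bits, so
   unsigned integer ordering matches floating-point ordering. */
int less_than_nonnegative(float a, float b)
{
    return bits_of(a) < bits_of(b);
}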

However, because the sign bit of IEEE 754 floating-point numbers is set to 1 for negative numbers and 0 for positive numbers, the unsigned integer representation of negative floating-point numbers is greater than that of positive floating-point numbers. If this were reversed, every positive floating-point number interpreted as an unsigned integer would be greater than every negative one.
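
For example, 1.0f is encoded as 0x3F800000 and -1.0f as 0xBF800000; interpreted as unsigned integers, the negative value compares greater even though it is the smaller floating-point value.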

I understand this wouldn't completely trivialize comparisons because NaN != NaN, which must be handled separately (although whether this is even desirable is questionable, as discussed in that question). Regardless, it's strange that this is the reason given for using a biased exponent representation when it is seemingly defeated by the chosen sign convention of the sign-and-magnitude representation.

There is more discussion on the questions "Why do we bias the exponent of a floating-point number?" and "Why IEEE floating point number calculate exponent using a biased form?" From the first, the accepted answer even mentions this (emphasis mine):

The IEEE 754 encodings have a convenient property that an order comparison can be performed between two positive non-NaN numbers by simply comparing the corresponding bit strings lexicographically, or equivalently, by interpreting those bit strings as unsigned integers and comparing those integers. This works across the entire floating-point range from +0.0 to +Infinity (and then it's a simple matter to extend the comparison to take sign into account).

I can imagine two reasons: first, using a sign bit of 1 for negative values allows the definition of IEEE 754 floating-point numbers in the form (-1)^s x 1.f x 2^(e-b); and second, the floating-point number corresponding to a bit string of all 0s is equal to +0 instead of -0.
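
As a concrete check of that form, the single-precision bit pattern 0xBF800000 has s = 1, a biased exponent of 127 (so e - b = 0 with b = 127), and f = 0, giving (-1)^1 x 1.0 x 2^0 = -1.0.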

I don't see either of these as being particularly meaningful, especially considering the common rationale for using a biased exponent.

  • All-zero bits being +0 might actually be a desirable property since x/-0 is -INF, right? Same with stuff like `copysign`. Being able to `memset(x, 0)` is a nice feature. Having the bit pattern and various effects be different from `*x = 0.f` could create some gotchas down the line – Homer512 Mar 01 '23 at 10:08
  • It might be useful for some applications at the least but I'm not sure if that would have played any role in the IEEE standardization process or rationale. In any case it only really changes the `memset(..., 0)` behavior, slightly, which for what it's worth is undefined behavior in C for integer types. – Thanks for flying Vim Mar 01 '23 at 10:21
  • It's not just memset btw. It's also the default initialization of global variables, static arrays, etc. – Homer512 Mar 01 '23 at 11:09
  • No, refer to C99 §6.7.8 Initialization, "If an object that has static storage duration is not initialized explicitly, then: [...] if it has arithmetic type, it is initialized to (positive or unsigned) zero" and later "all subobjects that are not initialized explicitly shall be initialized implicitly the same as objects that have static storage duration" etc. Using `memset` to initialize objects is generally undefined, and would only ever be valid for floating-point types if `__STDC_IEC_559__` is defined (in theory). – Thanks for flying Vim Mar 02 '23 at 00:52
  • The same clause is used to specify the default initialization of pointer types to null pointers. It is a common misconception that the C standard specifies that a null pointer must have an object representation of all null characters (corresponding to an all 0 bit string), the same as for integer types. As far as I know all modern systems do work this way, but it's still not a good idea because it is unnecessary and introducing undefined behavior can in some cases cause bad compiler optimizations. – Thanks for flying Vim Mar 02 '23 at 00:56
  • So the quote you provided explicitly states that the zero needs to be non-negative. In other words, if the `+0.f` representation is anything other than all-zero bits, any static storage needs an initializer to set it. With the current implementation, the runtime environment doesn't need to care whether an array/global is float or int, the same behavior applies. Btw, don't you think it's a bit contradictory that you strongly argue against using memset, calloc, mmap, etc to zero-initialize float arrays while arguing that bit pattern comparisons on floats are valuable features? – Homer512 Mar 02 '23 at 06:19
  • Objects with static storage duration are directly embedded in the executable or other binary object. No runtime initialization is required; whatever bit representation is used for the values in the initializer is determined by the C compiler/assembler and included in the output file. "Bit pattern comparisons on floats" would still have to be done by floating-point hardware because `NaN != NaN`, and it is typically faster/more convenient when already using floating-point instructions. However, making this comparison as simple as possible could speed up the hardware, wherein lies my question. – Thanks for flying Vim Mar 02 '23 at 07:56
  • [Drepper's How To Write Shared Libraries](http://library.bagrintsev.me/CPP/dsohowto.pdf): "The size in the file can be smaller than the address space it takes up in memory. The first `p_filesz` bytes of the memory region are initialized from the data of the segment in the file, the difference is initialized with zero. This can be used […] for uninitialized variables which are according to the C standard initialized with zero." This would no longer work with a different bit pattern and make all executables with static floats larger as a consequence, turning them into CoW pages. – Homer512 Mar 02 '23 at 18:23
  • That is true, although it doesn't work for objects with initializers even if only a small part is initialized and the rest is default initialized. For example, `echo 'float a[1024*1024]={1.f};' | gcc -c -x c - -o out.o` outputs a 4MiB file. For `.bss` segment data, it could be handled by the C runtime before calling the `main` function as you initially suggested. – Thanks for flying Vim Mar 03 '23 at 05:16

2 Answers


Back in the day, signed integers were encoded using two's complement (ubiquitous today), ones' complement, and signed magnitude, with some variations on -0 and trap values.

All three could be realized well enough in hardware with similar performance and complexity. A sizeable number of hardware and software designs exist for all three.

IEEE floating point can be compared quite easily when viewed as signed magnitude.

OP's suggested "If this were reversed" creates a 4th integer encoding.


Why do IEEE 754 floating-point numbers use a sign bit of 1 for negative numbers?

To mimic the symmetry of signed-magnitude integers, take advantage of prior art, and avoid introducing yet another encoding.

chux - Reinstate Monica
  • Isn't it the case that all floating-point formats preceding those defined by IEEE-754 and going back to the 22-bit floating-point format used in Zuse's Z3 computer (designed in 1938 and completed in 1941), continuing with IBM and DEC floating-point formats and many others, already used the convention: sign bit=1 is negative, sign bit=0 is positive? Which means the IEEE-754 committee applied the principle of "least surprise" by continuing the convention. While compatibility with integer conventions may have played a role in some earlier decisions I am not aware of a primary source that says so. – njuffa Mar 01 '23 at 09:20
  • @njuffa What I do recall, circa the late 70's, was learning how to convert an integer in one of the 3 formats to the other 2, both in HW and SW. Also hearing back then that FP was like signed magnitude. The IEEE-754 committee needed to gain acceptance, and using a _known_ integer-like format (echoing your "least surprise") was certainly a motivation. IIRC, the earliest mechanical/electronic HW was all signed magnitude, given its simple match to how we do math by hand. – chux - Reinstate Monica Mar 01 '23 at 09:33
  • @njuffa It was only a change like 2's complement, with its somewhat simpler HW realization (and 2^n values), that caused it to win the `integer` race, yet that encoding is asymmetric for FP. – chux - Reinstate Monica Mar 01 '23 at 09:33
  • @njuffa Now if could all agree on a common _endian_.... – chux - Reinstate Monica Mar 01 '23 at 09:35
  • I assume there wasn't any risk of the pre-IEEE hardware supporting the new floating-point formats anyways, so a new hardware implementation was required. Reversing the meaning of the signs would simplify the implementation of comparisons, without compromising any other functionality as far as I know because the exponent already uses a biased representation. Regardless, your point is well taken that using the traditional meaning would make it an easier transition. – Thanks for flying Vim Mar 01 '23 at 09:45
  • @chux-ReinstateMonica From what I have read about the genesis of IEEE-754 from Kahan, there was a conscious decision to make the floating-point formats *similar* to the VAX floating-point formats because those were considered quite useful and widely known (thus guaranteeing easier acceptance) and adding features from CDC floating point arithmetic to that (e.g. the wider exponent field for double precision). – njuffa Mar 01 '23 at 10:27
  • 2
    C.R. Severance, "IEEE 754: An Interview with William Kahan", *Computer* 31(3):114-115: "WK: The existing DEC VAX format had the advantage of a broadly installed base. Originally, the DEC double-precision format [...] too few exponent bits for some double-precision computations. DEC addressed this by introducing its G double-precision format, which supported an 11-bit exponent and which was the same as the CDC floating-point format. With the G format, the major remaining difference between the Intel format and the VAX format was gradual underflow." – njuffa Mar 01 '23 at 10:34
  • @njuffa That's a great reference. I wasn't aware the IEEE standard attempted to follow any existing successful format that closely. Assuming the CDC format also used a sign and magnitude representation with a 1 bit for negative numbers, that alone would be sufficient to explain the decision. Although assuming the CDC etc. formats also used a biased exponent, the question might be why they didn't change it for comparisons either. In any case it sounds like a combination of historical circumstance, and the disadvantage for comparisons isn't the biggest concern. – Thanks for flying Vim Mar 01 '23 at 10:45

I found the reference "Radix Tricks" on the Wikipedia article for the IEEE 754 standard, where in the section titled "Floating point support" the author describes the steps necessary to compare two floating-point numbers (specifically, 32-bit IEEE 754 single-precision numbers) as unsigned integers.

In it, the author points out that simply flipping the sign bit is insufficient, because for negative numbers the bit-pattern ordering is the reverse of the value ordering: the encoded significand of a larger-magnitude negative number, interpreted as an unsigned integer, is greater than that of a smaller-magnitude one, even though the larger-magnitude number is the lesser value. Likewise, a negative number with a larger biased exponent is actually less than one with a smaller biased exponent, so negative numbers with the maximum unbiased exponent e_max are less than those with the minimum unbiased exponent e_min.
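
For instance, flipping only the sign bit maps -1.0f (0xBF800000) to 0x3F800000 and -2.0f (0xC0000000) to 0x40000000, so -2.0f would incorrectly compare greater than -1.0f.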

In order to correct for this, the sign bit should be flipped for positive numbers, and all bits should be flipped for negative numbers. The author presents the following algorithm:

#include <stdint.h>

uint32_t cmp(uint32_t f1, uint32_t f2)
{
    /* Flip all bits of a negative number (sign bit set), or only the
       sign bit of a non-negative number, then compare as unsigned ints. */
    uint32_t k1 = f1 ^ (-(f1 >> 31) | 0x80000000u);
    uint32_t k2 = f2 ^ (-(f2 >> 31) | 0x80000000u);
    return k1 < k2;
}
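
A minimal usage sketch (my own, not from the article; the helper name `total_less` is made up here): the floats are copied into integers with `memcpy` before calling `cmp`. Note that this bit-level order treats -0.0 as less than +0.0 and also orders NaNs, unlike IEEE comparisons.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

uint32_t cmp(uint32_t f1, uint32_t f2); /* defined above */

/* Compare two floats by their bit patterns using cmp(). */
int total_less(float a, float b)
{
    uint32_t ua, ub;
    memcpy(&ua, &a, sizeof ua);
    memcpy(&ub, &b, sizeof ub);
    return (int)cmp(ua, ub);
}

int main(void)
{
    printf("%d\n", total_less(-2.0f, -1.0f)); /* 1: -2.0f < -1.0f */
    printf("%d\n", total_less(1.0f, -1.0f));  /* 0: 1.0f is not less */
    return 0;
}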

The purpose in explaining this is to clarify that inverting the sign bit alone does not make it possible to directly compare finite floating-point numbers as unsigned integers. On the contrary, using sign-and-magnitude hardware (which must interpret the sign bit as a sign bit, and not as part of an unsigned integer) requires no additional bitwise operations and should therefore result in the simplest, smallest, and most efficient design.

It is possible to create a floating-point encoding that uses 2's complement, and it has been studied as detailed in this paper. However, this is far beyond the scope of the question and involves many additional complexities and problems to be solved. Perhaps there is a better way, but the IEEE 754 design has the advantage of being well established and proven satisfactory in practice.