15

I learned from the book Computer Systems: A Programmer's Perspective that the IEEE standard requires a double-precision floating-point number to be represented using the following 64-bit binary format:

  • s: 1 bit for sign
  • exp: 11 bits for exponent
  • frac: 52 bits for fraction

The +infinity is represented as a special value with the following pattern:

  • s = 0
  • all exp bits are 1
  • all fraction bits are 0

And I think the full 64 bits of a double should be laid out in the following order:

(s)(exp)(frac)

So I wrote the following C code to verify it:

//Check the infinity
double x1 = (double)0x7ff0000000000000;  // This should be the +infinity
double x2 = (double)0x7ff0000000000001; //  Note the extra ending 1, x2 should be NaN
printf("\nx1 = %f, x2 = %f sizeof(double) = %d", x1, x2, (int)sizeof(x2)); // sizeof yields size_t, so cast it for %d
if (x1 == x2)
    printf("\nx1 == x2");
else
    printf("\nx1 != x2");

But the result is:

x1 = 9218868437227405300.000000, x2 = 9218868437227405300.000000 sizeof(double) = 8
x1 == x2

Why is the number a valid number rather than some infinity error?

Why is x1 == x2?

(I am using the MinGW GCC compiler.)

ADD 1

I modified the code as below and then validated the infinity and NaN values successfully.

//Check the infinity and NaN
unsigned long long x1 = 0x7ff0000000000000ULL; // +infinity as double
unsigned long long x2 = 0xfff0000000000000ULL; // -infinity as double
unsigned long long x3 = 0x7ff0000000000001ULL; // NaN as double
double y1 =* ((double *)(&x1));
double y2 =* ((double *)(&x2));
double y3 =* ((double *)(&x3));

printf("\nsizeof(long long) = %d", (int)sizeof(x1)); // sizeof yields size_t, so cast it for %d
printf("\nx1 = %f, x2 = %f, x3 = %f", x1, x2, x3); // %f is good enough for output
printf("\ny1 = %f, y2 = %f, y3 = %f", y1, y2, y3);

The result is:

sizeof(long long) = 8
x1 = 1.#INF00, x2 = -1.#INF00, x3 = 1.#SNAN0
y1 = 1.#INF00, y2 = -1.#INF00, y3 = 1.#QNAN0

The detailed output looks a bit strange, but I think the point is clear.

PS.: It seems the pointer conversion is not necessary. Just use %f to tell the printf function to interpret the unsigned long long variable in double format.

ADD 2

Out of curiosity, I checked the bit representation of the variables with the following code.

typedef unsigned char *byte_pointer;

void show_bytes(byte_pointer start, int len)
{
    int i;
    for (i = len-1; i>=0; i--)
    {
        printf("%.2x", start[i]);
    }
    printf("\n");
}

And I tried the code below:

//check the infinity and NaN
unsigned long long x1 = 0x7ff0000000000000ULL; // +infinity as double
unsigned long long x2 = 0xfff0000000000000ULL; // -infinity as double
unsigned long long x3 = 0x7ff0000000000001ULL; // NaN as double
double y1 =* ((double *)(&x1));
double y2 =* ((double *)(&x2));
double y3 = *((double *)(&x3));

unsigned long long x4 = x1 + x2;  // I want to check (+infinity)+(-infinity)
double y4 = y1 + y2; // I want to check (+infinity)+(-infinity)

printf("\nx1: ");
show_bytes((byte_pointer)&x1, sizeof(x1));
printf("\nx2: ");
show_bytes((byte_pointer)&x2, sizeof(x2));
printf("\nx3: ");
show_bytes((byte_pointer)&x3, sizeof(x3));
printf("\nx4: ");
show_bytes((byte_pointer)&x4, sizeof(x4));

printf("\ny1: ");
show_bytes((byte_pointer)&y1, sizeof(y1));
printf("\ny2: ");
show_bytes((byte_pointer)&y2, sizeof(y2));
printf("\ny3: ");
show_bytes((byte_pointer)&y3, sizeof(y3));
printf("\ny4: ");
show_bytes((byte_pointer)&y4, sizeof(y4));

The output is:

x1: 7ff0000000000000

x2: fff0000000000000

x3: 7ff0000000000001

x4: 7fe0000000000000

y1: 7ff0000000000000

y2: fff0000000000000

y3: 7ff8000000000001

y4: fff8000000000000  // <== Different from x4

The strange part is that, although x1 and x2 have the same bit patterns as y1 and y2, the sum x4 is different from y4.

And

printf("\ny4=%f", y4);

gives this:

y4=-1.#IND00  // What does it mean???

Why are they different? And how is y4 obtained?

Peter Mortensen
smwikipedia

4 Answers

20

First, 0x7ff0000000000000 is indeed the bit representation of a double infinity. But the cast does not set the bit representation; it converts the numeric value of 0x7ff0000000000000, interpreted as a 64-bit integer, to a double. So, you need to use some other way to set the bit pattern.

The straightforward way to set the bit pattern would be

uint64_t bits = 0x7ff0000000000000;
double infinity = *(double*)&bits;

However, this is undefined behavior. The C standard forbids reading a value that has been stored as one fundamental type (uint64_t) as another fundamental type (double). This restriction is known as the strict aliasing rule, and it allows the compiler to generate better code because it can assume that the relative order of a read of one type and a write of another type is irrelevant.

The only exception to this rule is the char types: You are explicitly allowed to cast any pointer to a char* and back. So you could try to use this code:

char bits[] = {0x7f, 0xf0, 0, 0, 0, 0, 0, 0};
double infinity = *(double*)bits;

Even though this is not undefined behavior anymore, it is still implementation defined behavior: The order of the bytes in a double depends on your machine. The given code works on a big-endian machine (for example, Power or ARM when running in big-endian mode), but not on x86, which is little-endian. For x86 you need this version:

char bits[] = {0, 0, 0, 0, 0, 0, 0xf0, 0x7f};
double infinity = *(double*)bits;

There is really no way around this implementation defined behavior, since there is no guarantee that a machine will store floating point values in the same order as integer values. There are even machines that use byte orders like this: <1, 0, 3, 2>. I don't even want to know who came up with this brilliant idea, but it exists and we have to live with it.


To your last question: floating point arithmetic is inherently different from integer arithmetic. The bits have special meanings, and the floating point unit takes that into account. In particular, special values like infinities, NANs, and denormalized numbers are treated in a special way. And since +inf + -inf is defined to yield a NAN, your floating point unit emits the bit pattern of a NAN. The integer unit does not know about infinities or NANs, so it just interprets the bit pattern as a huge integer and happily performs an integer addition (which happens to wrap around in this case, because the carry out of the top bit is discarded). The resulting bit pattern is not that of a NAN. It happens to be the bit pattern of a really huge, positive floating point number (2^1023, to be precise), but that bears no meaning.


Actually, there is a way to set the bit patterns of all values except NANs in a portable way: Given three variables containing the bits of the sign, exponent, and mantissa, you can do this:

uint64_t sign = ..., exponent = ..., mantissa = ...;
double result;
assert(!(exponent == 0x7ff && mantissa));    //Can't set the bits of a NAN in this way.
if(exponent) {
    //This code does not work for denormalized numbers. And it won't honor the value of mantissa when the exponent signals NAN or infinity.
    result = mantissa + (1ull << 52);    //Add the implicit bit.
    result /= (1ull << 52);    //This makes sure that the exponent is logically zero (equals the bias), so that the next operation will work as expected.
    result *= pow(2, (double)((signed)exponent - 0x3ff));    //This sets the exponent.
} else {
    //This code works for denormalized numbers.
    result = mantissa;    //No implicit bit.
    result /= (1ull << 51);    //This ensures that the next operation works as expected.
    result *= pow(2, -0x3ff);    //Scale down to the denormalized range.
}
result *= (sign ? -1.0 : 1.0);    //This sets the sign.

This uses the floating point unit itself to move the bits into the right place. Since there is no way to interact with the mantissa bits of a NAN using floating point arithmetic, it is not possible to include the generation of NANs in this code. Well, you could generate a NAN, but you'd have no control over its mantissa bit pattern.

cmaster - reinstate monica
  • @cmaster Can't you use the `hton` and `ntoh` family to determine the byte order (or at least make it consistent)? – clcto Nov 01 '14 at 20:40
  • @clcto `ntohd()` would do the trick, but afaik, it is not a part of the POSIX standard. The glibc seems only to implement `ntohs()` and `ntohl()`, both of which work only on integers. And since integers may use a different byte order than floats, this is not even sufficient to set the bits of a `float`. – cmaster - reinstate monica Nov 01 '14 at 21:12
  • In light of David Hammen [quoting chapter and verse](http://stackoverflow.com/questions/26688630/how-is-infinity-represented-in-a-c-double#comment41981827_26688666) for the contrary opinion below, it's possible that *"The C standard forbids [...]"* needs a citation. – dmckee --- ex-moderator kitten Nov 02 '14 at 16:32
  • "Even though this is not undefined behavior anymore, it is still implementation defined behavior" Nope, it's still UB. `&bits` is not `double*`, so conversion from `char*` is not allowed. In real life this means that code blows up when alignment doesn't match, or when compiler optimization decides to do funny stuff. Use `union` or `memcpy` instead. – user694733 Oct 28 '16 at 14:15
8

The initialization

double x1=(double)0x7ff0000000000000;

is converting an integer literal to a double. You probably want to share the bitwise representation. This is implementation specific (perhaps unspecified behavior), but you might use a union:

union { double x; long long n; } u;
u.n = 0x7ff0000000000000LL;

then use u.x; I am assuming that long long and double are both 64 bits on your machine. Endianness and the floating point representation also matter.

See also http://floating-point-gui.de/

Notice that not all processors are x86, and not all floating point implementations are IEEE754 (even if in 2014 most of them are). Your code probably won't work the same on an ARM processor, e.g. in your tablet.

Basile Starynkevitch
  • Is this legal? I thought the C spec said it was undefined behaviour to read from any union member other than the one that was last written to... but I am probably just imagining things – dreamlax Nov 01 '14 at 11:08
  • It probably is unspecified behavior, and it certainly is implementation specific. I don't think it is undefined behavior. – Basile Starynkevitch Nov 01 '14 at 11:09
  • But how can unspecified behaviour be defined? Do you just mean that it's deterministic, and predictable in practice? – chiastic-security Nov 01 '14 at 11:11
  • It is implementation specific, so in some way you could feel it is defined by your implementation. – Basile Starynkevitch Nov 01 '14 at 11:12
  • I feel like it's possible that `long long` and `double` may have different alignment requirements on some architectures (again, speculating here), which would mean problems later. Perhaps the only suitable way is to use `memcpy()` to copy from a known-aligned `long long` to a known-aligned `double`. – dreamlax Nov 01 '14 at 11:20
  • But even with different alignment, a `union` will keep the strictest alignment constraint of its members. So it will work. – Basile Starynkevitch Nov 01 '14 at 11:21
  • In my scenario, it suffices to just use `%f` to tell the `printf` function to interpret the `unsigned long long` variable in `double` format. – smwikipedia Nov 01 '14 at 12:03
  • @smwikipedia: that is undefined behaviour, it is not guaranteed to work at all. – dreamlax Nov 01 '14 at 14:15
  • Even in a union it violates strict aliasing rules. The compiler is allowed to perform the read before the write. And then they are allowed to optimize the write away, because the data is never read, etc. Definitely UB. – cmaster - reinstate monica Nov 01 '14 at 19:22
  • @cmaster - While this is undefined behavior in C++, *this question is not tagged as C++*. It's tagged as C. Type punning via a union is legal in C and does not violate C's strict aliasing rules. You should direct your ire against the answer that uses casting. That is UB in C as well as in C++. – David Hammen Nov 01 '14 at 19:36
  • @DavidHammen Where did you get the thing about type punning being legal if done via a union? It *used* to be legal before strict aliasing rules were introduced. But since we have strict aliasing rules in C (C99, I believe), *all* type punning is illegal. The only exceptions are the `char` types and the `memcpy()` function. The special thing about unions is that you are allowed to access variables in the beginning of member structs *that are identical in type and sequence* from any of the struct members, irrespective of which struct member was used to store them. – cmaster - reinstate monica Nov 01 '14 at 19:56
  • @cmaster - Storing a union member of one type and retrieving a member of a different type was UB in C90. Compiler vendors received so many complaints that many restored this widely used pre-ANSI behavior. The C99 standard removed this wording. The C11 standard retained the C99 wording and added a footnote, *If the member used to read the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6* – David Hammen Nov 01 '14 at 20:55
  • Also see http://stackoverflow.com/questions/11442708/type-punning-and-unions-in-c and http://stackoverflow.com/questions/11639947/is-type-punning-through-a-union-unspecified-in-c99-and-has-it-become-specified (and others). – David Hammen Nov 01 '14 at 20:56
  • @BasileStarynkevitch has it exactly right. In C99 and C11, accessing a member of a union "other than the last one stored into" (wording straight from the standard, appendix J.1) is unspecified behavior, not undefined behavior (that's the subject of appendix J.2). – David Hammen Nov 01 '14 at 21:39
8

You're casting the value to a double and that isn't going to work as you expect.

double x1=(double)0x7ff0000000000000; // Not setting the value directly

To avoid this issue you could take the address of that value, interpret it as a pointer to double, and dereference it (although this is strongly discouraged and only works if unsigned long long and double have the same size):

unsigned long long x1n = 0x7ff0000000000000ULL; // Inf
double x1 = *((double*)&x1n);
unsigned long long x2n = 0x7ff0000000000001ULL; // Signaling NaN
double x2 = *((double*)&x2n);

printf("\nx1=%f, x2=%f sizeof(double) = %d", x1, x2, (int)sizeof(x2)); // sizeof yields size_t, so cast it for %d
if (x1 == x2)
    printf("\nx1==x2");
else
    printf("\nx1!=x2"); // x1 != x2

Example on ideone

Sumurai8
Marco A.
6

You've converted the constant 0x7ff00... into a double. This isn't at all the same as taking the bit representation of that value and interpreting it as a double.

This also explains why x1==x2. When you convert to a double, you lose precision; so sometimes for large integers, the double that you end up with is the same in the two cases. This gives you some weird effects, where for a large floating point value, adding 1 leaves it unchanged.

chiastic-security