With this code:

    #include <float.h>
    #include <stdio.h>

    int main(void)
    {
        printf("FLT_MAX = %e\n", FLT_MAX);

        unsigned long ul = 1234567891011;
        float f = (float)ul;   /* converts to the nearest representable float */
        printf("f=%f\n", f);
    }

I would have expected f=1234567891011.000000, which is my integer with decimals. Instead, the result is:

FLT_MAX = 3.402823e+38
f=1234567954432.000000

Knowing that:

sizeof(long unsigned) = 8 bytes
sizeof(double)        = 8 bytes
sizeof(float)         = 4 bytes

I can solve the problem by using double instead of float, but I want to understand:

  • why this casting fails despite 1234567891011 being much smaller than ULONG_MAX (2^64) and FLT_MAX ? (but indeed, suspiciously larger than 2^32)
  • how can FLT_MAX reach 3.402823e+38 with only 4 bytes?
  • is there a clean way to handle this case, other than checking if my unsigned long is bigger than a specific threshold? (2^32 ?)

PS: yes there is this old similar question, but it's incorrect for large numbers

Eric Postpischil
acromarco
  • Hint: `float` only has about 8 digits of precision. – NathanOliver Apr 06 '22 at 15:50
  • Not sure if this should be a duplicate of https://stackoverflow.com/questions/588004/is-floating-point-math-broken – Eugene Sh. Apr 06 '22 at 15:51
  • @EugeneSh. Probably not, although it is closely related. – Mark Ransom Apr 06 '22 at 15:52
  • By the way there's nothing improper about that cast. – Mark Ransom Apr 06 '22 at 15:54
  • _"how can FLT_MAX reach 3.402823e+38 with only 4 bytes?"_ - because an IEEE-754 32-bit float uses 8 of its bits for the exponent. Perhaps [the IEEE 754 wiki](https://en.wikipedia.org/wiki/IEEE_754) helps – Ted Lyngmo Apr 06 '22 at 15:54
  • @MarkRansom it's not improper in the sense that it compiles, but improper in the sense that it distorts the integer. Wouldn't it make sense for the compiler to complain? – acromarco Apr 06 '22 at 15:57
  • @acromarco No it wouldn't. A float can't accurately describe all integers smaller than `FLT_MAX` – Ted Lyngmo Apr 06 '22 at 15:59
  • @acromarco This is a perfectly legal operation and the compiler expects the programmer to know what they are doing (including the limitations of floating point representation). – Eugene Sh. Apr 06 '22 at 16:00
  • @acromarco [A good compiler will complain](http://coliru.stacked-crooked.com/a/3673fbaec3c3df5d), unfortunately doing `(float)ul;` shuts it up as the cast tells the compiler you know what you are doing. – NathanOliver Apr 06 '22 at 16:00
  • @EugeneSh. thanks for the link, I understand precision better now. But I am still wondering how to handle this case gracefully, re "_is there a clean way to handle this case, other than checking if my unsigned long is bigger than a specific threshold? (2^32 ?)_" – acromarco Apr 06 '22 at 16:01
  • @NathanOliver indeed, but the behavior is the same without cast (using Replit's compiler) – acromarco Apr 06 '22 at 16:03
  • You need a "larger" type. `double` has 52 bits of mantissa, so it can store any integer that uses 52 bits or less of space. – NathanOliver Apr 06 '22 at 16:04
  • @NathanOliver actually 53 bits, since double has an implied 1 bit to the left of what's specified. – Mark Ransom Apr 06 '22 at 16:09
  • The cast isn't the issue here; it's the conversion that the cast tells the compiler to do. But that's a conversion that it would do anyway, without the cast. `float f = ul;` does exactly the same thing as `float f = (float)ul;`. Don't use casts unless they are absolutely needed, and even then, make sure that the cast is absolutely needed. – Pete Becker Apr 06 '22 at 16:15
  • You have the right idea, explicitly check the value before the conversion. But the limit is 2^24, not 2^32. See the link that @TedLyngmo posted. – Mark Ransom Apr 06 '22 at 16:25
  • Do not tag both C and C++ except when asking about differences or interactions between the two languages. Tag floating-point when asking about floating-point issues. – Eric Postpischil Apr 06 '22 at 21:30

1 Answer

why this casting fails despite 1234567891011 being much smaller than ULONG_MAX (2^64) and FLT_MAX ? (but indeed, suspiciously larger than 2^32)

In your C implementation, float represents finite numbers as a sign, a 24-bit unsigned integer, and a scaling by a power of two with an exponent ranging from 104 to −149. (This is also described as a 24-bit binary numeral with its radix point fixed after the first digit and a power of two with an exponent ranging from 127 to −126. These are mathematically equivalent.)

1,234,567,891,011 cannot be represented as a 24-bit integer, not even when scaled by a power of two. The closest value representable in the float format described above is 4,709,503•2^18, which equals 1,234,567,954,432. Thus, when 1,234,567,891,011 is converted to float, the result is 1,234,567,954,432.

how can FLT_MAX reach 3.402823e+38 with only 4 bytes?

How can you reach 3.402823e+38 using only those 12 characters? How can powers of 2 reach 2^99 using only two digits in the exponent? How can we represent the concept of infinity using only finitely many letters in the word “infinity”?

The number of bits in an object only limits how many different things it can represent. It does not in any way control what they represent. We can take three bits and say 000 represents zero, 001 represents five, 010 represents −1, 011 represents π, 100 represents i (the square root of −1), 101 represents apple, 110 represents Mars, and 111 represents ennui.

In the float format, we say the first bit represents the sign (0 for +, 1 for −), the next eight bits represent a code mostly for the exponent of two, and the last 23 bits represent most of the fraction portion of the float representation. The exponent codes 1 to 254 stand for exponents from −126 to +127 and also mean the fraction portion starts with “1.”. The exponent code 0 means the exponent is −126 and the fraction portion starts with “0.”. The exponent code 255 means the value represented is either infinity or a special “Not a Number” value, depending on the last 23 bits.

Since the exponent code bits can represent exponents up to 127, a float object can represent up to 2^127 times the fraction portion, which can be as high as a little less than 2 (it can be 2−2^−23).

is there a clean way to handle this case, other than checking if my unsigned long is bigger than a specific threshold? (2^32 ?)

It was handled cleanly; float f = (float) ul; converted the number as well as it could. If you want different handling, you need to describe what different results you want.

Eric Postpischil
  • Thanks for the thorough answer. By the way, I did say what different result I wanted: "_I would have expected f=1234567891011.000000_". That confirms the only way is to use a double instead of a float. And if I keep a float, I should handle the case of my unsigned long being over 2^24 – acromarco Apr 07 '22 at 06:22