When Will static_casting the Result of ceil Compromise the Result?

Question

static_casting from a floating point to an integer simply strips the fractional part of the number. For example static_cast<int>(13.9999999) yields 13.

Not all integers are representable as floating point numbers. For example internally the closest float to 13,000,000 may be: 12999999.999999.

In this hypothetical case, I'd expect to get an unexpected result from:

const auto foo = 12'999'999.5F;
const auto bar = static_cast<long long>(ceil(foo));

My assumption is that such a breakdown does occur at some point, if not necessarily at 13,000,000. I'd just like to know the range over which I can trust static_cast<long long>(ceil(foo))?

_"The largest representable floating-point values are exact integers in all standard floating-point formats, ..."_ from: http://en.cppreference.com/w/cpp/numeric/math/ceil — Richard Critten, Jan 05 '18 at 16:11
This is going to be implementation defined but for IEEE 754 there is a dupe target: https://stackoverflow.com/questions/3793838/which-is-the-first-integer-that-an-ieee-754-float-is-incapable-of-representing-e — NathanOliver, Jan 05 '18 at 16:17
@NathanOliver Any chance you can use some IEEE magic to explain my answer then? That's even within the range of an `int`, which is nasty. — Jonathan Mee, Jan 05 '18 at 16:27
@NathanOliver So I added a [new answer](https://stackoverflow.com/a/48200609/2642059), which explains why `ceil` doesn't work at numbers greater than `(1LL << numeric_limits::digits - 1LL) - 1LL`. Hopefully it's legible :/ — Jonathan Mee, Jan 11 '18 at 14:23

Eric Postpischil · Answer 1 · 2018-01-06T12:29:47.450

For example internally the closest float to 13,000,000 may be: 12999999.999999.

That is not possible in any normal floating-point format. The floating-point representation of numbers is equivalent to M•b^e, where b is a fixed base (e.g., 2 for binary floating-point) and M and e are integers with some restrictions on their values. In order for a value like 13,000,000-x to be represented, where x is some positive value less than 1, e must be negative (because M•b^e for a non-negative e is an integer). If so, then M•b⁰ is an integer larger than M•b^e, so it is larger than 13,000,000, and so 13,000,000 can be represented as M'•b⁰, where M' is a positive integer less than M and hence fits in the range of allowed values for M (in any normal floating-point format). (Perhaps some bizarre floating-point format might impose a strange range on M or e that prevents this, but no normal format does.)
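The M•b^e form can be inspected directly. The following is a minimal sketch, assuming b is 2 and float is the IEEE-754 32-bit format; Decomposed and decompose are hypothetical names used only for this illustration. It splits a float into an integer M and an exponent e, reduced so that M is odd.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper type: the value equals m * 2^e, with m an integer.
struct Decomposed {
    long long m;
    int e;
};

// Splits a finite float into an integer significand and a power-of-two exponent.
Decomposed decompose(float x) {
    assert(std::isfinite(x)); // infinities and NaN have no such form
    constexpr int digits = std::numeric_limits<float>::digits;
    int e = 0;
    const float frac = std::frexp(x, &e); // x == frac * 2^e, 0.5 <= |frac| < 1
    // Scaling by 2^digits turns the fraction into an exact integer.
    long long m = static_cast<long long>(std::ldexp(frac, digits));
    e -= digits;
    // Remove factors of two so that m is odd (or zero).
    while (m != 0 && m % 2 == 0) {
        m /= 2;
        ++e;
    }
    return {m, e};
}
```

Under those assumptions decompose(13'000'000.0F) gives m = 203125 and e = 6. The exponent is non-negative, so the stored value is exactly the integer 13,000,000 and not something slightly below it.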

Regarding your code:

auto test = 0LL;
const auto floater = 0.5F;

for(auto i = 0LL; i == test; i = std::ceil(i + floater)) ++test;

cout << test << endl;

When i was 8,388,608, the mathematical result of 8,388,608 + .5 is 8,388,608.5. This is not representable in the float format on your system, so it was rounded to 8,388,608. The ceil of this is 8,388,608. At this point, test was 8,388,609, so the loop stopped. So this code does not demonstrate that 8,388,608.5 is representable and 8,388,609 is not.
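The lost .5 can be checked in isolation. This is a small sketch, assuming float is the IEEE-754 32-bit format and the default round-to-nearest mode is in effect; add_half and half_is_lost are hypothetical helper names.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: adds 0.5 in float precision and returns the rounded sum.
// The volatile store forces rounding to float even where the compiler
// evaluates float expressions in a wider format.
float add_half(float x) {
    volatile float sum = x + 0.5F;
    return sum;
}

// True when x + 0.5F rounds back to x, meaning the .5 could not be stored.
bool half_is_lost(float x) {
    return add_half(x) == x;
}
```

With those assumptions half_is_lost(8'388'608.0F) is true while half_is_lost(8'388'607.0F) is false, so std::ceil(add_half(8'388'608.0F)) is handed 8,388,608 and returns it unchanged.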

Behavior seems to return to normal if I do: ceil(8'388'609.5F) which will correctly return 8,388,610.

8,388,609.5 is not representable in the float format on your system, so it was rounded by the rule “round to nearest, ties to even.” The two nearest representable values are 8,388,609, and 8,388,610. Since they are equally far apart, the result was 8,388,610. That value was passed to ceil, which of course returned 8,388,610.
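The same thing can be seen without calling ceil at all. A minimal sketch, assuming an IEEE-754 32-bit float and a compiler that converts literals with round to nearest, ties to even (the function names are hypothetical):

```cpp
#include <cassert>
#include <cmath>

// True when the literal 8'388'609.5F has already been rounded to 8,388,610
// by the time the program runs.
bool literal_is_already_integer() {
    return 8'388'609.5F == 8'388'610.0F;
}

// What ceil returns when it is handed that literal.
float ceil_of_literal() {
    const float value = 8'388'609.5F; // rounded during compilation
    return std::ceil(value);
}
```

If literal_is_already_integer() returns true, ceil never saw a fractional part, so the "correct" 8,388,610 comes from the rounding of the literal rather than from ceil.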

On Visual Studio 2015 I got 8,388,609 which is a horrifying small safe range.

In the IEEE-754 basic 32-bit binary format, all integers from -16,777,216 to +16,777,216 are representable, because the format has a 24-bit significand.
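A round trip through float is one way to probe that range. This is a sketch assuming float is the IEEE-754 32-bit format; survives_float is a hypothetical helper name.

```cpp
#include <cassert>

// Hypothetical helper: true when n converts to float and back without change.
// The volatile store forces the value to really be rounded to float.
bool survives_float(long long n) {
    volatile float f = static_cast<float>(n);
    return static_cast<long long>(f) == n;
}
```

Under that assumption survives_float(16'777'216) is true and survives_float(16'777'217) is false. Larger even integers such as 16,777,218 still survive, so 16,777,217 is only the first gap, not the point where every integer fails.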

score 0 · Accepted Answer · answered Jan 11 '18 at 05:56

Floating point numbers are represented by 3 integers, cb^q where:

c is the mantissa (so for the number: 12,999,999.999999 c would be 12,999,999,999,999)
q is the exponent (so for the number: 12,999,999.999999 q would be -6)
b is the base (IEEE-754 requires b to be either 10 or 2; in the representation above b is 10)

From this it's easy to see that a floating point with the capability of representing 12,999,999.999999 also has the capability of representing 13,000,000.000000 using a c of 1,300,000,000,000 and a q of -5.

This example is a bit contrived in that the chosen b is 10, where in almost all implementations the chosen base is 2. But it's worth pointing out that even with a b of 2 the q functions as a shift left or right of the mantissa.

Next let's talk about a range here. Obviously a 32-bit floating point cannot represent all the integers represented by a 32-bit integer, as the floating point must also represent so many much larger or smaller numbers. Since the exponent is simply shifting the mantissa, a floating point number can always exactly represent every integer that can be represented by its mantissa. Given the traditional IEEE-754 binary base floating point numbers:

A 32-bit (float) has a 24-bit mantissa so it can represent all integers in the range [-16,777,215, 16,777,215]
A 64-bit (double) has a 53-bit mantissa so it can represent all integers in the range [-9,007,199,254,740,991, 9,007,199,254,740,991]
A 128-bit (long double depending upon implementation) has a 113-bit mantissa so it can represent all integers in the range [-10,384,593,717,069,655,257,060,992,658,440,191, 10,384,593,717,069,655,257,060,992,658,440,191]

[source]

C++ provides numeric_limits<T>::digits as a method of finding the mantissa width for a given floating point type. (Though admittedly even a long long is too small to represent a 113-bit mantissa.) For example a float's maximum mantissa could be found by:

(1LL << numeric_limits<float>::digits) - 1LL
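The expression generalizes to any type whose mantissa fits in a long long. A minimal sketch, where max_mantissa is a hypothetical name and the static_asserts spell out the assumptions (binary radix, fewer than 63 mantissa bits):

```cpp
#include <cassert>
#include <limits>

// Hypothetical helper: the largest integer that fits in T's mantissa.
template <typename T>
constexpr long long max_mantissa() {
    static_assert(std::numeric_limits<T>::radix == 2,
                  "sketch assumes a binary floating point type");
    static_assert(std::numeric_limits<T>::digits < 63,
                  "mantissa must fit in a long long");
    return (1LL << std::numeric_limits<T>::digits) - 1LL;
}
```

On an implementation with IEEE-754 float and double, max_mantissa<float>() is 16,777,215 and max_mantissa<double>() is 9,007,199,254,740,991. A long double with a 64-bit or 113-bit mantissa is rejected at compile time by the second static_assert.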

Having thoroughly explained the mantissa, let's revisit the exponent section to talk about how a floating point is actually stored. Take 13,000,000.0 that could be represented as:

c = 13, q = 6, b = 10
c = 130, q = 5, b = 10
c = 1,300, q = 4, b = 10

And so on. For the traditional binary format IEEE-754 requires:

The representation is made unique by choosing the smallest representable exponent that retains the most significant bit (MSB) within the selected word size and format. Further, the exponent is not represented directly, but a bias is added so that the smallest representable exponent is represented as 1, with 0 used for subnormal numbers

To explain this in the more familiar base-10 if our mantissa has 14 decimal places, the implementation would look like this:

c = 13,000,000,000,000 so the MSB will be used in the represented number
q = 7 This is a little confusing because of the bias introduced here: logically q would be -6, but the bias is set so that when q = 0 only the MSB of c is immediately to the left of the decimal point, meaning that c = 13,000,000,000,000, q = 0, b = 10 will represent 1.3, and q = 7 is needed to represent 13,000,000
b = 10 again the above rules are really only required for base-2 but I've shown them as they would apply to base-10 for the purpose of explanation

Translated back to base-2 this means that a q of numeric_limits<T>::digits - 1 has only zeros after the decimal place. ceil only has an effect if there is a fractional part of the number.

A final point of explanation here is the range over which ceil will have an effect. Once the exponent of a floating point number is at least numeric_limits<T>::digits - 1, the number has no fractional bits, and increasing the exponent further only introduces trailing zeros to the resulting number. Thus calling ceil can only change the value when q is less than or equal to numeric_limits<T>::digits - 2. And since we know the MSB of c will be used in the number, this means that c must be smaller than (1LL << numeric_limits<T>::digits - 1LL) - 1LL. Thus for ceil to have an effect on the traditional binary IEEE-754 floating point:

A 32-bit (float) must be smaller than 8,388,607
A 64-bit (double) must be smaller than 4,503,599,627,370,495
A 128-bit (long double depending upon implementation) must be smaller than 5,192,296,858,534,827,628,530,496,329,220,095
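These limits can be probed with a short sketch, assuming a binary floating point type; integer_only_threshold and ceil_changes are hypothetical names. The threshold computed here is 2^(digits - 1), the magnitude from which every representable value is already an integer.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: 2^(digits - 1), the magnitude at and above which
// every representable T has no fractional bits left.
template <typename T>
T integer_only_threshold() {
    static_assert(std::numeric_limits<T>::radix == 2,
                  "sketch assumes a binary floating point type");
    return std::ldexp(T(1), std::numeric_limits<T>::digits - 1);
}

// True when ceil actually moves x to a different value.
template <typename T>
bool ceil_changes(T x) {
    return std::ceil(x) != x;
}
```

For an IEEE-754 float the threshold is 8,388,608: ceil_changes(8'388'606.5F) is true, while ceil_changes applied to the threshold, or to anything above it, is false because there is no fraction left to round up.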
