Largest uint64 which can be accurately represented in a float in C/C++

Question

I understand that floating point precision has only so many bits. It comes as no surprise that the following code thinks that (float)(UINT64_MAX) and (float)(UINT64_MAX - 1) are equal. I am trying to write a function which would detect this type of, for a lack of proper term, "conversion overflow". I thought I could somehow use FLT_MAX but that's not correct. What's the right way to do this?

#include <iostream>
#include <cstdint>

int main()
{
  uint64_t x1(UINT64_MAX);
  uint64_t x2(UINT64_MAX - 1);
  float f1(static_cast<float>(x1));
  float f2(static_cast<float>(x2));
  std::cout << f1 << " == " << f2 << " = " << (f1 == f2) << std::endl;
  return 0;
}

There isn't one way to do this. A 32 bit float has 23 bits of mantissa, so you can safely store and 23bit number. For larger numbers, it depends on how they can be represented. 2^60 for instance should be no problem as that doesn't require any mantissa bits (I think). — NathanOliver, Sep 28 '22 at 21:16
What actual problem are you trying to solve by doing this? We might be able to offer a different solution. — NathanOliver, Sep 28 '22 at 21:17
It's usually a bad idea to make an equality test of floating point values. Note that `double` too doesn't have enough bits to exactly represent the range of 64-bit values. — Weather Vane, Sep 28 '22 at 21:23
@WeatherVane - yes, I know that. The equality step in my example was simply to show that, due to loss of precision in conversion, the two resulting floats are identical. This is not how the real code will behave — Paul Grinberg, Sep 28 '22 at 21:25
@NathanOliver: The format commonly used for `float`, IEEE-754 binary32 a.k.a. “single precision”, has 24-bit significands, not 23. 23 are encoded in the primary significand field and one is encoded in the exponent field, and the format can represent all 24-bit integers, not just 23. (“Significand” is the preferred term; “mantissa” is an old word for the fraction part of a logarithm. Significands are linear, and mantissas are logarithmic.) — Eric Postpischil, Sep 28 '22 at 21:28
@NathanOliver - Suppose that float has only 2 bit mantissa. That means that you can represent 1.0, 1.25, 1.5, and 1.75 with whatever exponent. That would mean that I can accurately represent the 1, 125, 1250, 12500, etc, 15, 150, 1500, etc, and 175, 1750, 1750, etc. I am trying to detect which numbers I cannot represent in a float — Paul Grinberg, Sep 28 '22 at 21:29
I don't think you need to write a function for that. You can predict it quite easily with the relative machine precision of float https://en.cppreference.com/w/cpp/types/numeric_limits/epsilon — joergbrech, Sep 28 '22 at 21:30
Do not tag both C and C++ except when asking about differences or interactions between the two languages. Pick one language and delete the other tag. Techniques for solving this problem are different in the different languages (although one may be adaptable for the other). I think I have a related answer for C++ somewhere. — Eric Postpischil, Sep 28 '22 at 21:32
`long double` may have enough bits to exactly represent the range of 64-bit integers, but is implementation defined. For example with MSVC is the same as `double`. — Weather Vane, Sep 28 '22 at 21:35
In the normal case in C with `FLT_RADIX` 2, the largest integer representable in `float` before skipping begins is `scalbnf(1, FLT_MANT_DIG)`. See [this answer](https://stackoverflow.com/questions/52267201/largest-odd-integer-that-can-be-represented-as-a-float/52267683#52267683q) for some information about abnormal cases. (It answers a different question, and its cases are not all immediately adaptable to this one.) — Eric Postpischil, Sep 28 '22 at 21:38
You can determine whether an integer can be exactly represented by finding the highest set bit and the lowest set bit of its absolute value. That gives the number of significant bits. — Weather Vane, Sep 28 '22 at 21:39
@PaulGrinberg In that case just do a round trip conversion. if `foo == static_cast(static_cast(foo))` then `foo` is capable of being exactly representable. — NathanOliver, Sep 28 '22 at 21:44
This answer may help, although its in the other direction: https://stackoverflow.com/questions/48548598/what-float-values-could-not-be-converted-to-int-without-undefined-behavior-c/48549243#48549243 — Mike Vine, Sep 28 '22 at 22:30
Do you mean the largest integer value that a `float` can hold that can be incremented by `1` with fidelity? (At some point it will either stop incrementing, or it will increment by `2`.) — Eljay, Sep 28 '22 at 23:04
@EricPostpischil • thank you for clarifying *significand* and *mantissa*. I was wondering if they were synonyms or distinct from one another. — Eljay, Sep 28 '22 at 23:06
@PaulGrinberg "suppose that float has only 2 bit mantissa. That means that you can represent 1.0, 1.25, 1.5, and 1.75 with whatever exponent. That would mean that I can accurately represent the 1, 125, 1250, 12500, etc, 15, 150, 1500, etc, and 175, 1750, 1750, " is a mistake. `float` can represent represent 1.0, 1.25, 1.5, and 1.75 with whatever **binary** exponent. — chux - Reinstate Monica, Sep 29 '22 at 07:08
@NathanOliver Round trip success is needed, but perhaps not sufficient. The intermediate `float` value is not certainly equal to the `uint64_t`. — chux - Reinstate Monica, Sep 29 '22 at 07:10
Please clarify your question. Do you want to detect whether a specific integer cannot be converted exactly to `float`, determine the largest `uint64_t` value that is also a `float` value (even though many smaller `uint64_t` values will not be `float` values (that is, they cannot be converted exactly to `float`), or determine the largest integer for which it and all small integers are converted exactly to `float`? (The next after integer that will not be converted exactly, but many other integers beyond it will be, with gaps of increasing size as the magnitude increases.) — Eric Postpischil, Sep 29 '22 at 08:03
A solution for testing whether a specific integer can be converted exactly is [here](https://stackoverflow.com/a/52583193/298225). — Eric Postpischil, Sep 29 '22 at 08:04

chux - Reinstate Monica · Answer 1 · 2022-09-29T18:03:17.177

2

Largest uint64 which can be accurately represented in a float
What's the right way to do this?

When FLT_RADIX == 2, we are looking for a uint64_t of the form below where n is the max number of bits encodable in a float value. This is usually 24. See FLT_MANT_DIG from <float.h>.

111...(total of n binary digits)...111000...(64-n bits all zero)...000.
//
//1234561234567890
0xFFFFFF0000000000, in decimal: 18446742974197923840
// e.g. 
~( (1ull << (64-FLT_MANT_DIG)) - 1)

edited Sep 29 '22 at 18:03

answered Sep 29 '22 at 06:45

chux - Reinstate Monica

143,097
13
135
256

Note that the next smaller binary32 value is 0xFFFFFE0000000000, so it’s quite possible that 0x1000000 is desired instead for a **range** check. – Davis Herring Sep 29 '22 at 06:47
@DavisHerring True, its possible to read into OP's question various goals. I focused on the title. – chux - Reinstate Monica Sep 29 '22 at 06:51

AProgrammer · Answer 2 · 2022-09-29T09:28:43.383

1

The following function gives you the highest integer exactly representable in a floating point type such that all smaller positive integers are also exactly representable.

template<typename T>
T max_representable_integer()
{
    return std::scalbn(T(1.0), std::numeric_limits<T>::digits);
}

It does the computation in the floating point as for some the result may not be representable in a uint64_t.

edited Sep 29 '22 at 09:28

answered Sep 29 '22 at 07:30

AProgrammer

51,233
8
91
143

1

Avoid using `pow` for exponentiation where exactness is required. While the floating-point format allows `pow` to produce exact results in this case, bad `pow` implementations may produce inaccurate results. `scalbn` is a better choice. – Eric Postpischil Sep 29 '22 at 08:07
The question does not ask for “the highest integer exactly representable in a floating point type such that all smaller positive integers are also exactly representable.” – Eric Postpischil Sep 29 '22 at 08:08

Largest uint64 which can be accurately represented in a float in C/C++

2 Answers2

Linked