
Given a uint64_t value, is it possible to divide it by std::numeric_limits<uint64_t>::max() so that the result is a floating point representation of the value (0.0 to 1.0 representing 0 to 2^64-1)?

Numbers bigger than max can be chalked up to undefined behaviour, as long as every number equal to or smaller than max is correctly divided to its floating point "counterpart" (or, rather than the exact value, the nearest number the floating point type is capable of representing).

I'm not sure casting one (or both) sides to long double will result in correct values for all valid inputs, because the standard doesn't guarantee long double to have a mantissa of 64 bits. Is this possible at all?
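
For concreteness, this is the kind of code in question (a minimal sketch of the cast-and-divide approach):

#include <cstdint>
#include <limits>

// Is the result the nearest representable value for every input,
// on every conforming implementation?
long double to_unit_range(std::uint64_t n)
{
    return static_cast<long double>(n) /
           static_cast<long double>(std::numeric_limits<std::uint64_t>::max());
}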

Stijn Frishert
  • related: http://stackoverflow.com/questions/28652014/why-conversion-from-unsigned-long-long-to-double-can-lead-to-data-loss – NathanOliver Jun 14 '16 at 13:04
  • The standard doesn't even guarantee that `long double` has 24 bits of significand, let alone 64. It may e.g. be an IEEE half-precision number (total 16 bits, 11 for complete significand, 5 for exponent), and `double` and `float` would then be less or equal in the range/precision. – Ruslan Jun 14 '16 at 13:04
  • Exactly, no guarantees whatsoever. So is there _any_ other way of executing this division safely? – Stijn Frishert Jun 14 '16 at 13:09
  • The answer depends. If you just need to be sure that in the case when compilation succeeds you'll get the correct result, then your friends are `static_assert` and `numeric_limits` (see the sketch after these comments). If you want it to work on every compliant C++ implementation, you'll have to try really hard implementing this, again using `numeric_limits` to choose code paths if you want efficiency. – Ruslan Jun 14 '16 at 13:10
  • You may be interested in [`boost::multiprecision`](http://www.boost.org/doc/libs/1_61_0/libs/multiprecision/doc/html/boost_multiprecision/intro.html) – NathanOliver Jun 14 '16 at 13:11
  • @Ruslan: yes, I guess that's a fair enough requirement. Care to elaborate in an answer? – Stijn Frishert Jun 14 '16 at 13:15
  • @NathanOliver: yes, that seems interesting; I'll have a look. – Stijn Frishert Jun 14 '16 at 13:15
  • I will probably answer, but I must ask you to clarify your question: is `int(100*value/(2^64-1))` what you're trying to compute? – Ruslan Jun 14 '16 at 13:17
  • @StijnFrishert: what do you mean by "safely"? If you mean that the result must be different for all possible inputs then there's certainly no such type guaranteed by the language, but you could make static asserts about the value of `std::numeric_limits::digits`. – Steve Jessop Jun 14 '16 at 13:17
  • I'm trying to compute `float(value / 2^64-1)` without truncation; percentage is a poor choice of wording. 'Safely' means that all integer values within the min() and max() range are converted _correctly_ to their floating point 'equivalent'. Anything outside that range can be declared to be _undefined behaviour_. I'll update the original post. – Stijn Frishert Jun 14 '16 at 13:21
  • And by _correctly_ you mean _correctly rounded_ (i.e. to nearest, ties to even)? – Ruslan Jun 14 '16 at 13:31
  • What do you mean by "correctly"? In a 64 bit floating point type, `1.0 / (2^64 - 1)` is `5.42101086242752217003726400434970855712890625e-20` or whatever, but do you consider it "correct"? How many places would be "correct", given that the exact result cannot be represented in any radix-2 floating point type? – Steve Jessop Jun 14 '16 at 13:32
  • Ah, I understand you now. Yes, as correct as the floating point type is capable of representing (nearest). – Stijn Frishert Jun 14 '16 at 13:34
  • So double rounding e.g. to 64-bit, then to 32-bit is a no-no? – Ruslan Jun 14 '16 at 13:35
  • Your second paragraph is rather confusing, considering that `std::numeric_limits<uint64_t>::min()` is zero, and that there can't be numbers "outside of the min-max range" (otherwise it wouldn't be a min-max range). – interjay Jun 14 '16 at 13:35
  • Agreed. Updated. @Ruslan: what do you mean? I'm afraid I don't understand. – Stijn Frishert Jun 14 '16 at 13:38
  • I mean there're certain cases where gradual rounding gives a different result from single rounding. E.g. `1.496->1.5->2` vs `1.496->1`. Is this tolerable for you? You'll come across double rounding if you first calculate in extended precision (first rounding), and only then round to `float` (second). – Ruslan Jun 14 '16 at 13:50
  • Ah, I'd prefer as little precision loss as possible, so no. :) – Stijn Frishert Jun 14 '16 at 13:56
  • What am I missing here? Why isn't `static_cast<float>(n)/numeric_limits<uint64_t>::max()` your answer? – Jonathan Mee Jun 14 '16 at 14:15
  • @JonathanMee because it reduces precision by the first cast (try e.g. with `UINT64_MAX/3`). – Ruslan Jun 14 '16 at 14:16
  • @StijnFrishert Fine, use a `double` or `long double` if necessary in the `static_cast`. I still don't see the issue? – Jonathan Mee Jun 14 '16 at 14:18
  • @JonathanMee The C++ Standard doesn't give any guarantees about range or precision of floating-point types. Only IEEE 754 does, but it's not mandatory for C++ implementation. – Ruslan Jun 14 '16 at 14:18
  • @JonathanMee Because you also can't guarantee even `long double` is capable of holding UINT64_MAX. On most compilers/platforms you'll be fine, but as Ruslan says: no guarantees from the C++ standard. – Stijn Frishert Jun 14 '16 at 14:19
  • Not even on most: e.g. ARM doesn't have anything larger than IEEE binary64, and it's quite commonly used. – Ruslan Jun 14 '16 at 14:19
  • @Ruslan Wait I thought the C++ standard deferred to [IEEE double precision](https://en.wikipedia.org/wiki/IEEE_754-1985) (*1.8 \* 10^308*) That's easily large enough to hold 2^64 What's the problem I'm missing? – Jonathan Mee Jun 14 '16 at 14:25
  • @JonathanMee you're missing the possibility of `is_iec559==false`. Also, although the range is large, there're only 53 bits in the significand of binary64, so the precision will be lost. – Ruslan Jun 14 '16 at 14:26
  • @Ruslan According to http://en.cppreference.com/w/cpp/language/types#Range_of_values a standard compliant implementation supports `float`s up to *3.4 \* 10^38* and `double`s up to *1.8 \* 10^308*; compare that to `numeric_limits<uint64_t>::max()`, which is *1.8 \* 10^19*, and either standard compliant floating point representation should be sufficient to contain a `uint64_t`. – Jonathan Mee Jun 14 '16 at 14:35
  • @JonathanMee it's only correct for Format==IEEE 754, as given in the table. For any other the precision and range are implementation-defined. See `Floating point types`: _float - single precision floating point type. **Usually** IEEE-754 32 bit floating point type_ (emphasis mine). Same for `double`. – Ruslan Jun 14 '16 at 14:36
  • Oh, in fact, the C header `<float.h>` (which is inherited by C++'s `<cfloat>`) does define some minimum limits: `FLT_DIG>=6`, `(L)DBL_DIG>=10`, `{FLT,DBL,LDBL}_MIN_10_EXP<=-37`, `{FLT,DBL,LDBL}_MAX_10_EXP>=37` – Ruslan Jun 14 '16 at 14:47
  • @Ruslan if what you are saying is correct, then we at least know that a `double` will be large enough to contain a `uint64_t` right? – Jonathan Mee Jun 14 '16 at 14:56
  • @JonathanMee large enough to not become an infinity, yes. But not necessarily precise enough to preserve all the bits. – Ruslan Jun 14 '16 at 14:58
  • @Ruslan So we are agreeing that `static_cast<double>(n)/numeric_limits<uint64_t>::max()` should work right? – Jonathan Mee Jun 14 '16 at 15:00
  • @JonathanMee nope. It can't work exactly even with `double` being IEEE binary64, since `53<64`. You'd get an approximate answer, which would then be rounded once more to get the target `float`. If you prove that this double rounding doesn't change the result, then it should work. – Ruslan Jun 14 '16 at 15:01
  • @Ruslan Haha, when I read that the first time I thought you wanted me to prove that no rounding would occur when converting an integer to a floating point number. (Oops, when I read it the second time I got that too.) Of course rounding will occur. [Anything larger than 6 digits won't survive a round trip through a `float`.](http://stackoverflow.com/q/25785968/2642059) – Jonathan Mee Jun 14 '16 at 15:09
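
A minimal sketch of the static_assert guard that Ruslan and Steve Jessop describe in the comments above (the function name is illustrative, not from the thread):

#include <cstdint>
#include <limits>

// Compile only when F's significand can represent every uint64_t value
// exactly; then both casts are exact and the division rounds just once,
// so the result is correctly rounded.
template<typename F>
F to_unit_range_checked(std::uint64_t n)
{
    static_assert(std::numeric_limits<F>::digits >= std::numeric_limits<std::uint64_t>::digits,
                  "floating point type too narrow for exact conversion");
    return static_cast<F>(n) / static_cast<F>(std::numeric_limits<std::uint64_t>::max());
}

On IEEE-754 implementations this assertion fails for float (24 significand bits) and double (53), and passes for the x87 80-bit long double (64), which is exactly the portability problem being discussed.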

3 Answers


Multiprecision arithmetic is not required. In floating point arithmetic that uses fewer than 64 bits for the significand (a.k.a. mantissa), division by n_max = std::numeric_limits<uint64_t>::max() can be computed in an exactly rounded way (i.e. the result of the computation is identical to the closest approximation of the exact arithmetic ratio in the target floating point format) as follows:

n/n_max = n/(2^64 - 1) = (n/2^64) / (1 - 2^-64) = (n/2^64) * (1 + 2^-64 + 2^-128 + ...) = n/2^64 + whatever doesn't fit in the significand

Thus the result is

n/n_max = n/2^64

The following C++ test program implements both the naive and the accurate method of computing the ratio n/n_max:

#include <climits>
#include <cmath>
#include <iostream>
#include <limits>
#include <type_traits>


// The naive method: convert both operands to F and divide;
// each of the two conversions and the division rounds at most once.
template<typename F, typename U>
F map_to_unit_range_naive(U n)
{
    static_assert(std::is_floating_point<F>::value, "Result type must be a floating point type");
    static_assert(std::is_unsigned<U>::value, "Input type must be an unsigned integer type");
    return F(n)/F(std::numeric_limits<U>::max());
}

// The accurate method: scale by 2^-UBITS with std::ldexp, which only
// adjusts the exponent and so introduces no rounding error of its own.
template<typename F, typename U>
F map_to_unit_range_accurate(U n)
{
    static_assert(std::is_floating_point<F>::value, "Result type must be a floating point type");
    static_assert(std::is_unsigned<U>::value, "Input type must be an unsigned integer type");
    const int UBITS = sizeof(U) * CHAR_BIT;  // e.g. 64 for uint64_t
    return std::ldexp(F(n), -UBITS);
}

// Relative deviation of the naive result from the accurate one
// (0 means the two methods agree exactly).
template<class F, class U>
double error_mapping_to_unit_range(U n)
{
    const F r1 = map_to_unit_range_accurate<F>(n);
    const F r2 = map_to_unit_range_naive<F>(n);
    return (1-r2/r1);
}

#define CHECK_MAPPING_TO_UNIT_RANGE(n, result_type)                     \
    std::cout << "map_to_unit_range<" #result_type ">(" #n "): err="    \
              << error_mapping_to_unit_range<result_type>(n)*100 << "%" \
              << std::endl;

int main()
{
    CHECK_MAPPING_TO_UNIT_RANGE(123u,         float);
    CHECK_MAPPING_TO_UNIT_RANGE(123ul,        float);
    CHECK_MAPPING_TO_UNIT_RANGE(1234567890u,  float);
    CHECK_MAPPING_TO_UNIT_RANGE(1234567890ul, float);
    std::cout << "\n";
    CHECK_MAPPING_TO_UNIT_RANGE(123ul,        double);
    CHECK_MAPPING_TO_UNIT_RANGE(1234567890ul, double);
    return 0;
}

The program demonstrates that the naive method is on par with the carefully crafted code:

map_to_unit_range<float>(123u): err=0%
map_to_unit_range<float>(123ul): err=0%
map_to_unit_range<float>(1234567890u): err=0%
map_to_unit_range<float>(1234567890ul): err=0%

map_to_unit_range<double>(123ul): err=0%
map_to_unit_range<double>(1234567890ul): err=0%

This may seem surprising at first, but it has a simple explanation: if the floating point type cannot represent the integral value 2^N - 1 exactly, then it rounds it to 2^N, effectively resulting in an accurate division on the next step (according to the above formula).
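
That rounding can be observed directly; a minimal check, assuming an IEEE-754 double with a 53-bit significand:

#include <cstdint>
#include <cstdio>
#include <limits>

int main()
{
    // 2^64 - 1 is 1 away from 2^64 but 2047 away from the next
    // representable double below, so round-to-nearest picks 2^64.
    const double d = static_cast<double>(std::numeric_limits<std::uint64_t>::max());
    std::printf("%a\n", d);  // prints 0x1p+64 on such platforms
}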

Note that when the precision of the floating point type exceeds the size of the integer type (so that 2^N - 1 can be represented exactly), the premises for the formula are not met, and the "accurate" method stops being such:

int main()
{
    CHECK_MAPPING_TO_UNIT_RANGE(123u,        double);
    CHECK_MAPPING_TO_UNIT_RANGE(1234567890u, double);
    return 0;
}

Output:

map_to_unit_range<double>(123u): err=-2.32831e-08%
map_to_unit_range<double>(1234567890u): err=-2.32831e-08%

The "error" here is coming from the "accurate" method.


Credits:

Many thanks to @interjay and @Jonathan Mee for their thorough peer review of the previous versions of this answer.

Leon
  • This is essentially the correct answer: dividing by 2^n is exact for floating-point types (it's just `exponent -= 64`). The main implementation problem not mentioned is the +1, do you do that before or after casting n from `uint64_t` to float? If you do it before casting, you can't get 1.0 out: `(2^64-1)+1 == 0` due to integer wraparound. But if you do it after casting, you may have a rounding error (when `float(x+1)` would have rounded up, but `float(x)` was rounded down, and `float(x)+1.0f` == `float(x)` due to limited precision in the mantissa). Solution: special-case `~input==0` – MSalters Jun 14 '16 at 14:09
  • How would you get 0? – Jonathan Mee Jun 14 '16 at 14:13
  • Where did you get the `+1` from? This will give incorrect results for small values of `n` (e.g. it will incorrectly give a non-zero result for `n=0`, and a result that is about twice as large as the correct result for `n=1`). Simply `n/2^64` will be a much closer approximation. – interjay Jun 14 '16 at 14:21
  • @Leon n==0 was just an easy-to-check example. This is incorrect for all small values of n. – interjay Jun 14 '16 at 14:32
  • @Leon your RHS of the first equation should say `(n+n/2^64)/2^64+whatever=n/2^64+whatever1` instead of what it has now. – Ruslan Jun 14 '16 at 14:32
  • @Leon Did you check? For n=1, the correct result is about 5.42e-20, but your solution gives 1.08e-19. – interjay Jun 14 '16 at 14:38
  • @interjay Corrected – Leon Jun 14 '16 at 14:45
  • Still this might have a problem of double rounding, since the final format the OP wanted to have was `float`, while the very first division by `2^64` instead of `2^64-1` introduces a roundoff error (not due to the division itself, but rather due to the "wrong" divisor). Would be nice to prove that this never gives wrong results (if it's true). – Ruslan Jun 14 '16 at 14:57
  • @Ruslan The proof is in the formula derivation. – Leon Jun 14 '16 at 15:01
  • @Leon ah, I see: coarser rounding to nearest after a truncation never gives results different from rounding without prior truncation. Now I see this is the perfect answer. – Ruslan Jun 14 '16 at 15:05
  • Just a note here: *2^64* does not fit in a 64-bit integer, so I'm not certain how you can say that it can be computed precisely within a 64-bit floating point number. – Jonathan Mee Jun 14 '16 at 16:02
  • @JonathanMee 2^64 can be represented as a floating point number accurately. For example: `double a = 1 << 16; double b = a*a*a*a;` – Leon Jun 14 '16 at 16:06
  • @Leon Because we cannot know how an implementation defines a `double` you cannot say that. For example, if 1 bit is used for the sign and 1 bit for the exponent, you have a maximum of *2^62* from a 64-bit floating point number. More importantly, *2^64* cannot be represented *uniquely*: http://ideone.com/ceOsb3 – Jonathan Mee Jun 14 '16 at 16:25
  • @JonathanMee In general you are right. But in practice, nowadays implementation of `double` will mostly conform to the [IEEE 754](https://en.wikipedia.org/wiki/IEEE_floating_point) standard. According to that standard, powers of 2 (for a reasonable range of exponents containing the value 64) are represented accurately. – Leon Jun 14 '16 at 16:30
  • @JonathanMee I recommend that you read this paper: https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html – Leon Jun 14 '16 at 16:33
  • @Leon I think the point of dispute here is you're using the word "accurately" I'm using the term "uniquely". If I ask you where Washington D.C. is and you say: "In the US" that is *accurate* but not *unique*, because countless other cities are in the US too. With a floating point number whose mantissa is 64 bits long you will be able to *uniquely* represent numbers up to *2^64 - 1*. So I take issue with your statement: "It can be computed precisely (within 64-bit floating point arithmetic)" it cannot. It could be precisely computed in a floating point number with a 65-bit mantissa though. – Jonathan Mee Jun 14 '16 at 17:49
  • @JonathanMee There is a mathematically precise result and there is its closest approximation that can be represented in n-bit floating point format. By "precise computation within n-bit floating point arithmetic" I meant that the presented formula arrives at the "precise" result in that second sense. – Leon Jun 14 '16 at 17:58
  • @Leon Sounds like we're agreeing. Which part of that 98-page document were you suggesting that I read? Also, if `n` is a `uint64_t` you wouldn't be able to translate a `long double` of 1.0 which I understand from the question to be the upper bound of the range? – Jonathan Mee Jun 14 '16 at 18:17
  • @JonathanMee If you will be doing floating point computations where accuracy is important, then you should read it all (probably just skipping the proofs). Otherwise, look through the (main) [The IEEE Standard](https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html#799) section and the [Languages and Compilers](https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html#1043) subsection. – Leon Jun 15 '16 at 04:52
  • @JonathanMee You are right that translating 1.0 back to `UINT64_MAX` may not yield the correct result, but only if `UINT64_MAX` cannot be exactly represented in the target floating point type (and not because of computation errors). – Leon Jun 15 '16 at 04:54
  • @Leon I really think your `ldexp(F(n), -UBITS)` is clever, but its range does *not* include 1.0, while the less clever `F(n)/F(numeric_limits<U>::max())` *does* include 1.0 in its range. Because `std::numeric_limits<uint64_t>::max()` is *2^64 - 1* not *2^64*. This means your algorithms should *always* return an error but do not when there is not enough precision in the floating point type to uniquely represent *2^64 - 1* and *2^64*. If you were to use an `unsigned short` or `unsigned char` as your input integer type, even the `float` would fail. – Jonathan Mee Jun 15 '16 at 11:10
  • @JonathanMee You are right. The premise for the formula was that the floating point type cannot accommodate n_max in full precision. – Leon Jun 15 '16 at 11:40
1

The easiest, most strictly portable way, I believe, is boost::multiprecision::cpp_bin_float_quad:

#include <boost/multiprecision/cpp_bin_float.hpp>

#include <limits>
#include <cstdint>
#include <iostream>
#include <iomanip>


int main()
{
    using Float = boost::multiprecision::cpp_bin_float_quad;  // 113-bit significand: holds any uint64_t exactly

    for (std::uint64_t i = 0 ; i < 64 ; ++i)
    {
        auto v = std::uint64_t(1) << i;
        auto x = Float(v);

        x /= std::numeric_limits<std::uint64_t>::max();

        // demonstrate lossless round-trip
        auto y = x * std::numeric_limits<std::uint64_t>::max();

        std::cout << std::setprecision(std::numeric_limits<Float>::digits10)
        << (x * 100) << "% : "
        << std::hex << y.convert_to<std::uint64_t>()
        << std::endl;
    }
}

expected results:

5.42101086242752217033113759205528e-18% : 1
1.08420217248550443406622751841106e-17% : 2
2.16840434497100886813245503682211e-17% : 4
4.33680868994201773626491007364422e-17% : 8
8.67361737988403547252982014728845e-17% : 10
1.73472347597680709450596402945769e-16% : 20
3.46944695195361418901192805891538e-16% : 40
6.93889390390722837802385611783076e-16% : 80
1.38777878078144567560477122356615e-15% : 100
2.7755575615628913512095424471323e-15% : 200
5.55111512312578270241908489426461e-15% : 400
1.11022302462515654048381697885292e-14% : 800
2.22044604925031308096763395770584e-14% : 1000
4.44089209850062616193526791541169e-14% : 2000
8.88178419700125232387053583082337e-14% : 4000
1.77635683940025046477410716616467e-13% : 8000
3.55271367880050092954821433232935e-13% : 10000
7.1054273576010018590964286646587e-13% : 20000
1.42108547152020037181928573293174e-12% : 40000
2.84217094304040074363857146586348e-12% : 80000
5.68434188608080148727714293172696e-12% : 100000
1.13686837721616029745542858634539e-11% : 200000
2.27373675443232059491085717269078e-11% : 400000
4.54747350886464118982171434538157e-11% : 800000
9.09494701772928237964342869076313e-11% : 1000000
1.81898940354585647592868573815263e-10% : 2000000
3.63797880709171295185737147630525e-10% : 4000000
7.27595761418342590371474295261051e-10% : 8000000
1.4551915228366851807429485905221e-09% : 10000000
2.9103830456733703614858971810442e-09% : 20000000
5.8207660913467407229717943620884e-09% : 40000000
1.16415321826934814459435887241768e-08% : 80000000
2.32830643653869628918871774483536e-08% : 100000000
4.65661287307739257837743548967072e-08% : 200000000
9.31322574615478515675487097934145e-08% : 400000000
1.86264514923095703135097419586829e-07% : 800000000
3.72529029846191406270194839173658e-07% : 1000000000
7.45058059692382812540389678347316e-07% : 2000000000
1.49011611938476562508077935669463e-06% : 4000000000
2.98023223876953125016155871338926e-06% : 8000000000
5.96046447753906250032311742677853e-06% : 10000000000
1.19209289550781250006462348535571e-05% : 20000000000
2.38418579101562500012924697071141e-05% : 40000000000
4.76837158203125000025849394142282e-05% : 80000000000
9.53674316406250000051698788284564e-05% : 100000000000
0.000190734863281250000010339757656913% : 200000000000
0.000381469726562500000020679515313826% : 400000000000
0.000762939453125000000041359030627651% : 800000000000
0.0015258789062500000000827180612553% : 1000000000000
0.00305175781250000000016543612251061% : 2000000000000
0.00610351562500000000033087224502121% : 4000000000000
0.0122070312500000000006617444900424% : 8000000000000
0.0244140625000000000013234889800848% : 10000000000000
0.0488281250000000000026469779601697% : 20000000000000
0.0976562500000000000052939559203394% : 40000000000000
0.195312500000000000010587911840679% : 80000000000000
0.390625000000000000021175823681358% : 100000000000000
0.781250000000000000042351647362715% : 200000000000000
1.56250000000000000008470329472543% : 400000000000000
3.12500000000000000016940658945086% : 800000000000000
6.25000000000000000033881317890172% : 1000000000000000
12.5000000000000000006776263578034% : 2000000000000000
25.0000000000000000013552527156069% : 4000000000000000
50.0000000000000000027105054312138% : 8000000000000000

You'll get better performance with boost::multiprecision::float128 but it will only work on gcc (specifying -std=gnu++NN) or intel compilers.
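
For example, a minimal sketch of the same mapping with float128 (assuming GCC with -std=gnu++11 or later, linked against libquadmath):

#include <boost/multiprecision/float128.hpp>

#include <cstdint>
#include <iostream>
#include <limits>

int main()
{
    // float128 has a 113-bit significand, so every uint64_t converts exactly.
    using boost::multiprecision::float128;

    const std::uint64_t n = std::uint64_t(1) << 63;
    const float128 x = float128(n) / std::numeric_limits<std::uint64_t>::max();

    std::cout << x << '\n';  // ~0.5, correct to within one final rounding
}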

Richard Hodges

I would infer from your question:

I'm not sure casting one (or both) sides to long double will result in correct values for all valid inputs, because the standard doesn't guarantee long double to have a mantissa of 64 bits. Is this possible at all?

that what you're asking is:

Can any value representable by a uint64_t survive the round trip of being cast into a long double's mantissa and back to a uint64_t?

The answer is implementation dependent. The key lies in how many digits a long double uses for its mantissa. Fortunately, C++11 provides you with a way to get that: numeric_limits<long double>::digits. For example:

const auto ui64max = numeric_limits<uint64_t>::max();
const auto foo = ui64max - 1;
const auto bar = static_cast<long double>(foo) / ui64max;

cout << "Max Digits For Roundtrip Guarantee: " << numeric_limits<long double>::digits
     << "\nMax Digits In uint64_t: " << numeric_limits<uint64_t>::digits
     << "\nConverting: " << foo
     << "\nTo long double Mantissa: " << bar
     << "\nRoundtrip Back To uint64_t: " << static_cast<uint64_t>(bar * ui64max) << endl;

Live Example

You can validate this fact at compile time with something like:

static_assert(numeric_limits<long double>::digits >= numeric_limits<uint64_t>::digits, "long double has insufficient mantissa precision in this implementation");

For more information on the math supporting round trip questions you can look here: Float Fractional Precision

Jonathan Mee