9

I'm writing an image class based on unsigned integers. I'm using uint8_t and uint16_t buffers currently for 8-bit and 16-bit RGBA pixels, and to convert from 16-bit to 8-bit I simply have to take the 16 bit value, divide by std::numeric_limits< uint16_t >::max() converted to a double, then multiply that by 255.
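In code, that conversion looks roughly like this (`convert16to8` is just an illustrative name):

```cpp
#include <cstdint>
#include <limits>

// Sketch of the 16-bit -> 8-bit conversion described above:
// divide by the 16-bit max (as a double), then scale to 255.
uint8_t convert16to8(uint16_t v)
{
    double t = v / static_cast<double>(std::numeric_limits<uint16_t>::max());
    return static_cast<uint8_t>(t * 255.0);
}
```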

However, if I wanted to have an image with 64-bit unsigned integers for each RGBA component (I know, it's absurdly high), how would I go about finding a float/double between 0 and 1 that represents how far between 0 and the max uint64_t my pixel value is? I assume that converting to doubles wouldn't work because doubles are generally 64-bit floats, and you can't capture all 64-bit unsigned integer values in a 64-bit float. Dividing without converting to floats/doubles would just give me 0 or sometimes 1.

What is the most accurate way to find a floating point value between 0 and 1 that represents how far between 0 and the maximum possible an unsigned 64-bit value is?

Thomas
  • 485
  • 4
  • 15
  • *"to convert from 16-bit to 8-bit I simply have to take the 16 bit value, divide by std::numeric_limits< uint16_t >::max() converted to a double, then multiply that by 255."* Wouldn't a shift by 8 bits do the same thing? Looks like you are just taking 8 high-order bits. You could do the same with 64-bit to 8-bit conversion. – Igor Tandetnik Oct 24 '17 at 01:57
  • Just some rough back-of-the-envelope considerations: If you want to just divide by 2^64, then you're only going to cover the scales 2^0 down to 2^-64, that is, 6 bits worth of exponent. Together with the 53 mantissa bits, you'll get 59 bits worth of precision, so you lose 5 bits in the process. – Kerrek SB Oct 24 '17 at 02:05
  • 2
    Convert to double, then divide by 1.8446744073709552e+19. Why are you worried that a double won't be accurate enough? – Mark Ransom Oct 24 '17 at 02:17
  • Both `2^32` and `2^64` should be representable as a `double` without loss of precision. Let `h=2^32` and `H=2^64`. A 64 bit value can be represented as `A*h + B` where both `A` and `B` are 32 bit values. To convert the result to a value between 0 and 1, compute `(A*h + B)/H` or `A*h/H + B/H`. Mission accomplished. – Sam Varshavchik Oct 24 '17 at 02:43
  • 1
    Note that your result will always have a 53-bit significand (assuming standard IEEE-754 64-bit binary), so you are going to lose about 11 bits of precision. No algorithm can change that. And you can easily convert to `double` and divide by 2^64. The difference between that and what you are asking for, division by 2^64-1, is tiny. In a small number of cases, the floating-point number nearest n/2^64 differs from the floating-point number nearest n/(2^64-1), but only by at most one part in 2^52. That is much finer than the human eye can distinguish. Do you really need that accuracy? – Eric Postpischil Oct 24 '17 at 10:37
  • @EricPostpischil: I'm not sure that's entirely accurate. You lose 11 bits for the top half of your values, the ones that actually use 64 bits. You only lose 10 bits for the next smaller half, the numbers that use 63 bits, etc. You don't get a uniform coverage of [0, 1). (The question would be much easier if you wanted a uniform [1, 2) ....) – Kerrek SB Oct 24 '17 at 12:10
  • @KerrekSB: Yes, I should have said you will lose 11 bits or less. Still, it is less, so the result holds: The difference between dividing by 2^64 versus 2^64-1 is at most one part in 2^52 in the final result. I expect it is possible to examine the number, determine whether it would round down when rounding to `double` whereas dividing exactly by 2^64-1 and rounding to `double` would round up, and then adjust the number upward so that division by 2^64 matches division by 2^64-1. But I hesitate to work on that unless there is actual need for it. – Eric Postpischil Oct 24 '17 at 12:35
  • Well frankly these comments use a lot of terms and concepts that I'm not very familiar with, though I'm very grateful. Anyone care to show what you're arriving at in code, that would be the best combination in your estimation of simplicity, speed, and accuracy to convert an unsigned 64-bit integer to a float/double between 0 and 1? Would bit-shifting a fixed number of bits for both the value (as numerator) and max uint64_t (as the denominator) before converting to doubles give me better accuracy than just converting both to doubles and dividing? – Thomas Oct 24 '17 at 22:11
  • 2
    Converting to double and dividing will give you the maximum possible accuracy, the rules enforced by IEEE double practically guarantee it. You're overthinking this anyway. The conversion from `uint64_t` to `double` will lose a few bits, but it will be the fewest bits you can get away with - and those bits would be lost anyway at some point. The result will still be far more accurate than 99.9% of use cases will require. – Mark Ransom Oct 25 '17 at 04:12
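Putting Mark Ransom's convert-then-divide suggestion from these comments into code (a sketch; `toUnit` is my own name for it):

```cpp
#include <cstdint>

// Convert first, then divide by 2^64 (exactly representable as a double).
// The uint64_t -> double conversion rounds to nearest, losing at most
// the low 11 bits of the input.
double toUnit(uint64_t v)
{
    return static_cast<double>(v) / 18446744073709551616.0;  // 2^64
}
```

One caveat: `UINT64_MAX` itself rounds up to 2^64 during the conversion, so this can return exactly 1.0.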

3 Answers

5

What is the most accurate way to find a floating point value between 0 and 1 that represents how far between 0 and the maximum possible an unsigned 64-bit value is?

Mapping integer values in the range [0...2^64) to [0 ... 1.0) can be done directly:

  1. Convert from uint64_t to double.

  2. Scale by 2^64 (per @Mark Ransom's comment).

     #define TWO63 0x8000000000000000u
     #define TWO64f (TWO63*2.0)

     double map(uint64_t u) {
       double y = (double) u;  // rounds: only 53 significant bits survive
       return y/TWO64f;
     }
    

This will map:

Integer values in the range [2^63...2^64) to [0.5 ... 1.0): 2^52 different double values.
Integer values in the range [2^62...2^63) to [0.25 ... 0.5): 2^52 different double values.
Integer values in the range [2^61...2^62) to [0.125 ... 0.25): 2^52 different double values.
...
Integer values in the range [2^52...2^53) to [2^-12 ... 2^-11): 2^52 different double values.
Integer values in the range [0...2^52) to [0 ... 2^-12): 2^52 different double values.


Mapping integer values in the range [0...2^64) to [0 ... 1.0] is more difficult. (Note the ] vs. ).)


[Feb 2021] A note on the upper edge case: the conversion to double rounds to nearest, so values at and near UINT64_MAX round up to 2^64, and map() can return exactly 1.0.

chux - Reinstate Monica
  • 143,097
  • 13
  • 135
  • 256
2

You can get a start from the code of Java's java.util.Random.nextDouble() method, which takes 53 bits and forms a double from them:

   return (((long)next(26) << 27) + next(27))
     / (double)(1L << 53);

I would use the most significant 26 bits of your long for the shifted value, and the next 27 bits to fill in the low order bits. That discards the least significant 64-53 = 11 bits of the input.

If distinguishing very small values is especially important you could also use subnormal numbers, which nextDouble() does not return.

Patricia Shanahan
  • 25,849
  • 4
  • 38
  • 75
1

The OP asked for C++, so here it is, using the portable uint64_t from <cstdint> rather than the Visual Studio-specific __int64:

double asDouble(uint64_t v)
{
    // keep the top 53 bits; use 1ULL so the shift happens in 64 bits
    // (1L << 53 overflows on platforms where long is 32 bits)
    return (v >> 11) / (double)(1ULL << 53);
}

Or, if you don't mind funky bit manipulation:

double asDouble(uint64_t v)
{
    // A biased exponent of 0x3FF puts the bit pattern in [1.0, 2.0);
    // the top 52 bits of v fill the significand. Subtracting 1.0
    // shifts the result into [0, 1).
    uint64_t vv = (v >> 12) | (0x3FFULL << 52);
    double d;
    std::memcpy(&d, &vv, sizeof d);  // *(double*)&vv would violate strict aliasing
    return d - 1.0;
}
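For comparison, here is a self-contained sketch of both approaches with my own helper names. Note the exponent trick keeps only 52 bits of the input (the significand width) versus 53 for the division, so the two can differ in the last bit:

```cpp
#include <cstdint>
#include <cstring>

// Top 53 bits of v, scaled into [0, 1).
double byDivision(uint64_t v)
{
    return (v >> 11) / 9007199254740992.0;  // 2^53
}

// Exponent trick: a biased exponent of 0x3FF makes the bit pattern a
// double in [1.0, 2.0); the top 52 bits of v become the significand.
double byBits(uint64_t v)
{
    uint64_t bits = (v >> 12) | (0x3FFULL << 52);
    double d;
    std::memcpy(&d, &bits, sizeof d);  // bit-cast without aliasing UB
    return d - 1.0;
}
```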
Jesse Chisholm
  • 3,857
  • 1
  • 35
  • 29