C - Unsigned long long to double on 32-bit machine

Question

Hi I have two questions:

uint64_t vs double, which has a higher range limit for covering positive numbers?
How to convert double into uint64_t if only the whole number part of double is needed.

Direct casting apparently doesn't work due to how double is defined.

Sorry for any confusion, I'm talking about the 64bit double in C on a 32bit machine.

As for an example:

//operation for conversion I used:
uint64_t sampleRate = (
                      (union { double i; uint64_t sampleRate; })
                      { .i = r23u.outputSampleRate}
                    ).sampleRate;

//the following are printouts on command line
//         double                               uint64_t
//printed by   %.16llx                           %.16llx
outputSampleRate  0x41886a0000000000      0x41886a0000000000 sampleRate

//printed by   %f                                    %llu
outputSampleRate  51200000.000000        4722140757530509312 sampleRate

So the two numbers have the same bit pattern, but when printed as decimals, the uint64_t value is totally wrong. Thank you.

Are you talking about `double` as specified in C, or `double` as in ieee754 binary64 floating-point? — EOF, Aug 05 '16 at 20:02
What do you mean by "Direct casting apparently doesn't work" ? How does it not work? Give an example of the number, and what you got when you cast it to double, and how you examined its value. — FredK, Aug 05 '16 at 20:06
`double` has a much larger "maximum". But "covering"? It is losing precision proportionally to a logarithm of the value. So at some point even the "whole" part will suffer from imprecision. — Eugene Sh., Aug 05 '16 at 20:08
Well, I guess you should start [here](https://en.wikipedia.org/wiki/Double-precision_floating-point_format). — Eugene Sh., Aug 05 '16 at 20:25
You can't convert between `double` and `uint64_t` by using a union. That seems to be what you mean by "direct casting". You have to assign the value to the other type. — Weather Vane, Aug 05 '16 at 20:29
Of course interpreting a bit pattern as if it represents a double will give you a completely different answer if you interpret that bit pattern as an integer. — FredK, Aug 05 '16 at 20:29

John Bollinger · Accepted Answer · 2016-08-05T20:50:06.577

uint64_t vs double, which has a higher range limit for covering positive numbers?

uint64_t, where supported, has 64 value bits, no padding bits, and no sign bit. It can represent all integers between 0 and 2⁶⁴ - 1, inclusive.

Substantially all modern C implementations represent double in IEEE-754 64-bit binary format, but C does not require nor even endorse that format. It is so common, however, that it is fairly safe to assume that format, and maybe to just put in some compile-time checks against the macros defining FP characteristics. I will assume for the balance of this answer that the C implementation indeed does use that representation.
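A minimal sketch of such compile-time checks, assuming a C11 compiler. The macros come from `<float.h>`; the helper name `double_looks_like_binary64` is made up for illustration. These checks only confirm the format's parameters (radix, significand width, exponent range, size), not full IEEE-754 conformance.

```c
#include <assert.h>  /* static_assert (C11) */
#include <float.h>   /* FLT_RADIX, DBL_MANT_DIG, DBL_MAX_EXP, DBL_MIN_EXP */
#include <limits.h>  /* CHAR_BIT */

/* Refuse to compile if double does not have the shape of IEEE-754 binary64. */
static_assert(FLT_RADIX == 2, "double is not a binary format");
static_assert(DBL_MANT_DIG == 53, "double does not have a 53-bit significand");
static_assert(DBL_MAX_EXP == 1024 && DBL_MIN_EXP == -1021,
              "double does not have the binary64 exponent range");
static_assert(sizeof(double) * CHAR_BIT == 64, "double is not 64 bits wide");

/* Hypothetical run-time counterpart, for code that prefers to report
   the mismatch instead of failing the build. */
int double_looks_like_binary64(void)
{
    return FLT_RADIX == 2 && DBL_MANT_DIG == 53 &&
           DBL_MAX_EXP == 1024 && DBL_MIN_EXP == -1021 &&
           sizeof(double) * CHAR_BIT == 64;
}
```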

IEEE-754 binary double precision provides 53 bits of mantissa, therefore it can represent all integers between 0 and 2⁵³ - 1. It is a floating-point format, however, with an 11-bit binary exponent. The largest number it can represent is (2⁵³ - 1) * 2⁹⁷¹, or nearly 2¹⁰²⁴. In this sense, double has a much greater range than uint64_t, but the vast majority of integers between 0 and its maximum value cannot be represented exactly as doubles, including almost all of the numbers that can be represented exactly by uint64_t.
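To make the exactness limit concrete, here is a small sketch, assuming binary64 as above. The function name `survives_round_trip` is hypothetical; it reports whether a uint64_t value comes back unchanged after a trip through double.

```c
#include <assert.h>  /* static_assert (C11) */
#include <float.h>
#include <stdint.h>

static_assert(FLT_RADIX == 2 && DBL_MANT_DIG == 53,
              "sketch assumes IEEE-754 binary64");

/* Returns 1 if n converts to double and back without changing, else 0. */
int survives_round_trip(uint64_t n)
{
    double d = (double)n;             /* rounds if n needs more than 53 bits */
    if (d >= 18446744073709551616.0)  /* 2^64: converting back would be undefined */
        return 0;
    return (uint64_t)d == n;
}
```

Under these assumptions, `survives_round_trip` accepts 2⁵³ and 2⁵³ + 2 but rejects 2⁵³ + 1, which is the first integer a double cannot hold exactly. It also rejects `UINT64_MAX`.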

How to convert double into uint64_t if only the whole number part of double is needed

You can simply assign (conversion is implicit), or you can explicitly cast if you want to make it clear that a conversion takes place:

double my_double = 1.2345678e18;  /* must be below 2^64, or the conversion is undefined */
uint64_t my_uint;
uint64_t my_other_uint;

my_uint = my_double;
my_other_uint = (uint64_t) my_double;

Any fractional part of the double's value will be truncated. The integer part will be preserved exactly if it is representable as a uint64_t; otherwise, the behavior is undefined.
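Because an out-of-range conversion is undefined, it can be worth checking the range first. The following is one possible sketch, not the only way to do it; the helper name `double_to_u64` and its bool-plus-out-parameter interface are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stores the truncated value of d in *out and returns true when the
   integer part fits in uint64_t; returns false for negative values
   at or below -1, values at or above 2^64, infinities, and NaN. */
bool double_to_u64(double d, uint64_t *out)
{
    assert(out != NULL);
    /* Written so that NaN is rejected: every comparison with NaN is false. */
    if (!(d > -1.0 && d < 18446744073709551616.0))  /* upper bound is 2^64 */
        return false;
    *out = (uint64_t)d;  /* fractional part is discarded (truncation toward zero) */
    return true;
}
```

With this helper, 51200000.0 yields 51200000 and 3.99 yields 3, while 1.2345678e48 and -1.0 are rejected instead of invoking undefined behavior. Values strictly between -1 and 0 truncate to 0, which is representable, so they are accepted.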

The code you presented uses a union to overlay storage of a double and a uint64_t. That's not inherently wrong, but it's not a useful technique for converting between the two types. Casts are C's mechanism for all non-implicit value conversions.
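The difference between the two operations can be sketched with the value from the question, assuming a 64-bit IEEE-754 double. The names `bits_of` and `value_of` are hypothetical; `memcpy` is used here as a portable stand-in for the union overlay.

```c
#include <assert.h>  /* static_assert (C11) */
#include <stdint.h>
#include <string.h>

static_assert(sizeof(double) == sizeof(uint64_t),
              "sketch assumes a 64-bit double");

/* Reinterpretation: copies the storage bits unchanged, like the union does.
   The result is the encoding of d, not its numeric value. */
uint64_t bits_of(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

/* Value conversion: produces the integer part of d.
   The caller must ensure that the integer part fits in uint64_t. */
uint64_t value_of(double d)
{
    return (uint64_t)d;
}
```

For 51200000.0, `bits_of` returns 0x41886a0000000000, which is 4722140757530509312 in decimal and matches the "wrong" number printed in the question, while `value_of` returns 51200000.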

C11 draft standard n1570, 6.3.1.4 Real floating and integer, paragraph 1: *When a finite value of real floating type is converted to an integer type other than _Bool, the fractional part is discarded (i.e., the value is truncated toward zero). If the value of the integral part cannot be represented by the integer type, the behavior is undefined.* — EOF, Aug 05 '16 at 20:48

score 1 · Answer 2 · edited May 23 '17 at 12:22

double can hold substantially larger numbers than uint64_t: the value range of an 8-byte IEEE 754 double runs from 4.94065645841246544e-324 to 1.79769313486231570e+308 in magnitude (positive or negative) [taken from here][more detailed explanation]. However, if you add small values to a large number in that range, you will be in for a surprise: at some point the precision is too coarse to represent, e.g., an addition of 1, and the result rounds back to the original value, so a loop that steadily increments by 1 never terminates.

This code for example:

#include <stdio.h>
int main()
{
    for (double i = 100000000000000000000000000000000.0; i < 1000000000000000000000000000000000000000000000000.0; ++i)
        printf("%lf\n", i);
    return 0;
}

gives me a constant output of 100000000000000005366162204393472.000000. That's also why we have nextafter and nexttoward functions in math.h. You can also find ceil and floor functions there, which, in theory, will allow you to solve your second problem: removing the fraction part.
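The same effect can be checked without an endless loop. This is a small sketch assuming IEEE-754 binary64 and the default round-to-nearest mode; the function name `increment_is_absorbed` is made up for illustration.

```c
#include <assert.h>  /* static_assert (C11) */
#include <float.h>

static_assert(FLT_RADIX == 2 && DBL_MANT_DIG == 53,
              "sketch assumes IEEE-754 binary64");

/* Returns 1 if x + 1.0 rounds back to x, i.e. ++x would not change x. */
int increment_is_absorbed(double x)
{
    /* volatile forces the sum to be stored as a double, so any extra
       precision kept in registers cannot hide the rounding. */
    volatile double next = x + 1.0;
    return next == x;
}
```

Under these assumptions the function returns 1 for 1e32 and for 2⁵³ (9007199254740992.0), and 0 for 1e15, where neighboring doubles are still less than 1 apart.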

However, if you really need to hold large numbers you should look at bigint implementations instead, e.g. GMP. Bigints were designed to do operations on very large integers, and operations like an addition of one will truly increment the number even for very large values.
