Is it possible to round-trip a floating point double to two decimal integers with fidelity?

Question

I am trying to discern whether it is possible to decompose a double precision IEEE floating point value into to two integers and recompose them later with full fidelity. Imagine something like this:

double foo = <inputValue>;
double ipart = 0;
double fpart = modf(foo, &ipart);

int64_t intIPart = ipart;
int64_t intFPart = fpart * <someConstant>;

double bar = ((double)ipart) + ((double)intFPart) / <someConstant>;

assert(foo == bar);

It's logically obvious that any 64-bit quantity can be stored in 128-bits (i.e. just store the literal bits.) The goal here is to decompose the integer part and the fractional part of the double into integer representations (to interface with and API whose storage format I don't control) and get back a bit-exact double when recomposing the two 64-bit integers.

I have a conceptual understanding of IEEE floating point, and I get that doubles are stored base-2. I observe, empirically, that with the above approach, sometimes foo != bar for even very large values of <someConstant>. I've been out of school a while, and I can't quite close the loop in my head terms of understanding whether this is possible or not given the different bases (or some other factor).

EDIT:

I guess this was implied/understood in my brain but not captured here: In this situation, I'm guaranteed that the overall magnitude of the double in questions will always be within +/- 2^63 (and > 2^-64). With that understanding, the integer part is guaranteed to fit within a 64-bit int type then my expectation is that with ~16 bits of decimal precision, the fractional part should be easily representable in a 64-bit int type as well.

No way to do this with ``, you would need huuge integers. You can use significant and exponent however. Check `frexp`. — zch, May 14 '13 at 13:01
Why not `uint32_t i, f; memcpy(&i, &ipart, 4); memcpy(&f, &fpart, 4);`? — , May 14 '13 at 13:03
Do you want those integers to have any specific meaning? Otherwise just copy the binary representation into a `uint64_t` or a `uint32_t`. — Kerrek SB, May 14 '13 at 13:04
Yes, they need to be decimal integers. Simply copying the bits is not possible here. — ipmcc, May 14 '13 at 13:06
You need to specify your end requirements more clearly. Your question is asking "How do I make this flawed approach work?" rather than "Here is the problem I need to solve, how do I correctly do it?" — Amardeep AC9MF, May 14 '13 at 13:40
Is there a reason you would not want to simply interpret the eight-byte encoding of the `double` as a 64-bit integer (assuming endian issues were managed)? — Eric Postpischil, May 14 '13 at 13:48
@ipmcc What exactly do you mean by "decimal integers"? An integer is an integer unless you're talking about binary coded decimal, which is a whole other conversation. — Caleb, May 14 '13 at 14:20
My use of "decimal integer" was meant to convey that I don't simply want to shove the double's 64 bits of info into a 64 bit integer type. — ipmcc, May 14 '13 at 17:19
Heh. The question that keeps on taking. I get it. I was having a brainfart. I'd delete it if I could. I guess it's a cautionary tale. — ipmcc, Jul 18 '13 at 11:46
This is an old question but it looks like the original cause for confusion was missed. It's not true that "any 64-bit quantity can be stored in 128-bits" if the meanings of those bits differ. In this case, some of the bits in a float encode an exponent. It takes a very large integer width to be able to store any result of 2^x if x is an integer even of modest width. The original goal cannot be achieved for arbitrary double values. — Praxeolitic, Oct 10 '17 at 07:43

Eric Postpischil · Accepted Answer · 2013-05-14T16:08:45.557

5

If you know the number is in [–2⁶³, +2⁶³) and the ULP (the value of the lowest bit in the number) is at least 2^-63, then you can use this:

double ipart;
double fpart = modf(foo, &ipart);

int64_t intIPart = ipart;
int64_t intFPart = fpart * 0x1p63;

double bar = intIPart + intFPart * 0x1p-63;

If you just want a couple of integers from which the value can be reconstructed and do not care about the meaning of those integers (e.g., it is not necessary that one of them be the integer part), then you can use frexp to disassemble the number into its significand (with sign) and exponent, and you can use ldexp to reassemble it:

int exp;
int64_t I = frexp(foo, &exp) * 0x1p53;
int64_t E = exp;

double bar = ldexp(I, E-53);

This code will work for any finite value of an IEEE-754 64-bit binary floating-point object. It does not support infinities or NaNs.

It is even possible to pack I and E into a single int64_t, if you want to go to the trouble.

edited May 14 '13 at 16:08

answered May 14 '13 at 13:40

Eric Postpischil

195,579
13
168
312

Using a library solution makes sense, +1 – James Brock May 14 '13 at 14:11
1

If negative numbers are allowed for `foo`, then `fpart` can be negative. Its absolute value can be stored in a `uint64_t` because it is always the same sign as `ipart` but the code should be `intFPart = fabs(fpart) * 0x1.0p64`. The computation of `bar` should involve `copysign(intFPart * 0x1.0p-64, (double)intIPart)`. – Pascal Cuoq May 14 '13 at 14:58
@PascalCuoq: That can lose the sign if `ipart` is zero. For simplicity, I changed intFPart to signed and reduced the supported range by one bit. We can update it if the OP indicates a need for more. – Eric Postpischil May 14 '13 at 16:08
Thanks. This is the closest answer to what I was looking for. – ipmcc May 14 '13 at 17:20

score 1 · Answer 2 · edited May 23 '17 at 12:02

The goal here is to decompose the integer part and the fractional part of the double into integer representations

You can't even get just the integer part or just the fractional part reliably. The problem is that you seem to misunderstand how floating point numbers are stored. They don't have an integer part and a fractional part. They have a significant digits part, called the mantissa, and an exponent. The exponent essentially scales the mantissa up or down, similar to how scientific notation works.

A double-precision floating point number has 11 bits for the exponent, giving a range of values that's something like 2^-1022...2¹⁰²³. If you want to store the integer and fractional parts, then, you'll need two integers that each have about 2¹⁰ bits. That'd be a silly way to do things, though -- most of those bits would go unused because only the bits in the mantissa are significant. Using two very long integers would let you represent all the values accross the entire range of a double with the same precision everywhere, something that you can't do with a double. You could have, for example, a very large integer part with a very small fractional part, but that's a number that a double couldn't accurately represent.

Update

If, as you indicate in your comment, you know that the value in question is within the range ±2⁶³, you can use the answer to Extract fractional part of double *efficiently* in C, like this:

double whole = // your original value
long iPart = (long)whole;
double fraction = whole - iPart;
long fPart = fraction * (2 << 63);

I haven't tested that, but it should get you what you want.

I'm quite familiar with how they're stored as a base-2 fraction, exponent, and sign bit. I guess the information that's missing is that I'm dealing with a particular practical situation in which I know that the magnitude of the double will always be well within the range of +-2^63. — ipmcc, May 14 '13 at 13:30
@ipmcc It would've been considerate if you had included that information in your question. — Caleb, May 14 '13 at 13:34
@ipmcc Unless the information that's missing also includes the knowledge that the double is above 2^-64, that only solves half of the issue. — Pascal Cuoq, May 14 '13 at 13:35

score 0 · Answer 3 · answered May 14 '13 at 13:36

0

See wikipedia for the format of a double:

http://en.wikipedia.org/wiki/Double-precision_floating-point_format

IEEE double format encodes three integers: the significand, the exponent, and the sign bit. Here is code which will extract the three constituent integers in IEEE double format:

double d = 2.0;  

// sign bit
bool s = (*reinterpret_cast<int64_t*>(&d)) >> 63;

// significand
int64_t m = *reinterpret_cast<int64_t*>(&d) & 0x000FFFFFFFFFFFFFULL;

// exponent
int64_t e = ((*reinterpret_cast<int64_t*>(&d) >> 52) & 0x00000000000007FFULL) - 1023;

// now the double d is exactly equal to s * (1 + (m / 2^52)) * 2^e
// print out the exact arithmatic expression for d:

std::cout << "d = " << std::dec << (s ? "-(1 + " : "(1 + (") << m << "/" << (1ULL << 52) << ")) x 2^" << e;

answered May 14 '13 at 13:36

James Brock

3,236
1
28
33

@PascalCuoq Sure enough, I didn't notice that the question is tagged C. This answer is C++, but the important bits are in the C subset of C++. – James Brock May 14 '13 at 13:42
1

And why use unsupported casts to access the encoding when there is a standard library function for extracting the significand and exponent? That is what `frexp` is for. – Eric Postpischil May 14 '13 at 13:50
@EricPostpischil ipmcc wants the integer expression for a double. frexp doesn't quite do that. – James Brock May 14 '13 at 13:54
1

@JamesBrock: But it enables you do it in a simple supported way. Just multiply by `0x1p53` and convert to integer, as shown in my answer. – Eric Postpischil May 14 '13 at 14:06

Is it possible to round-trip a floating point double to two decimal integers with fidelity?

3 Answers3