I have a requirement to represent 256-bit whole numbers. Right now I'm using __uint128_t[2]. Check out the code below.
#include <stdio.h>

int main(void)
{
    __uint128_t a = 0xffffffffffffffffULL;  /* low 64 bits all set */
    a <<= 64;                               /* move them into the high half */
    a += 0xffffffffffffffffULL;             /* a is now 2^128 - 1 */
    double b = a;
    printf("Output - %lf\n", b);
    return 0;
}
The output is Output - 340282366920938463463374607431768211456.000000, which is just 1 greater than the correct value, 340282366920938463463374607431768211455.
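As a sanity check, the small program below (just a sketch, assuming GCC or Clang, since __uint128_t is a compiler extension) confirms that the converted double is exactly 2^128:

#include <stdio.h>
#include <math.h>

int main(void)
{
    __uint128_t a = 0xffffffffffffffffULL;
    a <<= 64;
    a += 0xffffffffffffffffULL;              /* a == 2^128 - 1 */

    double b = a;
    /* ldexp(1.0, 128) is exactly 2^128, which a double can hold */
    printf("b == 2^128 ? %s\n", b == ldexp(1.0, 128) ? "yes" : "no");
    return 0;
}

This prints yes, i.e. b holds exactly 2^128.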
Now I will change the value of a.
#include <stdio.h>

int main(void)
{
    __uint128_t a = 0xffffffffffffffffULL;  /* low 64 bits all set */
    a <<= 64;                               /* a is now 2^128 - 2^64 */
    double b = a;
    printf("Output - %lf\n", b);
    return 0;
}
The output is again Output - 340282366920938463463374607431768211456.000000, which is well off the correct value, 340282366920938463444927863358058659840.
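One more observation before my questions. This sketch (assuming IEEE-754 doubles and the standard <math.h> functions) prints the spacing between consecutive doubles just below 2^128:

#include <stdio.h>
#include <math.h>

int main(void)
{
    double p128  = ldexp(1.0, 128);       /* exactly 2^128 */
    double below = nextafter(p128, 0.0);  /* largest double below 2^128 */

    printf("2^128            = %.0f\n", p128);
    printf("next double down = %.0f\n", below);
    printf("spacing          = %.0f\n", p128 - below);
    return 0;
}

The spacing it prints is 2^75, which I assume is related to what I am seeing.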
1 - What is happening here? Why is the first result just 1 off from the correct value?
2 - There is a lot of confusion online about the maximum whole number that can be represented in the double datatype. What is the maximum? Please give the largest positive whole number, either as a number of bits or as the value itself.
3 - Follow-up to question 2 - Let's say the answer above is 48 bits. The reason I am using double is to divide two 48-bit numbers and get the first digit after the decimal point. Can the fact that I only need one digit after the decimal point increase that maximum value? (See the sketch below for what I mean.)
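To make question 3 concrete, here is a minimal sketch of the kind of division I mean (the operand values are made up; they just need to fit in 48 bits):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* both values fit in 48 bits, and both are below 2^53,
       so they convert to double exactly */
    uint64_t x = 250000000000001ULL;     /* numerator */
    uint64_t y = 100000000000000ULL;     /* denominator */

    double q = (double)x / (double)y;
    printf("quotient to one decimal place: %.1f\n", q);
    return 0;
}

All I need from q is the 2.5, nothing beyond the first digit after the decimal point.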
Edit - I did read this link before posting. What is 1.8*10^308 here, and what is 2^53 here? They are both described as the biggest possible integer that can be represented in a double without loss of precision.
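For reference, this sketch just prints the two numbers side by side (assuming a standard C environment with <float.h> and <math.h>):

#include <stdio.h>
#include <float.h>
#include <math.h>

int main(void)
{
    double max_finite = DBL_MAX;         /* about 1.8 * 10^308 */
    double limit      = ldexp(1.0, 53);  /* 2^53 = 9007199254740992 */

    printf("DBL_MAX            = %e\n", max_finite);
    printf("2^53               = %.0f\n", limit);
    printf("2^53 + 1 as double = %.0f\n", limit + 1.0);
    return 0;
}

The last line prints 9007199254740992 again, which only adds to my confusion about which of the two numbers is the real limit.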