
Problem statement: I am working on code that calculates big numbers, so I quickly get beyond the maximum range of "long double". Here is an example below, showing the part of the code that generates the big numbers:

int n;
long double a[1761], b[1761], c[1761], c2[1761]; // a, b, c (and c2, used below) are 1D arrays of long double
long double summ;

  a[1] = 1;
  b[1] = 1;
  c[1] = 1;
  summ = 1 + c[1];
  for(n=2; n<=1760; n++){
    a[n] = n*n;
    b[n] = n;
    c[n] = c[n-1]*a[n-1]/b[n]; // let us assume we have this kind of operation
    summ = summ + c[n];        // so basically, summ = 1+c[1]+c[2]+c[3]+...+c[1760]
  }

The intermediate values of summ and c[n] are then used to evaluate the ratio c[n]/summ for every integer n. Just after the above loop, I do:

  for(n=1; n<=1760; n++){
    c2[n] = c[n]/summ; // summ here equals 1+c[1]+c[2]+c[3]+...+c[1760]
  }

Output: If we print n, c[n] and summ, we obtain inf after n=1755 because we exceed the range of long double:

n            c[n]            summ
1752     2.097121e+4917  2.098320e+4917
1753     3.672061e+4920  3.674159e+4920
1754     6.433452e+4923  6.437126e+4923
1755     1.127785e+4927  1.128428e+4927
1756     inf             inf
1757     inf             inf
1758     inf             inf
1759     inf             inf
1760     inf             inf

Of course, if there is an overflow for c[n] and summ, I cannot evaluate the quantity of interest, which is c2[n].

Questions: Does anyone see a solution for this? How do I need to change the code so as to get finite numerical values (for arbitrary n)? I will indeed most likely need to go to very big numbers (n can be much larger than 1760).

Proposition: I know that GNU Multiple Precision Arithmetic (GMP) might be useful but honestly found too many difficulties trying to use this (outside the field), so if there is an easier way to solve this, I would be glad to read it. Otherwise, I will be forever grateful if someone could apply GMP or any other method to solve the above-mentioned problem.
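
For reference, here is roughly what the GMP route could look like. This is only a minimal sketch: it uses GMP's mpf floating-point type, whose exponent range is vastly larger than long double's, so none of these values overflow; the 128-bit precision is an arbitrary illustrative choice, and the code must be linked with -lgmp.

#include <stdio.h>
#include <gmp.h>

#define N 1760

int main( void ) {
    mpf_set_default_prec( 128 );      /* 128-bit mantissa (illustrative)  */
    static mpf_t c[N + 1], c2[N + 1];
    mpf_t summ;
    mpf_init( summ );
    for( int n = 1; n <= N; n++ ) { mpf_init( c[n] ); mpf_init( c2[n] ); }

    mpf_set_ui( c[1], 1 );            /* c[1] = 1                         */
    mpf_set_ui( summ, 2 );            /* summ = 1 + c[1]                  */
    for( int n = 2; n <= N; n++ ) {
        /* c[n] = c[n-1] * a[n-1] / b[n], with a[n-1] = (n-1)^2, b[n] = n */
        mpf_mul_ui( c[n], c[n-1], (unsigned long)(n - 1) * (n - 1) );
        mpf_div_ui( c[n], c[n], n );
        mpf_add( summ, summ, c[n] );
    }
    for( int n = 1; n <= N; n++ )
        mpf_div( c2[n], c[n], summ ); /* normalized ratios, all finite    */

    gmp_printf( "c2[%d] = %.10Fe\n", N, c2[N] );
    return 0;
}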

  • `found too many difficulties trying to use this (outside the field)` which difficulties are you facing? And there are so many big decimal libraries out there: https://stackoverflow.com/q/4800255/995714, https://stackoverflow.com/q/2568446/995714, https://stackoverflow.com/q/4798777/995714. They'll be much easier to use in C++ due to the operator overloading. Depending on how large your values are and which compiler you're using, you can also use [`__float128`](https://gcc.gnu.org/onlinedocs/gcc/Floating-Types.html) in gcc, Clang... – phuclv Oct 08 '22 at 06:35
  • Have you heard of _The Butterfly Effect_... Compounding those values through multiple iterations amplifies the error... I ain't gonna speculate on how many iterations you can get away with in any floating point scheme before the numbers are pretty much worthless. – Fe2O3 Oct 08 '22 at 06:38
  • @Fe2O3 Yes I know but after the loop I divide c[n] by the summ, so basically I normalize. But my intermediate values generated by the loop are big. – physics-lab Oct 08 '22 at 06:52
  • Re: "[...] but after the loop I divide c[n] by the summ, so basically I normalize." Can you describe what you actually want to calculate? You can probably re-arrange your expression to avoid the overflow within your intermediate expressions. – chtz Oct 08 '22 at 07:00
  • @phuclv Well, I am not familiar at all with GMP (the "rules"/nomenclature seem quite different). I do not know the equivalent of arrays a[n], b[n] ... for instance. I will try to follow your recommendations, though. – physics-lab Oct 08 '22 at 07:01
  • The product of the formula ( c * a ) multiplies values at significant powers-of-ten... Dividing by a number that is about 10^3 isn't going to trim much inaccuracy off the tail. – Fe2O3 Oct 08 '22 at 07:03
  • @chtz Yes, I tried to do that, but it did not work well because I always need to evaluate the product "c[n-1]*a[n-1]/b[n]" somewhere. To answer your question, you can see c[n] as the non-normalized probabilities (as a function of the integer n), and "summ" as the sum of the non-normalized probabilities c[n]. Then, the normalized probability for each value of n is just the non-normalized one divided by summ. – physics-lab Oct 08 '22 at 07:08
  • @Fe2O3 Yes I tried this but of course this just pushes the value of the maximum bound of n (1760) to a bit larger value... So it just displaces the problem. – physics-lab Oct 08 '22 at 07:10
  • Of course you _tried it_. It's there in your code... Maybe I am not making myself understood... Iterative calculations (multiply/divide) involving floating point approximations of numbers will always 'drift' from what they would be if greater precision was involved. That is why people use `double` instead of `float` and `long double` instead of `double`. Run your calcs using an 8-byte double and compare the results with your `long double` results. Yes, more confidence with `long double` numbers, but, as you say, "that just displaces the problem" of compounding the inherent error of approximations. – Fe2O3 Oct 08 '22 at 07:22
  • @physics-lab you don't need to get used to any new nomenclature, just use a better library like [boost multiprecision](https://www.boost.org/doc/libs/1_80_0/libs/multiprecision/doc/html/index.html). For example `cpp_bin_float<256> a[N], b[N], c[N]; c[n] = c[n-1]*a[n-1]/b[n];`. Just avoid C and use C++, which is much easier to use. – phuclv Oct 08 '22 at 07:24
  • So, you don't actually need the intermediate `c[n]` and `summ` values? You could for example easily calculate the quotient `sum_div_c[n] = summ[n]/c[n] = summ[n-1]/c[n] + 1 = sum_div_c[n-1]*c[n-1]/c[n] + 1 = sum_div_c[n-1] * b[n]/a[n-1] + 1.0` (see the sketch after these comments). – chtz Oct 08 '22 at 07:25
  • Since precision seems to be secondary... Have you considered using logarithms instead? – Fe2O3 Oct 08 '22 at 07:28
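
A minimal sketch of that running-quotient recurrence, assuming a[n] = n*n and b[n] = n as in the question. Note that, as in chtz's comment, the ratio it produces is against the running sum, not against the final summ:

#include <stdio.h>

int main( void ) {
    /* sum_div_c[n] = sum_div_c[n-1] * b[n]/a[n-1] + 1, starting from
       sum[1]/c[1] = (1 + 1)/1 = 2; this quantity stays near 1, so it
       never overflows. */
    double sum_div_c = 2.0;
    for( int n = 2; n <= 1760; n++ ) {
        double a_prev = (double)(n - 1) * (n - 1);   /* a[n-1] = (n-1)^2 */
        sum_div_c = sum_div_c * n / a_prev + 1.0;    /* b[n] = n         */
    }
    printf( "c[1760]/sum[1760] = %.17g\n", 1.0 / sum_div_c );
    return 0;
}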

2 Answers


NOTE: This does not do exactly what the OP wants. I'll leave this answer here in case someone has a similar problem.


As long as your final result and all initial values are not out of range, you can very often re-arrange your terms to avoid any overflow. In your case, if you actually just want to know c2[n] = c[n]/sum[n], you can re-write it as follows:

c2[n] = c[n]/sum[n] 
      = c[n]/(sum[n-1] + c[n])                        // def. of sum[n]
      = 1.0/(sum[n-1]/c[n] + 1.0)                
      = 1.0/(sum[n-1]/(c[n-1] * a[n-1] / b[n]) + 1.0) // def. of c[n]
      = 1.0/(sum[n-1]/c[n-1] * b[n] / a[n-1] + 1.0)
      = a[n-1]/(1/c2[n-1] * b[n]  + a[n-1])           // def. of c2[n-1]
      = (a[n-1]*c2[n-1]) / (b[n] + a[n-1]*c2[n-1])

Now in the final expression neither argument grows out of range, and in fact c2 slowly converges towards 1. If the values in your question are the actual values of a[n] and b[n], you may even find a closed-form expression for c2[n] (I did not check).
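
A minimal, self-contained sketch of that last line, assuming a[n] = n*n and b[n] = n as in the question; nothing here can overflow, since c2 stays within (0, 1]:

#include <stdio.h>

int main( void ) {
    double c2 = 0.5;                                  /* c2[1] = c[1]/(1 + c[1]) */
    for( int n = 2; n <= 1760; n++ ) {
        double a_prev = (double)(n - 1) * (n - 1);    /* a[n-1]                  */
        double b_n    = n;                            /* b[n]                    */
        c2 = (a_prev * c2) / (b_n + a_prev * c2);     /* the last line above     */
    }
    printf( "c2[1760] = %.17g\n", c2 );
    return 0;
}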

To check that the re-arrangement works, you can compare it with your original formula (godbolt-link, only printing the last values): https://godbolt.org/z/oW8KsdKK6

Btw: Unless you later need all values of c2 again, there is actually no need to store any intermediate value inside an array.

chtz
  • Actually, the first line is not equivalent to the original post, as I read it. Every c[n] should be divided by summ = 1+c[1]+c[2]+...+c[1760]. – physics-lab Oct 08 '22 at 20:22
  • I changed the initialization in the godbolt-link (now the loop starts at `n=2`, as in your question). – chtz Oct 08 '22 at 20:30
  • Thank you for the reply. By first line I mean the first line of your post, "c2[n] = c[n]/sum[n]". According to my original post where the code is provided, every c[n] should be divided by (1+c[1]+c[2]+...+c[1760]), but in your case this is not the case. To be equivalent, c2[n] should be c[n] divided by sum[1760] (but again, sum[1760] gives inf, as checked on godbolt). – physics-lab Oct 08 '22 at 20:35
  • That depends on how you initialize `sum[0]` or `sum[1]`. With `sum[1]=1+c[1]` (as in the godbolt-link) it should be equivalent to what you want to calculate. Maybe also to clarify: `sum[n]` is actually not required to calculate `c2[n]`, you only need the last line, which just depends on `c2[n-1]`, `a[n-1]` and `b[n]` -- all of them are finite. – chtz Oct 08 '22 at 20:40
  • Let us try for the first term: in your code `c2[1]= c[1]/sum[1] = c[1]/(1+c[1])`. In the OP, `c2[1]=c[1]/(1+c[1]+...+c[1760])`. So, this is not equivalent, right? – physics-lab Oct 08 '22 at 20:43
  • Ah OK, I did indeed misunderstand your problem then. – chtz Oct 08 '22 at 20:44
  • Indeed, in the OP the registered value of `summ` is the value taken after the first loop, and is thus the whole big sum, 1+c[1]+...+c[1760]. Of course, no problem! I was starting to think that maybe I did not understand my own problem. Thank you again for all the great help anyway! (I have edited my OP to make this clear.) – physics-lab Oct 08 '22 at 20:48

I ain't no mathematician. This is what I wrote, with the results below. Looks to me like the exponent, at least, is keeping up with your long double results using my feeble double only...

#include <stdio.h>
#include <math.h>

int main() {
    int n;
    double la[1800], lb[1800], lc[1800];

    la[1] = lb[1] = lc[1] = 0.0; /* log10(1) == 0, since a[1] = b[1] = c[1] = 1 */

    for( n = 2; n <= 1760; n++ ) {
        lb[n] = log10( n );                /* log10(b[n]), b[n] = n              */
        la[n] = lb[n] + lb[n];             /* log10(a[n]), a[n] = n*n            */
        lc[n] = lc[n-1] + la[n-1] - lb[n]; /* log10(c[n-1]*a[n-1]/b[n])          */

        printf( "%4d:  %.16lf\n", n, lc[n] );
    }
    return 0;
}
/* omitted for brevity */
1750:  4910.8357954121602000
1751:  4914.0785853634488000
1752:  4917.3216235537839000
1753:  4920.5649098413542000
1754:  4923.8084440845114000
1755:  4927.0522261417700000 <<=== Take note, please.
1756:  4930.2962558718036000
1757:  4933.5405331334487000
1758:  4936.7850577857016000
1759:  4940.0298296877190000
1760:  4943.2748486988194000

EDIT (Butterfly edition)
Below is a pretty simple iterative function involving one single-precision and one double-precision floating point value. The purpose is to demonstrate that iterative calculations are exceedingly sensitive to initial conditions. While it seems obvious that the extra bits of the double will "hold on", remaining closer to the results one would get with infinite precision, the compounding discrepancy between these two versions demonstrates that "demons lurking in small places" will likely remain hidden in the fantastically tiny gaps between finite representations of what is infinite.

Just a bit of fun for a rainy day.

#include <stdio.h>
#include <math.h>

int main() {
    float  fpi = 3.1415926535897932384626433832;
    double dpi = 3.1415926535897932384626433832;

    double thresh = 10e-8;

    for( int i = 0; i < 1000; i++ ) {
        fpi = fpi * 1.03f;  /* same multiplier for both, so only the  */
        dpi = dpi * 1.03f;  /* precision of the running value differs */
        double diff = fabs( dpi - fpi );

        if( diff > thresh ) {
            printf( "%3d: %25.16lf\n", i, diff );
            thresh *= 10.0;
        }
    }
    return 0;
}
  8:        0.0000001229991486
 35:        0.0000010704333473
 90:        0.0000100210180918
192:        0.0001092634900033
229:        0.0010121794607585
312:        0.0100316228017618
367:        0.1002719746902585
453:        1.0056506423279643
520:       10.2658853083848950
609:      103.8011477291584000
667:     1073.9984381198883000
736:    10288.9632129669190000
807:   101081.5514678955100000
886:  1001512.2135009766000000
966: 10473883.3271484370000000
Fe2O3
  • Using logarithms is an option, though it makes calculating sums more complicated -- but not impossible, e.g. `lsum[n] = log(sum[n]) = log(sum[n-1] + c[n]) = log(c[n]) + log(sum[n-1]/c[n] + 1) = lc[n] + log(exp(lsum[n-1] - lc[n]) + 1)` (using natural log instead of log10; see the first sketch after these comments). – chtz Oct 08 '22 at 08:34
  • @chtz Gonna havta look at that in the morning. Right now, my brain oozes out my ears with anything bigger than 3. In the end, it seems like talking about the bank balance of Gates or Bezos... "Give or take a few hundred million..." while the rest of us sigh over the costs of groceries rising. I don't see (or understand) the purpose of the OP's `summ`... It's just a very, very, very big number... approximated... – Fe2O3 Oct 08 '22 at 08:42
  • Yes, the overall goal of the OP is not clear to me, either. – chtz Oct 08 '22 at 08:44
  • @Fe2O3 Thank you for your proposition. I will try this log idea and come back if it solves the problem. Please note that I have edited my post so that (hopefully) to make more clear the use of summ. – physics-lab Oct 08 '22 at 15:29
  • @chtz I have edited the OP to hopefully make the use of summ clearer. Hope it helps to clarify your previous message too. – physics-lab Oct 08 '22 at 15:31
  • @Fe2O3 the solution based on log works! I should vote this answer. – physics-lab Oct 08 '22 at 19:51
  • @chtz thank you for this idea to deal with the summ (lsum[n]); the summ in the OP is then given by the last term of lsum[n], that is lsum[1760]. With that, it works and gives me back the original solution. – physics-lab Oct 08 '22 at 19:54
  • @physics-lab "should"? Not sure what that means... – Fe2O3 Oct 08 '22 at 20:12
  • @Fe2O3 I mean vote your answer (but apparently need more "reputation") – physics-lab Oct 08 '22 at 20:16
  • @Fe2O3 I am still curious to know if there is an alternative method to using log that gives back the original solutions (such as the one proposed by chtz, or arbitrary-precision methods, like GMP). – physics-lab Oct 08 '22 at 20:27
  • You can save some computation time by just scaling down `summ` (and `c[n]`) by a power of two every time it is about to overflow. You also need to keep track of the current power, and apply it again when doing the final division. I may write a new answer based on that idea later (see the second sketch after these comments). – chtz Oct 08 '22 at 20:56
  • @Fe2O3 Thank you for the insights. First, sometimes the demon is hidden in the details (hence the need for precision); second, I am actually not interested in such intermediate big values here, but in the ratio of them (`c2[n]` above) :) – physics-lab Oct 08 '22 at 22:13
  • @physics-lab It's your project, so I'll simply wish you success finding a satisfying solution. But, I must offer this from the 1980s (a book title): "Don't Sweat the Small Stuff"... `:-)` – Fe2O3 Oct 08 '22 at 22:16
  • @physics-lab The closer you get to finding the demon hidden the details, the murkier become the waters... Me? I prefer to hunt butterflies... – Fe2O3 Oct 10 '22 at 04:44
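
For completeness, a minimal sketch combining this answer's log recurrence with the log-sum formula from chtz's comment above (switched to natural logs). The final exp(lc[n] - lsum) step that recovers c2[n], and all variable names, are illustrative additions; for small n the recovered c2[n] may underflow to 0, which is harmless here.

#include <stdio.h>
#include <math.h>

#define N 1760

int main( void ) {
    static double lc[N + 1];                  /* lc[n] = log(c[n])            */
    double lsum;                              /* log(1 + c[1] + ... + c[n])   */

    lc[1] = 0.0;                              /* c[1] = 1                     */
    lsum  = log( 2.0 );                       /* summ starts at 1 + c[1] = 2  */

    for( int n = 2; n <= N; n++ ) {
        double la_prev = 2.0 * log( n - 1.0 );    /* log(a[n-1]), a[n] = n*n  */
        double lb_n    = log( (double)n );        /* log(b[n]),   b[n] = n    */
        lc[n] = lc[n-1] + la_prev - lb_n;         /* log(c[n-1]*a[n-1]/b[n])  */
        /* lsum[n] = lc[n] + log(exp(lsum[n-1] - lc[n]) + 1)                  */
        lsum = lc[n] + log1p( exp( lsum - lc[n] ) );
    }

    for( int n = N - 4; n <= N; n++ )             /* the last few ratios      */
        printf( "%4d:  c2 = %.10e\n", n, exp( lc[n] - lsum ) );
    return 0;
}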
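
And a minimal sketch of the power-of-two rescaling idea from chtz's last comment, using frexpl/ldexpl to split off and re-apply binary exponents. The bookkeeping scheme (one stored exponent per c[n], plus one for summ) is just one illustrative way to do it; a[n] = n*n and b[n] = n are assumed as in the question.

#include <stdio.h>
#include <math.h>

#define N 1760

int main( void ) {
    static long double c[N + 1];      /* scaled mantissas                 */
    static int e[N + 1];              /* true c[n] = c[n] * 2^e[n]        */
    long double summ = 2.0L;          /* 1 + c[1]                         */
    int es = 0, k;                    /* true summ = summ * 2^es          */

    c[1] = 1.0L; e[1] = 0;

    for( int n = 2; n <= N; n++ ) {
        long double a_prev = (long double)(n - 1) * (n - 1);  /* a[n-1]   */
        c[n] = c[n-1] * a_prev / n;               /* c[n-1]*a[n-1]/b[n]   */
        c[n] = frexpl( c[n], &k );                /* pull out exponent    */
        e[n] = e[n-1] + k;
        summ += ldexpl( c[n], e[n] - es );        /* align scales, add    */
        summ  = frexpl( summ, &k );
        es   += k;
    }

    for( int n = N - 4; n <= N; n++ )             /* c2[n] = c[n]/summ    */
        printf( "%4d:  c2 = %.10Le\n", n, ldexpl( c[n] / summ, e[n] - es ) );
    return 0;
}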