Double precision computations

Question

I am trying to compute numerically (using analytical formulae) the values of the following sequence of integrals:

     I(k,t) = int_0^{N/2-1} u^k e^(-i*u*delta*t) du

where "i" is the imaginary unit. For small k, this integral can be computed by hand, but for larger k it is more convenient to notice that there is an iterative relationship between the terms of sequence that can be derived by integration by parts. This is implemented below by the function i1.

void i1(int N, double t, double delta, double complex ** result){
unsigned int k;
(*result)=(double complex*)malloc(sizeof(double complex)*N);

if(t==0){

    for(k=0;k<N;k++){
        (*result)[k]=pow(N-2,k+1)/(pow(2,k+1)*(k+1));
    }

}
else{
    (*result)[0]=2/(delta*t)*sin(delta*(N-2)*t/4)*cexp(-I*(N-2)*t*delta/4);
    for(k=1;k<N;k++){
        (*result)[k]=I/(delta*t)*(pow(N-2,k)/pow(2,k)*cexp(-I*delta*(N-2)*t/2)-k*(*result)[k-1]);
    }

}
}

The problem is that in my case t is very small (1e-12) and delta is typically around 1e6. When testing in the case N=4, I noticed some weird results appearing for k=3, namely the results where suddenly very large, much larger than they should be as the norm of an integral is always smaller than the integral of the norm, the results of the test are printed below:

I1(0,1.0000e-12)=1.0000000000e+00+-5.0000000000e-07I 
 Norm=1.0000000000e+00 
 compare = 1.0000000000e+00

 I1(1,1.0000e-12)=5.0000000000e-01+-3.3328895199e-07I 
 Norm=5.0000000000e-01 
 compare = 5.0000000000e-01

 I1(2,1.0000e-12)=3.3342209601e-01+-2.5013324745e-07I 
 Norm=3.3342209601e-01 
 compare = 3.3333333333e-01

 I1(3,1.0000e-12)=2.4960025766e-01+-2.6628804517e+02I 
 Norm=2.6628816215e+02 
 compare = 2.5000000000e-01

k=3 not being particularly big, I computed the value of the integral by hand, but I got using the calculator and the analytical formula I obtained the same larger than expected results for the imaginary part in the case. I also realized that if I changed the order of the terms the result changed. It therefore appears to be a problem with precision, as in the iterative process there is a subtraction of very large but almost equal terms, and following what was said on this thread: How to divide tiny double precision numbers correctly without precision errors?, this can cause small errors to be amplified. However I am finding it difficult to see how to resolve the issue in my case, and was also wondering if someone could briefly explain why this occurs?

For starters, [don't cast the result of `malloc` in C](http://stackoverflow.com/questions/605845/do-i-cast-the-result-of-malloc). And as for your problem, are you sure your calculations won't exceed the range of `double`? You might want to try using `long double` instead, or [The GNU Multiple Precision Arithmetic Library](https://gmplib.org/)? — Some programmer dude, Jul 16 '14 at 12:56
The quick way would be to try out using 'long doubles' if your compiler supports them — IdeaHat, Jul 16 '14 at 12:57
Now I get ridiculously small results (10^-320) and a 'nan' in the real part of the k=0 case — Jack, Jul 16 '14 at 13:09
Sorry I forgot to put the L suffix on the test input parameters, I get much more plausible results now. (Even if the norm of the result is slightly larger than the maximum limit) However, this code is a subroutine of a much larger code that uses predominantly double precision, can I convert from double to long double ? — Jack, Jul 16 '14 at 13:20
I was shown [this link](http://www.google.com/url?q=http%3A%2F%2Fdocs.oracle.com%2Fcd%2FE19957-01%2F806-3568%2Fncg_goldberg.html&sa=D&sntz=1&usg=AFQjCNG1VFBdZss1XVEH3h822N7snsZ0Lg) on this site, and I now firmly believe anyone attempting numerical calculation with any kind of floating point precision should read it. — kwierman, Jul 16 '14 at 14:16

score 1 · Accepted Answer · answered Jul 16 '14 at 14:11

You have to be very careful with floating point addition and subtraction.

Suppose a decimal floating point with 6 digits precision (to keep things simple). Adding/subtracting a small number to/from a large one discards some or even all of the smaller. So:

5.00000E+9 + 1.45678E+4 is: 5.00000 + 0.000014 E+9 = 5.00001E+9

which is as good as it gets. But if you add a series of small numbers to a large one, then you may be better off adding the small numbers together first, and adding the result to the large number.

Subtraction of similar size numbers is another way of losing precision. So:

5.12346E+4 - 5.12345E+4 = 1.00000E-1

Now, the two numbers can be at best their real value +/- half the least significant digit, in this case 0.5E-1 -- which is a relative error of about +/-1E-6. The result of the subtraction is still +/- 0.5E-1 (we cannot reduce the error !), which is a relative error of +/- 0.5 !!!

Multiplication and division are much better behaved -- until you over-/under-flow.

But as soon as you are doing anything iterative with add/subtract, keep saying (loudly) to yourself: floating point numbers are not (entirely) like real numbers.

This is a nice answer that explains well the situation, but how do I work around this problem with addition and subtraction? — Jack, Jul 16 '14 at 14:23
There is no simple answer to that :-( You can throw more precision at the problem, and hope that it goes away. If that doesn't work, you are stuck with having to analyse your algorithm, and armed with an understanding of the limitations of add/subtract, change things round so that the relative error at each stage is acceptable. For iterative algorithms this may mean deferring some parts of the calculation until the last possible moment. — , Jul 16 '14 at 14:38

Double precision computations

1 Answers1