
The following is a snippet from a C program.

Can someone please tell me why these two statements give different values even though the expressions are mathematically equivalent:

printf("    WARNING: sum = %e   at   time index = %d\n",C[21]*X[0] + C[22]*X[1] + C[23]*X[2] + C[24]*X[3] + C[25]*X[4] + C[26]*X[5] + C[27]*X[6],TIME_INDEX);
printf("    WARNING: sum = %e   at   time index = %d\n",C[21]*X[0] + C[27]*X[6] + C[22]*X[1] + C[26]*X[5] + C[23]*X[2] + C[25]*X[4] + C[24]*X[3],TIME_INDEX);

Here's the printout:

WARNING: sum = -5.551115e-17   at   time index = 1
WARNING: sum = 0.000000e+00   at   time index = 1

The answer should be EXACTLY equal to zero. I don't understand why the order of the terms matters in the summation.

EDIT: I should say that I explicitly set X[0] = X[6], X[1] = X[5], and X[2] = X[4] before the print statements.

The central difference coefficients are C[21] = -1.0/60.0; C[22] = 9.0/60.0; C[23] = -45.0/60.0; C[24] = 0.0/60.0; C[25] = 45.0/60.0; C[26] = -9.0/60.0; C[27] = 1.0/60.0;

ThatsRightJack
  • There is the possibility of an overflow/underflow event. Please post code that cleanly compiles and still shows the problem. Of interest: what is the type of the arrays C[] and X[], and how big are those arrays? – user3629249 Sep 17 '16 at 05:26
  • Possible duplicate of [is floating point math broken](https://stackoverflow.com/questions/588004). – user3386109 Sep 17 '16 at 05:31
  • @user3629249 They are both `double` types. The array for C contains 49 values and the array for X is dynamically allocated and problem dependent. – ThatsRightJack Sep 17 '16 at 05:32
  • Possible duplicate of [Is floating point math broken?](http://stackoverflow.com/questions/588004/is-floating-point-math-broken) – Bo Persson Sep 17 '16 at 11:58

1 Answer


The cases differ because the machine performs each operation individually, with limited precision: at every step, the intermediate result is rounded to the nearest representable value.

In one ordering the equal-and-opposite terms cancelled exactly; in the other, an intermediate sum was probably large enough in magnitude that one of the smaller terms couldn't be added exactly. So you end up with a difference of about 6*10^-17, or 2^-54, which matches the effective 53 bits of mantissa in IEEE 754 64-bit floating point. Your error is in fact about as small as it could possibly be.

Note that this doesn't require large numbers, only values that can't be represented exactly in binary with the available width, such as your /60 coefficients: 60 = 2²·3·5, and both 1/3 and 1/5 need infinitely many binary digits (1/3 does in decimal, too).

This is fundamental to a proper understanding of numerical analysis. You can sometimes mitigate it by using a more appropriate representation, such as arbitrary precision (bignum), which lets you store more digits than the processor handles natively, or rational forms, when all your values are rational. This is why databases tend to have a separate decimal type. However, each has its own limits; of these examples, 1/60 fits only in a rational representation.

Yann Vernier
  • Good to know. So for the sake of ensuring that sum is exactly equal to zero, what is the common practice from a programming point of view? Do I just need to make sure the order is consistent such that adjacent terms cancel? Is it safe to do a "if sum < eps then sum = 0.0"? – ThatsRightJack Sep 17 '16 at 05:44
  • Safe is arguable. You should understand how large the error might have grown and the risks of misinterpreting a value; in this case, you have that they're practically cancelling out. In general, exact comparison of floating point values is unreliable. It's actually harder to know if a value near zero is important or not, because it depends on the magnitude of the inputs; that might have been a significant result of values around 2^-55 just as easily as an insignificant result of values around 1. – Yann Vernier Sep 17 '16 at 05:54