
From Computer Systems: a Programmer's Perspective:

With single-precision floating point

  • the expression (3.14f+1e10f)-1e10f evaluates to 0.0: the value 3.14 is lost due to rounding.

  • the expression (1e20f*1e20f)*1e-20f evaluates to +∞ , while 1e20f*(1e20f*1e-20f) evaluates to 1e20f.

  • How can I detect loss of precision due to rounding in both floating-point addition and multiplication?

  • What is the relation and difference between underflow and the problem that I described? Is underflow only a special case of loss of precision due to rounding, where a result is rounded to zero?

Thanks.

chux - Reinstate Monica
Tim
  • By the time that the significands have been aligned for the addition, there is nothing of the 3.14 representable. I don't think it's to do with rounding. – Weather Vane Oct 10 '20 at 17:51
  • Note that 3.14 is "lost" in the very first operation: https://godbolt.org/z/Y4GTcs – Bob__ Oct 10 '20 at 17:51
  • Have you read this yet: https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html ? – Bob__ Oct 10 '20 at 17:56
  • *"the expression (3.14+1e10)-1e10 evaluates to 0.0"* That statement is not true in general. Within a single expression, the compiler is allowed to use additional precision to hold intermediate results. And in fact, when I print the results of that expression, the answer is 3.139999. – user3386109 Oct 10 '20 at 19:04
  • As for the question, *"How can I detect lost of precision"*, my response is "You don't". Instead you design your software to avoid the issue. – user3386109 Oct 10 '20 at 19:09
  • "Is underflow only a special case of lost of precision due to rounding," --> C has "The result underflows if the magnitude of the mathematical result is so small that the mathematical result cannot be represented, without extraordinary roundoff error, in an object of the specified type." You may like [When does underflow occur?](https://stackoverflow.com/q/42277132/2410359). – chux - Reinstate Monica Oct 11 '20 at 04:34
  • For an answer to your first question see my answer [here](https://stackoverflow.com/a/56600103/2439725) – wim Oct 11 '20 at 08:08
  • Cross-posted: https://stackoverflow.com/q/64296463/781723, https://scicomp.stackexchange.com/q/36079/4274. To anyone who finds this, you can find additional answers on scicomp. Please [do not post the same question on multiple sites](https://meta.stackexchange.com/q/64068). – D.W. Nov 29 '20 at 02:50
  • Rather than [complaining](https://cstheory.stackexchange.com/questions/47907/do-connection-and-message-coexist-in-csp-pi-calculus) about imagined abuse and persecution, **stop doing** the things you keep getting told to stop doing and people will stop telling you to stop doing them. If you don't like being reminded not to cross-post, it's incredibly simple: *don't cross-post*. You're like a speeding driver getting mad at a traffic cop rather than *slowing down*. – jonrsharpe Nov 30 '20 at 13:12

1 Answer

While addition and multiplication of real numbers are associative in mathematics, they are not associative when performed on floating-point types such as float, because of their limited precision and range.

So the order matters.

Considering the examples, the number 10000000003.14 can't be exactly represented as a 32-bit float, so the result of (3.14f + 1e10f) would be equal to 1e10f, which is the closest representable number. Of course, 3.14f + (1e10f - 1e10f) would yield 3.14f instead.

Note that I used the f suffix, because in C the expression (3.14+1e10)-1e10 involves double literals, so the result would indeed be close to 3.14 (more likely something like 3.139999).
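As a minimal sketch of the first example, the two groupings can be compared directly. The helper names are hypothetical, and the `volatile` qualifiers are a precaution I am assuming is needed: they force each intermediate to be stored as a `float`, since a compiler may otherwise keep extra precision inside an expression.

```c
/* Hypothetical helper: computes (small + big) - big in single precision.
   volatile forces every intermediate to be rounded to float. */
float add_then_subtract(float small, float big)
{
    volatile float sum = small + big;   /* small can be absorbed here */
    volatile float diff = sum - big;
    return diff;
}

/* Hypothetical helper: computes small + (big - big) in single precision. */
float subtract_then_add(float small, float big)
{
    volatile float zero = big - big;    /* exactly 0 for finite big */
    volatile float sum = small + zero;
    return sum;
}
```

With the values from the question, `add_then_subtract(3.14f, 1e10f)` gives `0.0f`, while `subtract_then_add(3.14f, 1e10f)` gives back `3.14f`.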

Something similar happens in the second example: 1e20f * 1e20f is already beyond the range of float (but not of double), so it overflows to +∞ and the subsequent multiplication cannot bring it back. In the other expression, (1e20f * 1e-20f) is performed first and has a well-defined result (1), so the subsequent multiplication yields the correct answer.
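The same kind of sketch works for the multiplication example. Again the function names are hypothetical, and `volatile` is only there to make sure each partial product is actually rounded to `float` before it is used.

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical helper: computes (a * b) * c in single precision. */
float mul_left_first(float a, float b, float c)
{
    volatile float ab = a * b;      /* may overflow to infinity */
    volatile float r = ab * c;
    return r;
}

/* Hypothetical helper: computes a * (b * c) in single precision. */
float mul_right_first(float a, float b, float c)
{
    volatile float bc = b * c;
    volatile float r = a * bc;
    return r;
}

/* True when the left-to-right grouping ends up at +/- infinity. */
bool left_grouping_overflows(float a, float b, float c)
{
    return isinf(mul_left_first(a, b, c));
}
```

For `a = 1e20f`, `b = 1e20f`, `c = 1e-20f`, the first grouping ends at infinity, while the second returns `1e20f`.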

In practice, there are some precautions you can adopt:

  • Use a wider type. double is a best fit for most applications, unless there are other requirements.
  • Reorder the operations, if possible. For example, if you have to add many terms and you know that some of them are smaller than others, start adding those, then the others. Avoid subtraction of numbers of the same order of magnitude. In general, there may be a more accurate way to evaluate an algebraic expression than the naive one (e.g. Horner's method for polynomial evaluation).
  • If you have some sort of knowledge of the problem domain, you may already know which part of the computation may have problematic values and check if those are greater (or lower) than some limits, before performing the calculation.
  • Check the results as soon as possible. There's no point in continuing a calculation when you already have an infinite value or a NaN, or keep iterating when your target value isn't modified at all.
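The "check the results" advice can be sketched with a few small predicates. None of these helpers come from the answer: the names are hypothetical, `add_rounding_error` uses the classic TwoSum error-free transformation (a technique I am bringing in, valid for round-to-nearest when nothing overflows), and `mul_is_exact` assumes `double` is an IEEE 754 binary64 type, so that the product of two `float` values is always exact in `double`.

```c
#include <math.h>
#include <stdbool.h>

/* True when adding a non-zero y to x leaves x unchanged,
   i.e. y is completely "lost" in the sum. */
bool addend_is_absorbed(float x, float y)
{
    volatile float sum = x + y;
    return y != 0.0f && sum == x;
}

/* TwoSum: returns the exact rounding error of the float addition a + b.
   A non-zero return value means the sum was rounded.
   volatile keeps every step in float precision. */
float add_rounding_error(float a, float b)
{
    volatile float s  = a + b;
    volatile float bv = s - a;    /* the part of b that made it into s */
    volatile float av = s - bv;   /* the part of a that made it into s */
    volatile float db = b - bv;
    volatile float da = a - av;
    return da + db;
}

/* True when the float product a * b is finite and was not rounded.
   Assumption: double has at least 48 significand bits (binary64 has 53),
   so (double)a * (double)b is the exact product. */
bool mul_is_exact(float a, float b)
{
    volatile float p = a * b;
    double exact = (double)a * (double)b;
    return isfinite(p) && (double)p == exact;
}
```

For the question's first example, `add_rounding_error(3.14f, 1e10f)` returns `3.14f`, which is exactly the amount that was dropped from the sum.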
Bob__
  • Is underflow only a special case of lost of precision due to rounding, where a result is rounded to zero? Thanks. – Tim Oct 10 '20 at 19:29
  • @Tim No, underflow happens when the resulting value is too small to be represented by the type, it's outside the valid range. Rounding is a matter of precision, it involves the number of bits reserved in the type for the mantissa as opposed to the range of the exponent. A special case are [subnormal numbers](https://en.wikipedia.org/wiki/Denormal_number). – Bob__ Oct 10 '20 at 19:41
  • I am not sure I understand their difference. – Tim Oct 10 '20 at 19:43
  • @Tim To my knowledge, an underflow is a negative overflow of the exponent, rounding errors are due to limited number of bits reserved for the mantissa. – Bob__ Oct 10 '20 at 20:07
  • I asked about how to detect if the problem happens or not, not how to avoid the problem from happening. I am looking for answers like https://stackoverflow.com/questions/15655070/how-to-detect-double-precision-floating-point-overflow-and-underflow/15655590#15655590, which is for detecting floating point overflow and underflow. But you said underflow and the problem in my post are different problems. – Tim Oct 10 '20 at 22:33
  • @Tim In my last point I suggested checking the result. In my first comment to your question I linked a snippet that shows exactly one of those possible checks for a sum (if `x + y == x` when both `x` and `y` are non-zero, `y` is too small and will be "lost"). The answer you mentioned says *"you can just do the operations, then use `isfinite` or `isinf` on the results"*; mine also mentions NaN. An alternative, if supported by your compiler, is the [macro constants in `<fenv.h>`](https://en.cppreference.com/w/c/numeric/fenv/FE_exceptions). – Bob__ Oct 10 '20 at 23:07
  • Thanks. " if you have to add many terms and you know that some of them are smaller than others, start adding those, then the others. Avoid subtraction of numbers of the same order of magnitude. " Why avoid subtraction of numbers of the same order of magnitude? (whereas add numbers of the same order of magnitude)? – Tim Oct 17 '20 at 00:06
  • @Tim It's due to the [loss of significance](https://en.m.wikipedia.org/wiki/Loss_of_significance#:~:text=It%20occurs%20when%20an%20operation,the%20result%20is%20reduced%20unacceptably.), which behaves differently for the two operations. – Bob__ Oct 17 '20 at 07:23