
I don't want to introduce floating point where an inexact value would be a disaster, so I have a couple of questions about when you can actually use it safely.

Are they exact for integers, as long as you don't overflow the number of significant digits? Are these two tests always true:

double d = 2.0;
if (d + 3.0 == 5.0) ...
if (d * 3.0 == 6.0) ...

What math functions can you rely on? Are these tests always true:

#include <math.h>

double d = 100.0;
if (log10(d) == 2.0) ...
if (pow(d, 2.0) == 10000.0) ...
if (sqrt(d) == 10.0) ...

How about this:

int v = ...;
if (log2((double) v) > 16.0) ... /* gonna need more than 16 bits to store v */
if (log((double) v) / log(2.0) > 16.0) ... /* C89 */
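
(For comparison, a hypothetical integer-only sketch of the same test, which sidesteps the rounding question entirely:)

unsigned u = (unsigned) v;
int bits = 0;
while (u) { u >>= 1; bits++; } /* position of the highest set bit, plus one */
if (bits > 16) ... /* same test, no rounding to worry about */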

I guess you can summarize this question as:

1. Can floating point types hold the exact value of all integers up to the number of their significant digits in float.h?
2. Do all floating point operators and functions guarantee that the result is the closest to the actual mathematical result?

potrzebie
  • There are some very good articles on floating point in Wikipedia and elsewhere on the net. You should read some of them. – Hot Licks Sep 27 '14 at 03:14
  • Integer values are stored exactly, up to some maximum. The first block will evaluate exactly the way you'd expect. I don't know the answers to your other questions. P.S. The maximum for integers will be slightly less than the number of significant digits. – Mark Ransom Sep 27 '14 at 03:17
  • @MarkRansom - Integers are exact **in IEEE float**. This may not be true of other float formats, and C does not specify the float format to be used. – Hot Licks Sep 27 '14 at 20:36
  • Read http://floating-point-gui.de/ – Basile Starynkevitch Sep 28 '14 at 16:24
  • @HotLicks That's why I worded it the way I did, without specifying an exact value or number of bits. Are you aware of *any* floating point format, no matter how obscure, that can't hold exact integers? – Mark Ransom Sep 28 '14 at 20:26
  • @MarkRansom - My recollection is that the old IBM 70xx 36-bit machines didn't do exact integers. Though that's been 40 years ago, so my memory isn't real reliable. – Hot Licks Sep 28 '14 at 22:07
  • You **must read http://floating-point-gui.de/** – Basile Starynkevitch Sep 30 '14 at 05:37

2 Answers


I too find incorrect results distasteful.

On common hardware, you can rely on +, -, *, /, and sqrt working and delivering the correctly-rounded result. That is, they deliver the floating-point number closest to the sum, difference, product, quotient, or square root of their argument or arguments.
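
A minimal sketch, assuming IEEE 754 doubles, of comparisons that are safe under correct rounding (the operands and the mathematical results are all exactly representable):

#include <math.h>
#include <stdio.h>

int main(void) {
    double d = 2.0;
    /* Small integers are exact, and +, *, sqrt are correctly rounded,
       so an exact mathematical result comes back exactly. */
    printf("%d\n", d + 3.0 == 5.0);      /* 1 */
    printf("%d\n", d * 3.0 == 6.0);      /* 1 */
    printf("%d\n", sqrt(100.0) == 10.0); /* 1 */
    return 0;
}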

Some library functions, notably log2 and log10 and exp2 and exp10, traditionally have terrible implementations that are not even faithfully-rounded. Faithfully-rounded means that a function delivers one of the two floating-point numbers bracketing the exact result. Most modern pow implementations have similar issues. Lots of these functions will even blow exact cases like log10(10000) and pow(7, 2). Thus equality comparisons involving these functions, even in exact cases, are asking for trouble.
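
As a hedged sketch of a safer way to run such a test: rather than comparing log10's output for equality, round it to the nearest integer and verify with exact integer arithmetic (nearbyint is C99; is_power_of_ten is a hypothetical helper name):

#include <math.h>

/* May be 0 on a libm whose log10 is off by an ulp, even though
   log10(1000) is mathematically exact. */
int risky(void) { return log10(1000.0) == 3.0; }

/* Round to the nearest integer, then check by exact integer math.
   Assumes n > 0 and that 10^k fits in a long. */
int is_power_of_ten(long n) {
    int k = (int) nearbyint(log10((double) n));
    long p = 1;
    for (int i = 0; i < k; i++) p *= 10;
    return p == n;
}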

sin, cos, tan, atan, exp, and log have faithfully-rounded implementations on every platform I've recently encountered. In the bad old days, on processors using the x87 FPU to evaluate sin, cos, and tan, you would get horribly wrong outputs for largish inputs and you'd get the input back for larger inputs. CRlibm has correctly-rounded implementations; these are not mainstream because, I'm told, they've got rather nastier worst cases than the traditional faithfully-rounded implementations.
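
A sketch of why equality tests against trig functions are hopeless even with a perfect libm, assuming a math.h that defines M_PI (POSIX, not strict ISO C): M_PI is only the double nearest pi, so even a correctly-rounded sin cannot return zero there:

#include <math.h>
#include <stdio.h>

int main(void) {
    /* sin(M_PI) is the sine of a point near pi, not of pi itself;
       a good libm prints roughly 1.2246e-16, not 0. */
    printf("%.17g\n", sin(M_PI));
    printf("%d\n", sin(M_PI) == 0.0); /* 0 */
    return 0;
}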

Things like copysign and nextafter and isfinite all work correctly. ceil and floor and rint and friends always deliver the exact result. fmod and friends do too. frexp and friends work. fmin and fmax work.
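
A small sketch, assuming IEEE 754 doubles, exercising a few of these; every comparison holds because the functions are exact:

#include <math.h>
#include <stdio.h>

int main(void) {
    int e;
    double m = frexp(48.0, &e);            /* 48 = 0.75 * 2^6, exactly */
    printf("%d\n", floor(2.7) == 2.0);     /* 1 */
    printf("%d\n", fmod(7.0, 3.0) == 1.0); /* 1: fmod is exact */
    printf("%d\n", m == 0.75 && e == 6);   /* 1 */
    printf("%d\n", fmax(2.0, 3.0) == 3.0); /* 1 */
    return 0;
}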

Someone thought it would be a brilliant idea to make fma(x,y,z) compute x*y+z by computing x*y rounded to a double, then adding z and rounding the result to a double. You can find this behaviour on modern platforms. It's stupid and I hate it.
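
A hedged probe for that behaviour, assuming IEEE 754 doubles: a genuinely fused fma recovers the rounding error of the product, while the round-twice version returns zero. (Compile with contraction disabled, e.g. -ffp-contract=off, so a * a isn't itself turned into an fma.)

#include <math.h>
#include <stdio.h>

int main(void) {
    double a = 1.0 + ldexp(1.0, -27); /* a*a = 1 + 2^-26 + 2^-54 exactly */
    double p = a * a;                 /* rounds away the 2^-54 term */
    double err = fma(a, a, -p);       /* fused: 2^-54; round-twice: 0 */
    printf(err != 0.0 ? "fused fma\n" : "fma rounds twice\n");
    return 0;
}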

I have no experience with the hyperbolic trig, gamma, or Bessel functions in my C library.

I should also mention that popular compilers targeting 32-bit x86 play by a different, broken, set of rules. Since the x87 is the only supported floating-point instruction set and all x87 arithmetic is done with an extended exponent, computations that would induce an underflow or overflow in double precision may fail to underflow or overflow. Furthermore, since the x87 also by default uses an extended significand, you may not get the results you're looking for. Worse still, compilers will sometimes spill intermediate results to variables of lower precision, so you can't even rely on your calculations with doubles being done in extended precision. (Java has a trick for doing 64-bit math with 80-bit registers, but it is quite expensive.)

I would recommend sticking to arithmetic on long doubles if you're targeting 32-bit x86. Compilers are supposed to set FLT_EVAL_METHOD to an appropriate value, but I do not know if this is done universally.
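
A quick sketch of checking what your compiler claims; FLT_EVAL_METHOD comes from float.h in C99:

#include <float.h>
#include <stdio.h>

int main(void) {
    /* 0: each operation evaluates in its own type (e.g. SSE2),
       2: everything evaluates in long double (classic x87),
       -1: indeterminate. */
    printf("FLT_EVAL_METHOD = %d\n", (int) FLT_EVAL_METHOD);
    return 0;
}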

tmyklebu
  • For `sin`, `cos`, it's not unusual for results to be completely unrelated to the true value for large inputs. For example, Intel's x87 `fcos` and `fsin` instructions are "accurate" only if you pretend that pi is exactly equal to the 66-bit approximation that Intel uses. – Mark Dickinson Sep 27 '14 at 12:03
  • @MarkDickinson: Thanks. I must be blocking such memories out. What horrible instructions. – tmyklebu Sep 27 '14 at 15:21
  • @MarkDickinson: In what non-contrived scenarios would code that asks for sin(x) where |x| > 3.0 not be equally happy with the sine of any value which was within a quarter-LSB of x? Most code that asks for sin(x) really wants the angle of sin(kπy) for some rational-number constant k and some variable y. Accurate calculation requires preemptive argument reduction; if the sin(x) function is passed a value larger than 3, that would seem to imply to me that calling code has already accepted needless rounding errors, and would be unlikely care about sub-LSB precision. – supercat Sep 29 '14 at 22:40
  • @tmyklebu: What's horrible is the use of radians rather than quadrants or circles as the unit of angle for trig functions. Complaining that Intel doesn't report a large enough value for the sine of (180.0*(Math.PI/180.0)) when the sine of 180 degrees should be *ZERO* is silly. – supercat Sep 29 '14 at 22:47
  • @supercat: Huh? Radians are a great way to measure angles. You're welcome to implement your own `sindeg` or whathaveyou if you really want such a thing. Thing is, doing argument reduction for radian trig functions correctly isn't very hard. – tmyklebu Sep 30 '14 at 01:23
  • @tmyklebu: In what non-contrived context would code actually want the sine of 884279719003555/281474976710656? I can think of a lot of cases where code wants the sine of π but can't ask for it and thus asks for the sine of 884279719003555/281474976710656 instead, *tolerating* the fact that while the sine of the angle the code wants is *zero*, the sine of 884279719003555/281474976710656 isn't. In what non-contrived cases would a sin(2πx) primitive not yield results that were at least as fast and just as precise as those using a sin(x) primitive? – supercat Sep 30 '14 at 12:31
  • @supercat: You're proposing that `sin` figure out what you "actually want" and trying to do that instead of just doing what it's supposed to do? – tmyklebu Sep 30 '14 at 15:31
  • @tmyklebu: Code which wants to accurately *request* the sine or cosine of an angle expressed in any measure other than degrees must first perform argument reduction into the range 0 to 90 degrees and then scale the result by a suitable multiple of pi, yielding a result in the range +/- pi/2. I would consider the fact that sin(x) gets passed a value outside the range +/-2 to be prima facie evidence that the programmer was more interested in speed and/or simplicity than precision, and would thus prefer a fast implementation that returned a number that was within +/-2^-53 (absolute magnitude)... – supercat Sep 30 '14 at 17:02
  • @supercat: Please never design a math library. – tmyklebu Sep 30 '14 at 17:07
  • ...of being right than a slower implementation that returned a value which was more "accurate". I appreciate that *when sin(x) is passed a small angle*, finer precision in the result may be important, but I can't think of any *non-contrived* examples where an angle greater than 2 would be computed with less than +/-2^-53 worth of non-correctable rounding error. – supercat Sep 30 '14 at 17:07
  • @tmyklebu: Can you offer me any non-contrived situations where code would compute a value x outside the range +/-2 for which it needs to compute a sine, and would need to know the value of the sine to a precision greater than that with which x itself would be able to represent pi? For someone asked to measure the distance between two hastily-drawn pencil lines roughly 1cm apart, a ruler would generally a more appropriate tool than a micrometer even if the latter would nominally be "more accurate". – supercat Sep 30 '14 at 17:15
  • @supercat: I have no idea what you mean by "non-contrived." Evaluating the complex exponential function comes to mind, though. Figuring out your `y`-coordinate after you've walked around the circle for a total length of `L` does too. `sin` didn't just come out of nowhere; it's a very natural function with a very natural interpretation and a very natural power series. – tmyklebu Sep 30 '14 at 17:16
  • @tmyklebu: I know why radians appear in mathematics. Suppose, however, that code wants to know e.g. the Y displacement of something that moves precisely one unit per second per second after precisely 3,141,593 microseconds (a situation for which radians would seem to be ideal). Computing sin(3141593/1000000.0) would yield a result that was off by more than 1.3E-16 if the sine function were perfect [the division itself has a rounding error of more than 1.376E-16]. Code interested in accuracy should compute -sin((3141593-3141592)/1000000.0+1*6.5358979323846267E-7). – supercat Sep 30 '14 at 18:36

  1. Can floating point types hold the exact value of all integers up to the number of their significant digits in float.h?

Well, they can store the integers which fit in their mantissa (significand). So [-2^53, 2^53] for double. For more on this, see: Which is the first integer that an IEEE 754 float is incapable of representing exactly?
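
A quick sketch of that boundary, assuming an IEEE 754 double (53-bit significand):

#include <stdio.h>

int main(void) {
    double big = 9007199254740992.0; /* 2^53, exactly representable */
    printf("%d\n", big - 1.0 == 9007199254740991.0); /* 1: exact below 2^53 */
    printf("%d\n", big + 1.0 == big); /* 1: 2^53 + 1 rounds back to 2^53 */
    return 0;
}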

  2. Do all floating point operators and functions guarantee that the result is the closest to the actual mathematical result?

They at least guarantee that the result is one of the two floating point values immediately bracketing the actual mathematical result. That is, you won't get a result which has another valid floating point value between itself and the "actual" result. But beware: repeated operations may accumulate an error which seems counter to this, while it is not (because all intermediate values are subject to the same constraints, not just the inputs and outputs of a compound expression).
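
A sketch of that accumulation, assuming IEEE 754 doubles: each addition is individually as good as it can be, yet the running sum drifts away from the mathematical result:

#include <stdio.h>

int main(void) {
    double sum = 0.0;
    for (int i = 0; i < 10; i++)
        sum += 0.1;             /* each step rounds correctly... */
    printf("%d\n", sum == 1.0); /* ...but prints 0: error accumulated */
    printf("%.17g\n", sum);     /* 0.99999999999999989 */
    return 0;
}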

John Zwinck