39

I keep getting mixed answers about whether floating point numbers (i.e. float, double, or long double) have one and only one value of precision, or have a precision which can vary.

One topic called float vs. double precision seems to imply that floating point precision is an absolute.

However, another topic called Difference between float and double says,

In general a double has 15 to 16 decimal digits of precision

Another source says,

Variables of type float typically have a precision of about 7 significant digits

Variables of type double typically have a precision of about 16 significant digits

I don't like to refer to approximations like the above if I'm working with sensitive code that can break easily when my values are not exact. So let's set the record straight. Is floating point precision mutable or invariant, and why?

Wandering Fool
  • Just to clarify: you're asking if the precision of a floating point number will vary? – jaggedSpire May 29 '15 at 19:12
  • 11
    It is stored as binary internally, so decimal precision is not accurate. – n0rd May 29 '15 at 19:13
  • 3
    If you don't like approximations, use fixed-point math instead. – Michael Dorgan May 29 '15 at 19:13
  • 10
    The **about** is due to the conversion from significant **bits** to significant **digits**. – Degustaf May 29 '15 at 19:21
  • 3
    There's a nice series on floating point math on [this](https://randomascii.wordpress.com/2013/02/07/float-precision-revisited-nine-digit-float-portability/) blog. Due to the inexact conversion between binary and decimal representation, you're not going to really get a better answer than "about" so you might want to fully read up on the topic. – jaggedSpire May 29 '15 at 19:27
  • 1
    @MichaelDorgan: if you don't like approximations, you'll need to stick to _integer_ math. Fixed-point (though somewhat easier predictable than floating) is still just an approximation to the reals/rationals which are what you really want to express, in almost all interesting applications. And [it's typically a worse approximation](http://programmers.stackexchange.com/questions/87457/why-do-you-need-float-double/87520#87520) than floating-point! – leftaroundabout May 30 '15 at 09:53
  • It is slightly mutable in that denormals have less precision (the answers seem to be ignoring that, but that's probably OK). That's not what the 15 to 16 figure is referring to though. – harold May 30 '15 at 19:48
  • 1
    A related question that goes into more detail of floating-point precision theory is, [Is the most significant decimal digits precision that can be converted to binary and back to decimal without loss of significance 6 or 7.225?](http://stackoverflow.com/questions/30688422/is-the-most-significant-decimal-digits-precision-that-can-be-converted-to-binary). – Wandering Fool Jun 15 '15 at 19:47
  • Related: [Precision of Floating Point](http://stackoverflow.com/q/872544/183120). – legends2k Aug 26 '15 at 14:14

10 Answers

29

The precision is fixed, which is exactly 53 binary digits for double-precision (or 52 if we exclude the implicit leading 1). This comes out to about 15 decimal digits.
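
For a quick sanity check, here is a minimal C++ sketch (my addition, not part of the original answer) that asks std::numeric_limits for these counts; the values in the comments assume the usual IEEE-754 binary32/binary64 types:

#include <iostream>
#include <limits>

int main() {
    // Mantissa bits, including the implicit leading 1.
    std::cout << "float  bits: " << std::numeric_limits<float>::digits  << "\n";  // typically 24
    std::cout << "double bits: " << std::numeric_limits<double>::digits << "\n";  // typically 53

    // Decimal digits that always survive a decimal -> binary -> decimal round trip.
    std::cout << "float  digits10: " << std::numeric_limits<float>::digits10  << "\n";  // typically 6
    std::cout << "double digits10: " << std::numeric_limits<double>::digits10 << "\n";  // typically 15

    // Decimal digits needed to print every distinct value unambiguously.
    std::cout << "float  max_digits10: " << std::numeric_limits<float>::max_digits10  << "\n"; // typically 9
    std::cout << "double max_digits10: " << std::numeric_limits<double>::max_digits10 << "\n"; // typically 17
}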


The OP asked me to elaborate on why having exactly 53 binary digits means "about" 15 decimal digits.

To understand this intuitively, let's consider a less-precise floating-point format: instead of a 52-bit mantissa like double-precision numbers have, we're just going to use a 4-bit mantissa.

So, each number will look like: (-1)^s × 2^yyy × 1.xxxx (where s is the sign bit, yyy is the exponent, and 1.xxxx is the normalised mantissa). For the immediate discussion, we'll focus only on the mantissa and not the sign or exponent.

Here's a table of what 1.xxxx looks like for all xxxx values (all rounding is half-to-even, just like how the default floating-point rounding mode works):

  xxxx  |  1.xxxx  |  value   |  2dd  |  3dd  
--------+----------+----------+-------+--------
  0000  |  1.0000  |  1.0     |  1.0  |  1.00
  0001  |  1.0001  |  1.0625  |  1.1  |  1.06
  0010  |  1.0010  |  1.125   |  1.1  |  1.12
  0011  |  1.0011  |  1.1875  |  1.2  |  1.19
  0100  |  1.0100  |  1.25    |  1.2  |  1.25
  0101  |  1.0101  |  1.3125  |  1.3  |  1.31
  0110  |  1.0110  |  1.375   |  1.4  |  1.38
  0111  |  1.0111  |  1.4375  |  1.4  |  1.44
  1000  |  1.1000  |  1.5     |  1.5  |  1.50
  1001  |  1.1001  |  1.5625  |  1.6  |  1.56
  1010  |  1.1010  |  1.625   |  1.6  |  1.62
  1011  |  1.1011  |  1.6875  |  1.7  |  1.69
  1100  |  1.1100  |  1.75    |  1.8  |  1.75
  1101  |  1.1101  |  1.8125  |  1.8  |  1.81
  1110  |  1.1110  |  1.875   |  1.9  |  1.88
  1111  |  1.1111  |  1.9375  |  1.9  |  1.94

How many decimal digits do you say that provides? You could say 2, in that each value in the two-decimal-digit range is covered, albeit not uniquely; or you could say 3, which covers all unique values but does not provide coverage for all values in the three-decimal-digit range.

For the sake of argument, we'll say it has 2 decimal digits: the decimal precision will be the number of digits where all values of those decimal digits could be represented.


Okay, then, so what happens if we halve all the numbers (so we're using yyy = -1)?

  xxxx  |  1.xxxx  |  value    |  1dd  |  2dd  
--------+----------+-----------+-------+--------
  0000  |  1.0000  |  0.5      |  0.5  |  0.50
  0001  |  1.0001  |  0.53125  |  0.5  |  0.53
  0010  |  1.0010  |  0.5625   |  0.6  |  0.56
  0011  |  1.0011  |  0.59375  |  0.6  |  0.59
  0100  |  1.0100  |  0.625    |  0.6  |  0.62
  0101  |  1.0101  |  0.65625  |  0.7  |  0.66
  0110  |  1.0110  |  0.6875   |  0.7  |  0.69
  0111  |  1.0111  |  0.71875  |  0.7  |  0.72
  1000  |  1.1000  |  0.75     |  0.8  |  0.75
  1001  |  1.1001  |  0.78125  |  0.8  |  0.78
  1010  |  1.1010  |  0.8125   |  0.8  |  0.81
  1011  |  1.1011  |  0.84375  |  0.8  |  0.84
  1100  |  1.1100  |  0.875    |  0.9  |  0.88
  1101  |  1.1101  |  0.90625  |  0.9  |  0.91
  1110  |  1.1110  |  0.9375   |  0.9  |  0.94
  1111  |  1.1111  |  0.96875  |  1.   |  0.97

By the same criteria as before, we're now dealing with 1 decimal digit. So you can see how, depending on the exponent, you can have more or fewer decimal digits, because binary and decimal floating-point numbers do not map cleanly to each other.

The same argument applies to double-precision floating point numbers (with the 52-bit mantissa), only in that case you're getting either 15 or 16 decimal digits depending on the exponent.
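
If you want to reproduce the two tables above, here is a small C++ sketch (my own illustration, using the same 4-bit mantissa idea) that enumerates the sixteen 1.xxxx values for exponents 0 and -1 and prints them rounded to 2 and 3 significant decimal digits:

#include <cmath>
#include <cstdio>

int main() {
    for (int exp = 0; exp >= -1; --exp) {          // exponent yyy = 0, then yyy = -1
        std::printf("2^%d range:\n", exp);
        for (int x = 0; x < 16; ++x) {             // all sixteen xxxx mantissa patterns
            double value = (1.0 + x / 16.0) * std::ldexp(1.0, exp);  // 1.xxxx * 2^yyy
            // %.2g / %.3g print to 2 and 3 significant decimal digits
            // (printf rounding may differ slightly from the half-to-even
            //  rounding used in the tables above).
            std::printf("  %-7.5f -> %.2g / %.3g\n", value, value, value);
        }
    }
}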

C. K. Young
  • 2
    If you've ever worked with numbers in scientific notation using significant figures, floating point numbers are the binary equivalent of those. – jaggedSpire May 29 '15 at 19:19
  • 3
    @jaggedSpire Scientific notation usually lacks NaN, -0, +/-inf, and denormalized numbers. So, not quite equivalent. ;) – Yakk - Adam Nevraumont May 29 '15 at 19:20
  • 2
    You might want to expand on "about 15 decimal digits" since that was the question (the fact that depending on the number being represented the number of decimal digits may vary). – Guvante May 29 '15 at 19:22
  • 1
    @Yakk Fair enough. I was mostly talking about the changing accuracy of the system with respect to nearby integer values when the exponent is larger or smaller, which appears to be what Wandering Fool is asking. However, those exceptions are decidedly important. – jaggedSpire May 29 '15 at 19:22
  • 2
    This is only true if IEC 60559 floats are used by the compiler (__STDC_IEC_559__ defined). This is not necessarily true for e.g. embedded systems, especially those without a (compatible) FPU. – too honest for this site May 29 '15 at 19:35
  • @ChrisJester-Young This answer doesn't answer my question (specifically why?) and is just a duplicate of the answer found in **float vs. double precision** [1] which I included in my original post. Could you please expand on your answer? [1]: http://stackoverflow.com/questions/5098558/float-vs-double-precision – Wandering Fool May 29 '15 at 21:07
  • @WanderingFool Yes, I'll be happy to elaborate more when I get home. Had errands to run shortly after posting my answer. – C. K. Young May 29 '15 at 21:11
  • @ChrisJester-Young I'll wait until tomorrow to decide which answer I find the most helpful. A lot of this is complicated and will take me some time to verify and understand myself. – Wandering Fool May 29 '15 at 21:12
  • @WanderingFool I hope my added explanation, using a simplified floating-point format, makes it easier to understand and verify. :-) Please feel free to ask for any further clarification needed. – C. K. Young May 30 '15 at 18:08
  • @WanderingFool Also, if (as your question states) you care about precision in the decimal realm, you may prefer to use decimal floating-point if your language supports it. Then you'll have no uncertainty about the precision offered by your numbers. – C. K. Young May 30 '15 at 19:00
  • @ChrisJester-Young Finished reading through your wonderful explanation! I did however get caught up in this part `-1**s * 2**yyy * 1.xxxx`. Don't really know if `**` is implying multiplication or if it's formatting. It makes sense to multiply `2` by `yyy`. And maybe `-1 * s` if you mean when `s` is 1 the sign is -1 and when `s` is 0 the sign is 0? It might also be helpful to clarify that 1.xxxx is called a normalized mantissa. But besides that, you've shown in superb detail that converting from binary to decimal loses precision and changing the exponent of binary loses more precision. – Wandering Fool May 31 '15 at 21:50
  • @ChrisJester-Young You haven't mentioned yet whether or not the sign bit in binary can affect the precision of the number in decimal. This is something else I would be interested in knowing. Besides those clarifications I love your answer! – Wandering Fool May 31 '15 at 21:53
  • @WanderingFool `**` is exponentiation. I'll clarify more when I'm at a computer. – C. K. Young May 31 '15 at 21:57
  • 1
    @WanderingFool Okay, updated the post to use superscript for exponentiation now. :-) Anyway, the sign doesn't affect the precision at all: if `s == 0`, the number is positive, and if `s == 1`, the number is negative. That's all. – C. K. Young May 31 '15 at 22:16
  • I'm trying to understand how 2dd has complete coverage in the first table. The ones place varies between 0 and 9, while the tens place always stays as 1 and the values 2 through 9 aren't covered, right? – legends2k Aug 28 '15 at 10:50
  • @legends2k 2dd just means "2 decimal digits" (by which we mean significant digits, so e.g., in 0.5, only the 5 is significant, and the 0 is not). In the first table, the "2dd" refers to (in the case of 1.2, for example) the digits 1 and 2. – C. K. Young Oct 30 '15 at 02:59
25

All modern computers use binary floating-point arithmetic. That means we have a binary mantissa, which typically has 24 bits for single precision, 53 bits for double precision and 64 bits for extended precision. (Extended precision is available on x86 processors, but not on ARM or possibly other types of processors.)

24, 53, and 64 bit mantissas mean that for a floating-point number between 2^k and 2^(k+1), the next larger number is 2^(k-23), 2^(k-52) and 2^(k-63) respectively. That's the resolution. The rounding error of each floating-point operation is at most half of that.

So how does that translate into decimal numbers? It depends.

Take k = 0 and 1 ≤ x < 2. The resolution is 2^-23, 2^-52, and 2^-63, which is about 1.19×10^-7, 2.2×10^-16, and 1.08×10^-19 respectively. That's a bit less than 7, 16, and 19 decimals. Then take k = 3 and 8 ≤ x < 16. The difference between two floating-point numbers is now 8 times larger. For 8 ≤ x < 10 you get just over 6, less than 15, and just over 18 decimals respectively. But for 10 ≤ x < 16 you get one decimal more!

You get the highest number of decimal digits if x is only a bit less than 2^(k+1) and only a bit more than 10^n, for example 1000 ≤ x < 1024. You get the lowest number of decimal digits if x is just a bit higher than 2^k and a bit less than 10^n, for example 1/1024 ≤ x < 1/1000. The same binary precision can produce decimal precision that varies by up to 1.3 digits, or log10(2×10).
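
To see the same effect on a real machine, here is a short C++ sketch (my addition, assuming IEEE-754 binary64 doubles) that prints the spacing between adjacent doubles at a few magnitudes:

#include <cmath>
#include <cstdio>

static void spacing(double x) {
    // Distance from x to the next larger representable double: the resolution at x.
    double step = std::nextafter(x, INFINITY) - x;
    std::printf("near %-8g the spacing is %.3g (about %.1f decimal digits)\n",
                x, step, std::log10(x / step));
}

int main() {
    spacing(1.0);      // spacing 2^-52, about 2.22e-16
    spacing(8.0);      // spacing 2^-49: 8x coarser in absolute terms
    spacing(10.0);     // same absolute spacing as at 8.0
    spacing(1000.0);   // just below 1024 = 2^10
    spacing(1024.0);   // spacing doubles when we cross the power of two
}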

Of course, you could just read the article "What every computer scientist should know about floating-point arithmetic."

sfjac
gnasher729
  • I haven't had time to go through your math. But if it all checks out, this is a fine answer. Good job sir. – Wandering Fool May 29 '15 at 20:27
  • 2
    "All modern computers use binary floating-point arithmetic" over-states the situation. Select modern process directly support decimal floating point [Power 6,7,8](http://en.wikipedia.org/wiki/POWER8) come to mind. [IEEE 754-2008](http://en.wikipedia.org/wiki/IEEE_floating_point) added decimal FP formats over it predecessor spec. I see hardware decimal FP support expanding, albeit slowly. – chux - Reinstate Monica May 29 '15 at 21:40
  • @chux I'd feel comfortable with his claim that all modern computers use binary FP arithmetic, because those which offer decimal also always offer binary. From what I have been lead to understand, decimal is almost exclusively used in situations like banking where some precision (like 1/100th of a cent) has to be handled perfectly so we never gain nor lose anything. – Cort Ammon May 29 '15 at 23:03
  • @Cort Ammon Fair point that certainly all modern computers use binary. Yet as used here "That means we have a binary..." implies all modern computers only use binary. FP math, with now well defined decimal formats & hardware that is cheaper every year, offers new territory for us all and should not be relegated to the "nobody does that anymore and never will again" category. – chux - Reinstate Monica May 29 '15 at 23:18
  • @chux Even if computer hardware provided efficient decimal floating point, many _programming languages_ are specified to use binary floating-point. So, I don't think the dominance of binary FP is going away any time soon. – C. K. Young May 29 '15 at 23:34
  • 1
    @Chris Jester-Young Neither do I think dominance of binary FP is going away. My point is that considerations of decimal floating point are still valid. – chux - Reinstate Monica May 29 '15 at 23:49
  • 1
    Looks good. One fussy addition would be to point out that subnormal (aka denormalized) numbers have less precision. However these are very tiny numbers (unbelievably tiny) and the reduced precision there is generally not a concern. I've blogged extensively about floating-point. One relevant article is this one: https://randomascii.wordpress.com/2012/03/08/float-precisionfrom-zero-to-100-digits-2/ It discusses precision, including the difference between how many digits it takes to uniquely represent a float (nine) and how many decimal digits a float is guaranteed to represent (six). – Bruce Dawson Jun 01 '15 at 15:47
  • For those wondering where 2^(k - 23) came from, it's single precision float's binary [ULP](http://en.wikipedia.org/wiki/Unit_in_the_last_place) for a given exponent k i.e. resolution for a given k; see [here](http://stackoverflow.com/a/7017081/183120). – legends2k Aug 28 '15 at 10:07
9

80x86 code using its hardware coprocessor (originally the 8087) provides three levels of precision: 32-bit, 64-bit, and 80-bit. Those very closely follow the IEEE-754 standard of 1985. The recent standard specifies a 128-bit format. The floating point formats have 24, 53, 64, and 113 mantissa bits, which correspond to about 7.22, 15.95, 19.27, and 34.02 decimal digits of precision.

The formula is mantissa_bits / log_2 10 where the log base two of ten is 3.321928095.
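
As a quick illustration (my sketch, using the usual IEEE-754 bit counts), the figures above can be recomputed directly from that formula:

#include <cmath>
#include <cstdio>

int main() {
    const int mantissa_bits[] = {24, 53, 64, 113};   // binary32, binary64, x87 extended, binary128
    for (int bits : mantissa_bits) {
        // decimal digits = mantissa_bits / log2(10), since each decimal digit
        // carries log2(10) ~ 3.3219 bits of information
        std::printf("%3d bits -> %.2f decimal digits\n", bits, bits / std::log2(10.0));
    }
}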

While the precision of any particular implementation does not vary, it may appear to when a floating point value is converted to decimal. Note that the value 0.1 does not have an exact binary representation. It is a repeating bit pattern (0.0001100110011001100110011001100...) like we are used to in decimal for 0.3333333333333 to approximate 1/3.

Many languages don't support the 80-bit format. Some C compilers may offer long double which uses either 80-bit floats or 128-bit floats. Alas, it might also use a 64-bit float, depending on the implementation.

The NPU has 80 bit registers and performs all operations using the full 80 bit result. Code which calculates within the NPU stack benefits from this extra precision. Unfortunately, poor code generation (or poorly written code) might truncate or round intermediate calculations by storing them in a 32-bit or 64-bit variable.

wallyk
  • It's too bad ANSI C didn't define a means by which a `...` parameter could indicate whether floating-point arguments should be coerced to `double` or `long double`. The lack of such a feature makes it extremely difficult to consistently write correct code that uses floating-point expressions in "printf" on any platform where `long double` isn't a synonym for `double`. IMHO, that more than anything is responsible for the degradation of floating-point semantics since the 1980s. – supercat May 29 '15 at 22:14
  • @supercat: Well, at least C is consistent: The sizes of `short int`, `int`, and `long int` are similarly murky. – wallyk May 29 '15 at 23:34
  • 3
    @supercat What are you talking about? Float becomes double, double remains double, and long double remains long double, always. There's no ambiguity. You have to be explicit if you want a long double. – Random832 May 30 '15 at 00:36
  • @Random832: If library provides constants for things like "pi" or the number of centimeters per inch should they be `double` or `long double`? I would suggest that if all floating-point arguments to `printf` were converted to the longest supported floating-point type, *as was the original design intention of C*, there would be no reason not to declare constants to be of type `long double`; code `double f=CM_PER_INCH;` could do the conversion at compile-time; `double cm=inches*CM_PER_INCH;` might be slower than if `CM_PER_INCH` was `double`, but would likely be more accurate. – supercat May 30 '15 at 16:18
  • 1
    `M_PI` etc are required to be double. GNU libc provides `M_PIl` for the long double version. – Random832 May 30 '15 at 16:25
  • Code wishing to trade accuracy for speed could use `double cm=inches*(double)CM_PER_INCH`. There's only one common situation in which code would particularly have to know or care whether `CM_PER_INCH` was `double` or `long double` and that's in `printf`. If `CM_PER_INCH` is `long double`, then `printf("%8.4f %9.4f", size, size*CM_PER_INCH)`; will only work if the type `CM_PER_INCH` is synonymous with `double`. Since a fair number of libraries defined `long double` constants, and a fair amount of code using them would break if those constants weren't printf-compatible with `double`. – supercat May 30 '15 at 16:26
  • ...a less convenient format than `double`, given that it's a better computational format. The only disadvantage of 80-bit format is that it's somewhat awkward to store in RAM, but since the main purpose of the type is to hold short-term intermediate values that shouldn't really be a problem in situations where it's the proper type to use. – supercat May 30 '15 at 16:31
  • @Random832 -- There is nothing in either the C or C++ that requires M_PI to be defined as a float, a double, or a long double. There is nothing in either the C or C++ that requires M_PI to be defined, period. POSIX has such a requirement, but the C and C++ standards do not. – David Hammen May 30 '15 at 21:24
  • @wallyk I've never seen the formula `mantissa_bits / log_2 10` to calculate decimal precision before, but I like it! I would like to know how to derive it. I tried to derive it myself but I'm doing it wrong. Would you happen to know its derivation or a reference to it online? – Wandering Fool Jun 01 '15 at 02:38
  • @WanderingFool: It seems straightforward and obvious to me, but that could be because a math teacher derived it for me in high school using basic exponential relationships. I searched a little bit just now, but only see other people applying the relation but not deriving it or giving a source for it. – wallyk Jun 01 '15 at 08:36
  • 1
    @wallyk I asked about the above formula on another stackexchange server under mathematics. If you or anyone else cares to explain this to me, you can answer my topic [How to understand or derive the formula Mantissa bits / log2 10 = Decimal digits of precision?](http://math.stackexchange.com/questions/1308211/how-to-understand-or-derive-the-formula-mantissa-bits-log-2-10). – Wandering Fool Jun 01 '15 at 20:02
  • 1
    @wallyk This question, [Is the most significant decimal digits precision that can be converted to binary and back to decimal without loss of significance 6 or 7.225?](http://stackoverflow.com/questions/30688422/is-the-most-significant-decimal-digits-precision-that-can-be-converted-to-binary) might be of some interest to you. The answer explains in more detail what the 7.225 value is. – Wandering Fool Jun 15 '15 at 19:38
8

Is floating point precision mutable or invariant, and why?

Typically, given any numbers in the same power-of-2 range, the floating point precision is invariant - a fixed value. The absolute precision changes with each power-of-2 step. Over the entire FP range, the precision is approximately relative to the magnitude. Relating this relative binary precision to a decimal precision incurs a wobble varying between DBL_DIG and DBL_DECIMAL_DIG decimal digits - typically 15 to 17.


What is precision? With FP, it makes most sense to discuss relative precision.

Floating point numbers have the form of:

Sign * Significand * pow(base,exponent)

They have a logarithmic distribution. There are about as many different floating point numbers between 100.0 and 3000.0 (a range of 30x) as there are between 2.0 and 60.0. This is true regardless of the underlying storage representation.

1.23456789e100 has about the same relative precision as 1.23456789e-100.


Most computers implement double as binary64. This format has 53 bits of binary precision.

The n numbers between 1.0 and 2.0 have the same absolute precision of (2.0 - 1.0)/pow(2,52).
The n numbers between 64.0 and 128.0 have the same absolute precision of (128.0 - 64.0)/pow(2,52).

Each group of numbers between adjacent powers of 2 has the same absolute precision.

Over the entire normal range of FP numbers, this approximates a uniform relative precision.

When these numbers are represented in decimal, the precision wobbles: numbers 1.0 to 2.0 have 1 more bit of absolute precision than numbers 2.0 to 4.0, 2 more bits than 4.0 to 8.0, etc.

C provides DBL_DIG, DBL_DECIMAL_DIG, and their float and long double counterparts. DBL_DIG indicates the minimum relative decimal precision. DBL_DECIMAL_DIG can be thought of as the maximum relative decimal precision.

Typically this means a given double will have 15 to 17 decimal digits of precision.

Consider 1.0 and its next representable double: the digits do not change until the 17th significant decimal digit. Adjacent doubles in this range are pow(2,-52), or about 2.2204e-16, apart.

/*
1 234567890123456789 */
1.000000000000000000...
1.000000000000000222...

Now consider "8.521812787393891"and its next representable number as a decimal string using 16 significant decimal digits. Both of these strings, converted to double are the same 8.521812787393891142073699... even though they differ in the 16th digit. Saying this double had 16 digits of precision was over-stated.

/*
1 234567890123456789 */
8.521812787393891
8.521812787393891142073699...
8.521812787393892
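
A small C++ sketch (mine, assuming IEEE-754 binary64 and a correctly rounding strtod) showing both effects: adjacent doubles differ only at the 17th significant digit, while two distinct 16-digit decimal strings can parse to the same double:

#include <cmath>
#include <cstdio>
#include <cstdlib>

int main() {
    // Adjacent doubles near 1.0 differ only at the 17th significant digit.
    double one  = 1.0;
    double next = std::nextafter(one, 2.0);
    std::printf("%.17g\n%.17g\n", one, next);   // 1 and 1.0000000000000002

    // Two 16-digit decimal strings that parse to the *same* double.
    double a = std::strtod("8.521812787393891", nullptr);
    double b = std::strtod("8.521812787393892", nullptr);
    std::printf("equal: %s\n", a == b ? "yes" : "no");   // yes on IEEE-754 doubles
}
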
Wandering Fool
chux - Reinstate Monica
  • You've given me more to think about. I've read through all of the answers so far and I'm finding inconsistencies and contradictions between some of these answers. I'm gonna work on verifying all these claims and proofs today and hold off from choosing what I believe is the best answer for tomorrow. – Wandering Fool May 29 '15 at 21:21
  • @Wandering Fool Wise move for a "fool". You have hit a subtle, yet deep question - would not even mind if you waited over the week-end before selecting. Likely more good answers are yet to arrive. A lot of the complexity comes from the fact that people think in decimal yet computers use binary, math has infinite precision and computers are finite. – chux - Reinstate Monica May 29 '15 at 21:29
  • "If the fool would persist in his folly he would become wise" - William Blake, 1776, Proverbs of Hell. – Wandering Fool May 29 '15 at 21:54
6

No, it is variable. The starting point is the very weak IEEE-754 standard, which only nailed down the format of floating point numbers as they are stored in memory. You can count on 7 digits of precision for single precision and 15 digits for double precision.

But a major flaw in that standard is that it does not specify how calculations are to be performed. And there's trouble: the Intel 8087 floating point processor in particular has caused programmers many sleepless nights. A significant design flaw in that chip is that it stores floating point values with more bits than the memory format: 80 bits instead of 32 or 64. The theory behind that design choice is that this allows intermediate calculations to be more precise and cause less round-off error.

That sounds like a good idea, but it did not turn out well in practice. A compiler writer will try to generate code that leaves intermediate values stored in the FPU as long as possible; that is important for speed, since storing a value back to memory is expensive. The trouble is that values often must be stored back anyway: the number of registers in the FPU is limited and the code might cross a function boundary. At that point the value gets truncated and loses a lot of precision. Small changes to the source code can now produce drastically different values, and the non-optimized build of a program produces different results from the optimized one, in a completely undiagnosable way; you'd have to look at the machine code to know why the result is different.

Intel redesigned their processor to solve this problem: the SSE instruction set calculates with the same number of bits as the memory format. It was slow to catch on, however, since redesigning the code generator and optimizer of a compiler is a significant investment. The big three C++ compilers have all switched. But, for example, the x86 jitter in the .NET Framework still generates FPU code, and it always will.


Then there is systemic error, losing precision as an inevitable side-effect of conversion and calculation. Conversion first: humans work with numbers in base 10, but the processor uses base 2. Nice round numbers we use, like 0.1, cannot be converted to nice round numbers on the processor. 0.1 is perfect as a sum of powers of 10, but there is no finite sum of powers of 2 that produces the same value. Converting it produces an infinite sequence of 1s and 0s, in the same manner that you can't perfectly write down 10 / 3. So it needs to be truncated to fit the processor, and that produces a value that's off by +/- 0.5 bit from the decimal value.

And calculation produces error. A multiplication or division doubles the number of bits in the result; rounding it to fit back into the stored value produces a +/- 0.5 bit error. Subtraction is the most dangerous operation and can cause the loss of a lot of significant digits. If you, say, calculate 1.234567f - 1.234566f then the result has only 1 significant digit left. That's a junk result. Summing the difference between numbers that have nearly the same value is very common in numerical algorithms.
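
A tiny C++ illustration of that subtraction (my addition; it assumes IEEE-754 single precision with round-to-nearest):

#include <cstdio>

int main() {
    float a = 1.234567f;          // ~7 significant decimal digits to start with
    float b = 1.234566f;
    float diff = a - b;           // catastrophic cancellation: the leading digits vanish
    std::printf("%.9g\n", diff);  // prints about 1.07288361e-06, not the 1e-06 you might expect
}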

Getting excessive systemic errors is ultimately a flaw in the mathematical model. Just as an example, you never want to use Gaussian elimination; it is very unfriendly to precision. Always consider an alternative approach; LU decomposition is an excellent one. It is, however, not that common that a mathematician was involved in building the model and accounted for the expected precision of the result. A common book like Numerical Recipes also doesn't pay enough attention to precision, although it indirectly steers you away from bad models by proposing better ones. In the end, a programmer often gets stuck with the problem. Well, if it was easy then anybody could do it and I'd be out of a good paying job :)

Hans Passant
  • [Single-precision floating-point IEEE-754 standard](http://en.wikipedia.org/wiki/Single-precision_floating-point_format) says, "This gives from 6 to 9 significant decimal digits precision." Is this in error? Because you said I can count on 7 digits of precision for single precision using the IEEE-754 standard. – Wandering Fool Jun 06 '15 at 20:16
  • Ugh, Wikipedia. They got it right just a bit further down, 7.225 digits. – Hans Passant Jun 06 '15 at 20:19
  • This is an important part of floating point that I'm still trying to understand (the minimum guaranteed precision). The part where it says 7.225 decimal digits, it calls that total precision for single-precision floating-point format. What does total precision mean? Does it mean the absolute minimum precision? Is it average precision or something else? If you know the answer, could you also share a reference to this 7.225 value that explains what it really is? This is the very last thing I'm hung up on in floating point theory. – Wandering Fool Jun 06 '15 at 20:25
  • 1
    It is the maximum guaranteed precision. There is no minimum, I pointed out how calculations lose precision. In a hurry for some subtractions, always half a bit due to round-off for other calculations. – Hans Passant Jun 06 '15 at 20:31
  • Yes, the maximum guaranteed precession, sorry I worded that incorrectly. Assuming Wikipedia got the value of maximum guaranteed precision of 6 wrong, could you refer me to another source that explains what the maximum guaranteed decimal precision of floating point numbers are? – Wandering Fool Jun 06 '15 at 20:48
  • It is just simple math. A float has 24 binary bits in the mantissa so can represent pow(2,24) distinct values. Which is log10(16777216) = 7.2 decimal digits. – Hans Passant Jun 06 '15 at 20:56
  • I'm confused again, you say there is no minimum guaranteed precision, what does that mean? Are you saying that floating point numbers could have less than 6 precision? I'm misunderstanding the meaning. – Wandering Fool Jun 06 '15 at 21:03
  • 2
    Every calculation loses precision due to rounding the result and fitting it back into 24 bits. So if you do a multiplication then the result is precise to 23.5 binary bits +/- 0.5 bit. Use the same simple math, you now have 7.07 decimal digits of precision. Do another multiplication, you now have 23 binary bits +/- 1 bit or 6.92 decimal digits of precision. This keeps going down the more you calculate. Faster with subtraction. There is no minimum since it entirely depends on the calculation. – Hans Passant Jun 06 '15 at 21:07
  • Ok, that makes a lot more sense. If I perform no operations on my floating-point number and convert it from decimal to binary to decimal again without losing precision, would the range be what was stated on the wiki page to be 6 to 9? If not what would it be and How would I calculate this? – Wandering Fool Jun 06 '15 at 21:12
  • You should know enough by now to realize that the Wikipedia statement is nonsense. You cannot get 9 digits of precision out of a float, only 7. – Hans Passant Jun 06 '15 at 21:19
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/79867/discussion-between-wandering-fool-and-hans-passant). – Wandering Fool Jun 06 '15 at 21:20
  • For anyone who is confused about the 7.225 and 6 values, check out the answer to this question, [Is the most significant decimal digits precision that can be converted to binary and back to decimal without loss of significance 6 or 7.225?](http://stackoverflow.com/questions/30688422/is-the-most-significant-decimal-digits-precision-that-can-be-converted-to-binary). It explains why the correct value is 6. – Wandering Fool Jun 15 '15 at 19:33
5

The type of a floating point variable defines what range of values and how many fractional bits (!) can be represented. As there is no integer relation between decimal and binary fractions, the decimal fraction is actually an approximation.

Second: another problem is the precision with which arithmetic operations are performed. Just think of 1.0/3.0 or PI. Such values cannot be represented with a limited number of digits - neither decimal nor binary. So the values have to be rounded to fit into the given space. The more fractional digits are available, the higher the precision.

Now think of multiple such operations being applied, e.g. PI/3.0. This requires rounding twice: PI as such is not exact, and neither is the result. This loses precision twice; if repeated, it becomes worse.

So, back to float and double: according to the standard (C11, Annex F, also for the rest), float has fewer bits available, so rounding will be less precise than for double. Just think of having a decimal type with 2 fractional digits (m.ff, call it float) and one with four (m.ffff, call it double). If double is used for all calculations, you can perform more operations before your result has only 2 correct fractional digits than if you start with float, even if a float result would suffice.
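
As a rough C++ illustration of that point (my sketch; the exact digits depend on the platform), summing the same value a million times in float and in double:

#include <cstdio>

int main() {
    float  f = 0.0f;
    double d = 0.0;
    for (int i = 0; i < 1000000; ++i) {   // one million additions of 0.1
        f += 0.1f;                         // each step rounds to ~24 bits
        d += 0.1;                          // each step rounds to ~53 bits
    }
    // The exact sum would be 100000; float has drifted far more than double.
    std::printf("float : %.7f\n", f);      // e.g. 100958.34...
    std::printf("double: %.7f\n", d);      // e.g. 100000.0000013...
}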

Note that on some (embedded) CPUs like the ARM Cortex-M4F, the hardware FPU only supports float (single precision), so double arithmetic will be much more costly. Other MCUs have no hardware floating point unit at all, so it has to be simulated in software (very costly). On most GPUs, float is also much cheaper to compute with than double, sometimes by more than a factor of 10.

too honest for this site
5

The storage has a precise digit count in binary, as other answers explain.

One thing to know: the CPU can run operations at a higher precision internally, such as 80 bits. That means code like the following can trigger the exception:

void Kaboom( float a, float b, float c ) // the same is true for other floating point types
{
    float sum1 = a + b + c;   // may be evaluated entirely in the wider internal precision
    float sum2 = a + b;
    sum2 += c;                // assume the compiler spilled sum2 to memory (32 bits) and reloaded it
    if (sum1 != sum2)
        throw "kaboom";       // this can happen
}

It is more likely with more complex computations.
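
A common way to defend against this (my addition, not from this answer) is to compare with a relative tolerance instead of ==; the epsilon below is only a placeholder and must be chosen for your problem:

#include <cmath>
#include <cstdio>

// Sketch of a tolerance-based comparison; the right epsilon is problem-specific.
bool nearlyEqual(float a, float b, float relEps = 1e-5f) {
    return std::fabs(a - b) <= relEps * std::fmax(std::fabs(a), std::fabs(b));
}

int main() {
    float sum1 = 0.1f + 0.2f + 0.3f;
    float sum2 = 0.1f + 0.2f;
    sum2 += 0.3f;
    std::printf("%s\n", nearlyEqual(sum1, sum2) ? "close enough" : "different");
}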

Wandering Fool
galop1n
4

I'm going to add the off-beat answer here, and say that since you've tagged this question as C++, there is no guarantee whatsoever about precision of floating point data. The vast majority of implementations use IEEE-754 when implementing their floating point types, but that is not required. The only thing required by the C++ language is that (C++ spec §3.9.1.8):

There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined. Integral and floating types are collectively called arithmetic types. Specializations of the standard template std::numeric_limits (18.3) shall specify the maximum and minimum values of each arithmetic type for an implementation.
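
If you want to know whether your implementation actually uses IEEE-754, a minimal C++ check (my addition) is:

#include <iostream>
#include <limits>

int main() {
    // true when float/double follow IEC 60559 (IEEE-754); implementation-defined otherwise
    std::cout << std::boolalpha
              << "float  is IEEE-754: " << std::numeric_limits<float>::is_iec559  << "\n"
              << "double is IEEE-754: " << std::numeric_limits<double>::is_iec559 << "\n";
}
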
MuertoExcobito
  • 3
    "there is no guarantee whatsoever about precision of floating point data" does not consider the C spec about `DBL_DIG`, and others. They effectively describe the minimum decimal precision of a `float`, `double`, etc. (It is precisely defined in §5.2.4.2.2) Further, the value is at least 10. So a C program can be confident that `double` is **guaranteed** to have at least 10 decimal digits of precision. – chux - Reinstate Monica May 29 '15 at 23:02
3

The amount of space required to store a float will be constant, and likewise a double; the amount of useful precision will in relative terms generally vary, however, between one part in 2^23 and one part in 2^24 for float, or one part in 2^52 and 2^53 for double. Precision very near zero isn't that good, with the second-smallest positive value being twice as big as the smallest, which will in turn be infinitely greater than zero. Throughout most of the range, however, precision will vary as described above.

Note that while it often isn't practical to have types whose relative precision varies by less than a factor of two throughout their range, the variation in precision can sometimes cause calculations to yield much less accurate results than it would appear they should. Consider, for example, 16777215.0f + 4.0f - 4.0f. All of the values would be precisely representable as float using the same scale, and the nearest values to the large one are +/- one part in 16,777,215, but the first addition yields a result in a part of the float range where values are separated by one part in only 8,388,610, causing the result to be rounded to 16,777,220. Consequently, subtracting 4 yields 16,777,216 rather than 16,777,215. For most values of float near 16777216, adding 4.0f and subtracting 4.0f would yield the original value unchanged, but the changing precision right at the break-over point causes the result to be off by an extra bit in the lowest place.
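
Here is that example as a runnable C++ snippet (my addition; it assumes IEEE-754 single precision with round-half-to-even):

#include <cstdio>

int main() {
    float x = 16777215.0f;            // exactly representable: 2^24 - 1
    float y = x + 4.0f;               // 16777219 is not representable; rounds to 16777220
    float z = y - 4.0f;               // 16777216, not the original 16777215
    std::printf("%.1f -> %.1f -> %.1f\n", x, y, z);   // 16777215.0 -> 16777220.0 -> 16777216.0
}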

supercat
0

Well, the answer to this is simple but complicated. These numbers are stored in binary. Depending on whether it is a float or a double, the computer uses a different number of binary digits to store the number. The precision you get depends on that binary representation. If you don't know how binary numbers work, it would be a good idea to look it up. But simply put, some numbers need more ones and zeros than others.

So the precision is fixed (same number of binary digits), but the actual precision that you get depends on the numbers that you are using.

Wandering Fool
NendoTaka
  • The 'actual precision' is the binary precision, and is fixed. The precision when converted to a form which does not reflect that which the number is actually stored in shouldn't be called the 'actual precision' – Pete Kirkham May 30 '15 at 10:23