C++ Float Division and Precision

Question

I know that 511 divided by 512 actually equals 0.998046875. I also know that the precision of floats is 7 digits. My question is, when I do this math in C++ (GCC) the result I get is 0.998047, which is a rounded value. I'd prefer to just get the truncated value of 0.998046, how can I do that?

  float a = 511.0f;
  float b = 512.0f;
  float c = a / b;

Can't you use doubles for extra precision and truncate that? — Andrei, May 14 '11 at 16:38
This is game code and while double would solve the problem as stated, I'm doing this calculation for texture rendering and a double would probably add a performance hit. The problem is, the rounding is causing one pixel offset in the textures. — Nick Gotch, May 14 '11 at 16:51
Your comment reveals that you don't really know what you're doing. "One pixel offset in the textures"? Tell us more about that, and perhaps we can help. — TonyK, May 14 '11 at 16:55
@Nick - Maybe if you show us the code causing the 1-pixel error, we can help you with that (as a separate question, probably...) — Dietrich Epp, May 14 '11 at 16:56
Don't be too sure that `double`s would cause a performance hit. On many systems when you use `float` it actually converts everything to `double`, does all the math, then converts back to `float` -- so it's actually doing more work when you use `float`. — QuantumMechanic, May 14 '11 at 16:56
Double precision is usually only a significant performance hit when processing a large amount of data, simply because twice as much needs to be moved in and out of memory. The actual floating-point operations probably take the same amount of time. You may be better off doing this with fixed-point arithmetic to avoid unpredictable floating-point artefacts. — Clifford, May 14 '11 at 17:10
As @Dietrich pointed out in the solution, the answer to this question was due to the formatting of the print out in the debugger. That means it's not this value causing the texture offset. I might post a follow-up question after I explore this some more. Thanks all! — Nick Gotch, May 14 '11 at 17:11
@QuantumMechanic - This is not true on x86/x64 and it is not true on PowerPC. What systems are you talking about? (And conversions between `float` and `double` are basically free anyway). — Dietrich Epp, May 14 '11 at 17:18
@Clifford - No, the operations do not take the same amount of time. They are implemented with different opcodes on most FPUs and they do different amounts of work. — Dietrich Epp, May 14 '11 at 17:20
@Dietrich: As I explained, loading registers from memory may take longer (though probably not if using a 64-bit OS), but on a modern x86 processor most FPU instructions (if not all; I have not checked exhaustively) do not differ in processor cycles between single and double precision. For example, [FDIV](http://home.comcast.net/~fbui/intel_f.html#fdiv) is identical for mem32 and mem64 operands. Different amounts of work perhaps, but the width of the Pentium FPU execution engine is 80 bits; that's more transistors to switch maybe, but the process is parallel, not sequential. Other FPUs vary. — Clifford, May 15 '11 at 20:41

score 23 · Accepted Answer · answered May 14 '11 at 16:45

Well, here's one problem. The value of 511/512, as a float, is exact. No rounding is done. You can check this by asking for more than seven digits:

#include <stdio.h>
int main(int argc, char *argv[])
{
    float x = 511.0f, y = 512.0f;
    printf("%.15f\n", x/y);
    return 0;
}

Output:

0.998046875000000

A float is stored in binary, not in decimal. If you divide a number by a power of 2, such as 512, the result will almost always be exact, because only the exponent changes (the exception is a result that underflows). The precision of a float is therefore not simply 7 digits; it is really 23 bits of precision.

See What Every Computer Scientist Should Know About Floating-Point Arithmetic.

Dietrich Epp


24 bits due to the fact that it is possible to get one more bit by keeping the number normalized. – AProgrammer May 14 '11 at 17:00
Exactly. The only rounding that occurs in the questioner's example is when he prints out the value. And like @AProgrammer said, it has **24** bits of precision. – Stephen Canon May 14 '11 at 17:01
This answers this question even though I still have the pixel offset problem in my original code, but thanks for the help! – Nick Gotch May 14 '11 at 17:09
Mathematically, it's 7.22 decimal digits of precision, however, due to digit slicing, it is necessary to use up to 9 decimal digits to represent a particular float. See my answer [here](http://stackoverflow.com/questions/4738768/printing-double-without-losing-precision/4738909#4738909) – ThomasMcLeod May 14 '11 at 17:27
@ThomasMcLeod, 6.92, not 7.22. For instance 0x1.0624d2p-10=9.99999349e-04 and 0x1.0624d4p-10=9.99999465e-04 are two successive floats, so representing 9.999994e-04 is problematic. – AProgrammer May 15 '11 at 07:22
@AProgrammer, you're correct, it's 6.92 (23 log 2). However, this still requires 8 decimal digits to differentiate between adjacent floats (since the .92 can be part of 2 decimal digits). – ThomasMcLeod May 15 '11 at 15:50
@AProgrammer: you forget the implicit bit: it's actually 24 bits of precision so 7.2247something is the right answer. – Olof Forshell May 17 '11 at 11:47
@Olof, please look at the history. I'm the one who mentioned the implicit bit. The formula to use is floor((p-1)log_10 b) (+ 1 if b is a power of 10). I gave an example why. Here is another one (for 80 bits long double): 0x8.3126e978d4fdf39p-13 = 9.99999999999999999747e-04 and 0x8.3126e978d4fdf3ap-13 = 9.99999999999999999853e-04 – AProgrammer May 17 '11 at 11:55

score 5 · Answer 2 · answered May 14 '11 at 16:57

> I also know that the precision of floats is 7 digits.

No. The most common floating point format is binary and has a precision of 24 bits. That is somewhere between 6 and 7 decimal digits, but you can't think in decimal if you want to understand how rounding works.

As b is a power of 2, c is exactly representable. It is during the conversion to a decimal representation that rounding occurs. The standard ways of getting a decimal representation don't offer the possibility to use truncation instead of rounding. One way would be to ask for one more digit and ignore it.

But note that the fact that c is exactly representable is a property of its value. Some apparently simpler values (like 0.1) don't have an exact representation in binary FP formats.

AProgrammer


24 bits of precision is not "between 6 and 7 decimal digits" because the range 0 to 2^24-1 equals 0 to 16777215 so the right answer is between 7 and 8 digits since 7 digits (9999999) is obviously less than 16777215 and 8 digits (99999999) is obviously more than 16777215. – Olof Forshell May 17 '11 at 11:33
@Olof, 0x1.0624d2p-10=9.99999349e-04 and 0x1.0624d4p-10=9.99999465e-04 are two successive floats, so representing 9.999994e-04 is problematic and you don't have 7 decimal digits of precision. – AProgrammer May 17 '11 at 11:39
@OlofForshell, your analysis is straightforward but incorrect. Because the binary values and decimal values don't line up precisely, it's possible to skip a value even though the range is larger. It takes a range 2x what you think you need in order to eliminate this possibility, thus you lose a bit. – Mark Ransom Dec 15 '11 at 22:44
@Mark Ransom: 16777215 is the largest odd integer that may be represented as a float. This is because it corresponds to 2^24-1, i.e. it contains binary ones in a row, which corresponds to the 24 (23 explicit + 1 implicit) bits in the float significand. Beginning with 16777216, every other integer may be represented up to 2^25-2. Actually the ranges are "0 to 2^24-2^0 by 2^0" followed by "2^24 to 2^25-2^1 by 2^1", "2^25 to 2^26-2^2 by 2^2" and so on. – Olof Forshell Dec 16 '11 at 15:40

Olof Forshell · Answer 3 · 2014-04-22T07:31:09.700

Your question is not unique, it has been answered numerous times before. This is not a simple topic and just because answers are posted doesn't necessarily mean they'll be of good quality. If you browse a little you'll find the really good stuff. And it will take you less time.

I bet someone will -1 me for commenting and not answering.

_____ Edit _____

What is fundamental to understanding floating point is to realize that everything is represented in binary digits. Because most people have trouble grasping this, they try to see it from the point of view of decimal digits.

On the subject of 511/512, you can start by looking at the value 1.0. In floating point this could be expressed as i.000000... * 2^0, or implicit bit set (to 1) multiplied by 2^0, i.e. equal to 1. Since 511/512 is less than 1, you need to start with the next lower power, -1, giving i.000000... * 2^-1, i.e. 0.5. Notice that the only thing that has changed is the exponent. If we want to express 511 in binary we get 9 ones, 111111111, or in floating point with implicit bit i.11111111, which we can divide by 512 and put together with the exponent of -1, giving i.1111111100... * 2^-1.

How does this translate to 0.998046875?

Well, to begin with, the implicit bit represents 0.5 (or 2^-1), the first explicit bit 0.25 (2^-2), the next explicit bit 0.125 (2^-3), then 0.0625, 0.03125 and so on until you've reached the ninth bit (eighth explicit). Sum them up and you get 0.998046875. From the i.11111111 we find that this number represents 9 binary digits of precision and, coincidentally, 9 decimal digits of precision.

If you multiply 511/512 by 512 you will get i.1111111100... * 2^8. Here there are the same nine binary digits of precision but only three decimal digits (for 511).

Consider i.11111111111111111111111 (i + 23 ones) * 2^-1. We will get a fraction, (2^24-1)/(2^24), with 24 binary and 24 decimal digits of precision. Given appropriate printf formatting, all 24 decimal digits will be displayed. Multiply it by 2^24 and you still have 24 binary digits of precision but only 8 decimal (for 16777215).

Now consider i.1111100... * 2^2 which comes out to 7.875. i11 is the integer part and 111 the fraction part (111/1000 or 7/8ths). 6 binary digits of precision and 4 decimal.

Thinking decimal when doing floating-point is utterly detrimental to understanding it. Free yourself!

-1 for a text that can be reused verbatim under many questions. — Evgeni Sergeev, Apr 21 '14 at 07:22
@EvgeniSergeev: Help yourself! This is about some of the intricacies of floating-point math. From my personal experience on the subject and considering that more or less the same questions are posted over and over again I'd say that this is a topic that ranks well above average in complexity or, perhaps, perceived complexity. Those who answer often seem to be most interested in points but have relatively little actual subject knowledge to share - it is often incorrect too. — Olof Forshell, Apr 21 '14 at 08:45
@OlofForshell I guess the idea is that you should be specific, and you can be here, by downvoting and commenting on incorrect answers. I'm not saying it's up to you to do that alone, but over time the community will bring the more correct and useful answers to the top. It's a good feature of this site, that it would still work well even if this page was flooded with incorrect answers. — Evgeni Sergeev, Apr 21 '14 at 16:10

Clifford · Answer 4 · 2011-05-14T17:12:01.080

That 'rounded' value is most likely what is displayed through some output method rather than what is actually stored. Check the actual value in your debugger.

With iostream and stdio, you can specify the precision of the output. If you specify 7 significant digits, convert it to a string, then truncate the string before display, you will get the output without rounding.

Can't think of one reason why you would want to do that however, and given the subsequent explanation of the application, you'd be better off using double precision, though that will most likely simply shove the problems somewhere else.

score 0 · Answer 5 · answered May 14 '11 at 16:36

If you are just interested in the value, you could use double and then multiply the result by 10^6 and floor it. Divide again by 10^6 and you will get the truncated value.

Shamim Hafiz - MSFT

