What is wrong with my IEEE 754 floating point representation?

Question

For homework, I have been asked to represent the decimal 0.1 in IEEE 754 representation. Here are the steps I took:

However, online converters and this answer on Stack Exchange suggest otherwise. They give this solution:

s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm
0 01111011 10011001100110011001101

The difference is the last bit on the right. Why is it 1101 rather than 1100?

Because of rounding. Since the number has no finite representation in binary, the closest approximation is stored. — njuffa, Jul 04 '16 at 14:44

Pascal Cuoq · Accepted Answer · 2016-07-04T16:39:22.393

6

As njuffa said in a comment, rounding is the explanation for the difference you see. Converters usually produce the nearest floating-point value to the decimal number you put in. The IEEE 754 standard recommends that the rounding mode be taken into account for conversions from one base to another (such as from decimal to binary), and the default rounding mode is “to nearest”.

The two closest single-precision floating-point values to 1/10 are 1.10011001100110011001100×2^-4 and 1.10011001100110011001101×2^-4 (below and above 1/10). The digits that are cut off are “11001100…”, indicating that the real 1/10 is closer to the upper bound than to the lower bound (if the remaining digits had been “100000000…”, the real number would have been exactly in between the two). For this reason, the upper value 1.10011001100110011001101×2^-4 is chosen as the conversion of 1/10 to binary32 when converting in round-to-nearest mode.
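
The claim above can be checked with a short Python sketch (not part of the original answer). The helper name `binary32_bits` and the variables `lower`, `upper` and `tenth` are made up for illustration. It assumes that `struct.pack` with the `f` format rounds to nearest, which is the usual behaviour, and it goes through Python's binary64 `0.1` on the way, which happens to give the same result here.

```python
import struct
from fractions import Fraction

def binary32_bits(x):
    """Return the sign, exponent and fraction fields of x stored as binary32."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an integer.
    (n,) = struct.unpack(">I", struct.pack(">f", x))
    bits = f"{n:032b}"
    return bits[0], bits[1:9], bits[9:]

print(*binary32_bits(0.1))
# 0 01111011 10011001100110011001101

# The two candidates from the answer, as exact rationals:
# a 24-bit significand (with the leading 1) scaled by 2^-23 * 2^-4 = 2^-27.
lower = Fraction(0b110011001100110011001100, 2**27)
upper = Fraction(0b110011001100110011001101, 2**27)
tenth = Fraction(1, 10)

print(tenth - lower)  # distance to the lower neighbour
print(upper - tenth)  # distance to the upper neighbour (the smaller one)
```

Working in `Fraction` keeps the comparison exact, so no further rounding can blur which neighbour is closer: the lower one is four times as far from 1/10 as the upper one.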

edited Jul 04 '16 at 16:39

answered Jul 04 '16 at 16:33

Pascal Cuoq


1

In other words, it makes sense to generate a few more bits to the right, and then inspect their values to see whether you should round and in which direction. @GeorgeStop: Read up about guard, round and sticky bits. – Rudy Velthuis Jul 04 '16 at 18:21
@RudyVelthuis I do not see what guard, round and sticky bits have to do with conversion between decimal and binary. We have already had a discussion about these conversions: can you provide this time a reference to how they are supposed to work? – Pascal Cuoq Jul 04 '16 at 20:18
These are extra bits used for rounding. As I said, I would calculate a few bits more and use these extra bits (on the right) to do the proper rounding. There are certain formulae for when to round up by one least significant bit and when not. One example: http://stackoverflow.com/a/8984135/95954 . – Rudy Velthuis Jul 04 '16 at 20:22
@RudyVelthuis I know what these three bits are. In order to be sufficient, the last one has to be computed according to special rules (it's “sticky”), and this works well enough for basic arithmetic operations on binary operands, but I do not see what this has to do with conversion between decimal and binary. Your link does not show how they are useful for that either. – Pascal Cuoq Jul 04 '16 at 20:27
@RudyVelthuis If you prefer a concrete example, here is one: I have the number 1e-250 and I would like to compute its nearest binary64 representation. At what point of the computation do the guard, round and sticky bit help? – Pascal Cuoq Jul 04 '16 at 20:29
FWIW, I use these for my `BigDecimal` implementation for Delphi, e.g. when converting to `double` or `float`. `BigDecimal` has many more bits, and I use the first bits that "fall off" (e.g. for double, the 54th, 55th and rest bits) as G, R and S bits. Since I can do several types of rounding (round up, round down, round to 0, round away from 0, round half up, etc.), this helped me a lot. And apparently, it produces proper results. I checked my results with Java's BigDecimal results. But these are generally used once you already converted from decimal to binary, with some extra bits to decide. – Rudy Velthuis Jul 04 '16 at 20:30
I would first have to convert 1e-250 to binary, but with a few extra bits, i.e. more than 53, say 64 bits. These extra 11 bits form the G (54th), R (55th) bits and the other bits (56th-64th) below that are used to get the S bit. Then I follow the formula to decide how to round the 53 bits I need for a double. So these bits are only used to round from higher precision to a lower one. The decimal-to-binary conversion is a different thing. – Rudy Velthuis Jul 04 '16 at 20:35
This is used to get the value ending on 1101 instead of 1100, by calculating a few bits more and rounding according to GRS. It is the rounding necessary to get the value closest to the desired value. I only referred to that, as comment, not as answer. How to do the decimal to binary is not answered. – Rudy Velthuis Jul 04 '16 at 20:40
@RudyVelthuis How do you decide what is enough for "a few bits more" in the conversion context? – Patricia Shanahan Jul 04 '16 at 21:37
I usually don't have to (I usually have plenty of bits -- hundreds -- more in my implementations), but generally, 10 bits are more than enough, in my experience. For a double, requiring 53 bits of significand, calculating 64 bits of significand before you round would be a nice choice. Note that I mainly use this for BigDecimal (which internally uses a binary BigInteger and a decimal scale), which has its own decimal-to-binary conversion routine and generates as many bits as required. I only round to 53 bits when I need to convert to double, and then use GRS bits. – Rudy Velthuis Jul 04 '16 at 21:47
For a straight conversion, you never need a guard bit, just round and sticky. Guard bits are only needed if you don't know what the position of the high-order bit before rounding is going to be *a priori*. – Stephen Canon Jul 05 '16 at 11:50
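
For readers following the comment thread, here is a minimal Python sketch of the round-and-sticky idea. It is not anyone's actual implementation from the discussion: the function name `round_nearest_even` is hypothetical, and the sketch assumes a positive input and ignores subnormals and overflow. It sidesteps the "how many extra bits are enough" question by keeping the discarded part as an exact rational, so the sticky bit is computed from the whole remainder.

```python
from fractions import Fraction

def round_nearest_even(value, precision=24):
    """Round a positive Fraction to `precision` significant bits, ties to even.

    Returns (significand, exponent) such that the rounded value equals
    significand * 2**exponent. Subnormals and overflow are not handled.
    """
    if value <= 0:
        raise ValueError("this sketch only handles positive values")
    # Scale by powers of two until the integer part has exactly `precision` bits.
    exponent = 0
    while value >= 2**precision:
        value /= 2
        exponent += 1
    while value < 2**(precision - 1):
        value *= 2
        exponent -= 1
    kept = value.numerator // value.denominator  # the bits that are kept
    rest = value - kept                          # what falls off, in [0, 1)
    round_bit = rest >= Fraction(1, 2)           # first discarded bit
    sticky = (rest * 2) % 1 != 0                 # any nonzero bit after it
    # Round up when above the halfway point, or exactly halfway and kept is odd.
    if round_bit and (sticky or kept % 2 == 1):
        kept += 1
        if kept == 2**precision:                 # carry out of the significand
            kept //= 2
            exponent += 1
    return kept, exponent

significand, exponent = round_nearest_even(Fraction(1, 10), 24)
print(f"{significand:024b}", exponent)
# 110011001100110011001101 -27
```

For 1/10 the round bit is 1 and the sticky bit is 1, so the significand is rounded up, which gives the pattern ending in 1101. Passing `precision=53` gives the binary64 rounding instead.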
