
I know there are a number of ways to read every bit of an IEEE 754 float using existing libraries.

I don't want that; I want to be able to manually convert a decimal float to its binary representation according to IEEE 754.

I understand how IEEE 754 works and I am just trying to apply it.

I'm asking this question here just to see whether my approach is reasonable or silly, and I'm also wondering how a PC does it so quickly.


If I am given a decimal float in a string, I need to figure out what the exponent E is and what the mantissa M is.

  1. get the two parts out: integer part i and fraction part f.

  2. deal with f. I repeatedly multiply by 2, take the integer part of the result (either 0 or 1) as the next bit, remove that integer part, and repeat until the fraction becomes 0 (see the sketch after the example below).

  3. convert i to bits. This is easy: I just repeatedly take mod 2 and div 2 to get all the bits of i.

For example, to convert the f part of 0.390625:

0.390625 * 2 = 0.78125   bit 0
0.78125  * 2 = 1.5625    bit 1
0.5625   * 2 = 1.125     bit 1
0.125    * 2 = 0.25      bit 0
0.25     * 2 = 0.5       bit 0
0.5      * 2 = 1         bit 1
fraction is now 0, so stop

In this case, the temporary bits of 0.390625 are 0 1 1 0 0 1.
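To make steps 2 and 3 concrete, here is a minimal OCaml sketch of both conversions (the function names frac_bits and int_bits are mine; using an ordinary float for the fraction is for illustration only, since a real converter would need to work on the decimal digits exactly):

(* Bits of the fractional part, most significant first: repeatedly
   multiply by 2 and peel off the integer part.  Exact here because
   0.390625 is itself representable; in general the fraction should
   be kept in an exact (decimal or rational) form instead. *)
let frac_bits f =
  let rec go f acc =
    if f = 0.0 then List.rev acc
    else
      let f2 = f *. 2.0 in
      let bit = if f2 >= 1.0 then 1 else 0 in
      go (f2 -. float_of_int bit) (bit :: acc)
  in
  go f []

(* Bits of the integer part, most significant first, by repeated mod 2 / div 2. *)
let int_bits i =
  let rec go i acc = if i = 0 then acc else go (i / 2) ((i mod 2) :: acc) in
  if i = 0 then [0] else go i []

(* frac_bits 0.390625 = [0; 1; 1; 0; 0; 1]
   int_bits 25        = [1; 1; 0; 0; 1] *)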


Now, I have the bits for i and for f.

If all bits of i are 0, then I shift the bits of f left until the first 1 is dropped; that 1 corresponds to the implicit hidden 1 of M. That gives me M, and the shift count gives me E, taking E's bias (baseline) into account of course.

If i is not 0, then I concatenate both bit sequences and calculate how many right shifts are needed so that the concatenated value has a single leading 1; that shift count gives E (again adjusted by the bias).
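A rough OCaml sketch of that normalization step, reusing the bit lists from the sketch above (single precision assumed, bias 127; the name normalize is mine, and rounding, subnormals and overflow are deliberately ignored):

(* Takes the integer-part bits (most significant first) and the fraction
   bits, returns (biased exponent, mantissa bits with the hidden leading
   1 removed).  Sketch only: no rounding, no subnormal/overflow handling. *)
let normalize i_bits f_bits =
  let bias = 127 in
  match i_bits with
  | [] | [0] ->
      (* i = 0: skip leading zeros of f; each one lowers the exponent. *)
      let rec skip n = function
        | 0 :: rest -> skip (n + 1) rest
        | 1 :: rest -> (bias - 1 - n, rest)   (* first 1 becomes the hidden bit *)
        | _ -> (0, [])                        (* the value was exactly zero *)
      in
      skip 0 f_bits
  | 1 :: rest ->
      (* i <> 0: exponent = number of integer bits - 1; leading 1 is hidden. *)
      (bias + List.length rest, rest @ f_bits)
  | _ -> assert false

(* normalize (int_bits 0) (frac_bits 0.390625) = (125, [1; 0; 0; 1]),
   i.e. 0.390625 = 1.1001b * 2^-2 with biased exponent 125. *)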


I think my steps are correct, but the whole process feels very cumbersome.

Is there an easier and cleaner way?

How does a PC do it?

Jackson Tale
  • When you say "decimal float", do you actually mean "binary float"? IEEE 754 specifies both decimal *and* binary floating-point types, and it's not clear from your question which one you're talking about. – Mark Dickinson Apr 09 '14 at 12:14
  • Ah, re-reading, it's clearer: you want to convert a decimal *string* to an IEEE 754 binary float. Is that correct? – Mark Dickinson Apr 09 '14 at 12:15
  • @MarkDickinson yes, you are right, I improved my question a bit by adding it – Jackson Tale Apr 09 '14 at 12:20
  • 3
    Read http://www.exploringbinary.com/correct-decimal-to-floating-point-using-big-integers/ to see how you could do it yourself. If you are more interested in how existing libraries do the conversion fast, read the series on David M. Gay's strtod on the same blog. Regardless, there is nothing easy or clean in the conversion of, say, 1E100 to floating-point. It is difficult and ugly. – Pascal Cuoq Apr 09 '14 at 13:00
  • I use for my bignum libs dec string -> hex string conversion from there it is easy as pie (hex string is easy shift-able and or convertible to bin) look here: http://stackoverflow.com/a/18231860/2521214 – Spektre Apr 09 '14 at 14:01
  • @PascalCuoq: re "difficult and ugly": it seems to me that it's pretty straightforward so long as you don't care much about performance and you have a bigint library available. It's when either of those two things isn't true that it becomes difficult and ugly. – Mark Dickinson Apr 09 '14 at 16:55
  • @MarkDickinson For various reasons I implemented the bigint version from http://www.exploringbinary.com/correct-decimal-to-floating-point-using-big-integers/ , which wasn't as seamless as the pseudocode there. Here is a list of bugs I still had to fix even after the initial implementation: 1- double rounding of subnormals first to 52 bits of precision and then to their actual precision; 2- Parsing of 0.E-9999999999 on 32-bit machines; 3- Parsing of 0.E+9999999999 on 32-bit machines; 4- I still somehow managed to get the number half the smallest positive subnormal wrong for some reason I forget – Pascal Cuoq Apr 09 '14 at 17:03
  • 1
    @PascalCuoq: Okay, I accept that it's nontrivial to get the edge cases right. I've got [my own bigint-based implementation](https://github.com/mdickinson/quadfloat) too, and I guess I'm forgetting the amount of work and thought that had to go into it. It's not a lot of code, though. – Mark Dickinson Apr 09 '14 at 17:37
  • @PascalCuoq re: subnormals, I said in my article "To handle subnormal numbers, reduce the precision of the significand — to between 1 and 52 bits...". I meant to round once to the correct precision, not to doubly round. re: those other bugs, I assume those were in your translation to real code, and not bugs in my pseudocode? – Rick Regan Apr 15 '14 at 14:18
  • @RickRegan All the bugs were typical implementation bugs, because pseudocode is for reading and code is for executing. Problems 2- and 3- with exponents too large could occur with any bigint library whose shift/power functions take fixed-size integer arguments, a common convention. Problem 4- must have been an off-by-one error for the minimum value of the binary exponent. I am not saying that anything needs to be fixed in your blog post, just that a correct actual implementation needs to get many details right beyond the idea in your blog post. – Pascal Cuoq Apr 15 '14 at 15:07
  • @PascalCuoq Which is why I now prefer writing to coding :) – Rick Regan Apr 15 '14 at 16:46

2 Answers


I don't understand your treatment of the fraction. As shown, you are doing decimal fraction arithmetic, which would give correct results but introduces its own implementation difficulties. Doing binary fraction arithmetic instead would require first converting the fraction to a binary fraction, which is exactly the conversion you are trying to perform.

I think it might be simpler to work entirely in binary integers, though you would still need an extended form, such as BigInteger.

To do that, first note the number of digits after the decimal point, D. Convert the decimal digit string to an integer N, ignoring the decimal point. The value is N/10**D, using "**" to represent power. Calculate 10**D as a binary integer.

Calculate N/10**D by binary long division, stopping when you have F+2 significant bits in the result, where F is the number of fraction bits in your floating point format. Note the location of the binary point in this result.
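As a narrow illustration of that long division, here is an OCaml sketch for the pure-fraction case (0 < N < 10**D), using the third-party Zarith library for the big integers; the function name frac_division is mine. It produces the first n quotient bits after the binary point, plus a flag saying whether the remainder is non-zero, which serves as the sticky bit S described below:

(* Binary long division of num/den with 0 < num < den.  Each step
   doubles the numerator and emits 1 if it now reaches den (subtracting
   den), otherwise 0.  Returns (bits, remainder_nonzero). *)
let frac_division ~num ~den ~n =
  let rec go num i bits =
    if i = 0 then (List.rev bits, Z.sign num <> 0)
    else
      let num = Z.shift_left num 1 in
      if Z.geq num den then go (Z.sub num den) (i - 1) (1 :: bits)
      else go num (i - 1) (0 :: bits)
  in
  go num n []

(* frac_division ~num:(Z.of_int 390625) ~den:(Z.of_int 1000000) ~n:6
   = ([0; 1; 1; 0; 0; 1], false) *)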

The most significant one bit will not be stored if the number is in the normal range; it becomes the implicit hidden bit. To correctly round to F fraction bits, you need both the least significant of the F+2 bits, call it G, and also a bit S that is zero if, and only if, the remainder is zero. If G is 0, use the F fraction bits unchanged. If G and S are both one, you need to round up. If G is one and S is zero, the exact result is exactly halfway between two representable values, and you should round to even.
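That rounding decision is small enough to write down directly; a short OCaml sketch (the helper name and the 0/1 encoding of the flags are mine):

(* Round-to-nearest-even: given the least significant kept fraction bit,
   the guard bit g and the sticky bit s, return true if 1 must be added
   to the kept F bits. *)
let round_up ~last_kept_bit ~g ~s =
  if g = 0 then false            (* below halfway: keep the bits unchanged *)
  else if s = 1 then true        (* above halfway: round up *)
  else last_kept_bit = 1         (* exactly halfway: round to even *)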

Calculate the exponent from the position of the most significant bit relative to the binary point, after processing any carry-out due to rounding up. If the exponent is in range, you are done. If it is too big, return infinity of the appropriate sign. If it is too small, you need to denormalize. To get the rounding right, recompute G and S from the bits you are dropping and the old value of S.

Patricia Shanahan

See the files src/lib/floating_point.ml and src/lib/floating_point.mli in Frama-C. They implement the conversion from decimal representation to floating-point for single-precision and double-precision (you cannot obtain the former from the latter because of double rounding issues), without any external library. The files are covered by the LGPL 2.1. This implementation is the subject of a couple of blog posts starting at this one and continuing with this one.

This is probably close to the simplest conversion function one can write: I had no performance constraints and only aimed to keep the code as simple and as correct as possible, without depending on an existing library such as MPFR.

...
type parsed_float = {
  f_nearest : float ;
  f_lower : float ;
  f_upper : float ;
}

val single_precision_of_string: string -> parsed_float
val double_precision_of_string: string -> parsed_float
...
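For illustration, a hypothetical call (assuming the module is named Floating_point after the file and is linked into your program) might look like this:

(* Parse "0.1" and print the three fields of the result.  Hypothetical
   usage sketch, not taken from Frama-C itself. *)
let () =
  let p = Floating_point.double_precision_of_string "0.1" in
  Printf.printf "nearest: %.17g  lower: %.17g  upper: %.17g\n"
    p.Floating_point.f_nearest p.Floating_point.f_lower p.Floating_point.f_upper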
Pascal Cuoq