Converting Int to Float or Float to Int using Bitwise operations (software floating point)

Question

I was wondering if you could help explain the process on converting an integer to float, or a float to an integer. For my class, we are to do this using only bitwise operators, but I think a firm understanding on the casting from type to type will help me more in this stage.

From what I know so far, for int to float, you will have to convert the integer into binary, normalize the value of the integer by finding the significand, exponent, and fraction, and then output the value in float from there?

As for float to int, you will have to separate the value into the significand, exponent, and fraction, and then reverse the instructions above to get an int value?

I tried to follow the instructions from this question: Casting float to int (bitwise) in C.
But I was not really able to understand it.

Also, could someone explain why rounding will be necessary for values greater than 23 bits when converting int to float?

[Casting float to int (bitwise) in C](//stackoverflow.com/q/12342926) is *type-punning* the `binary32` bit-pattern to an `unsigned int`, *not* converting (with rounding) to the nearest integer. Type-punning would be the first step in implementing software floating-point that shifts the mantissa by the exponent to line up the decimal point in the right place. (Actually the radix point, because this is base 2 not base 10, so the word "decimal" is wrong.) — Peter Cordes, Mar 14 '19 at 23:54
duplicates: [How to manually (bitwise) perform (float)x?](https://stackoverflow.com/q/12336314/995714), [How to convert an unsigned int to a float?](https://stackoverflow.com/q/19529356/995714), [Casting float to int (bitwise) in C](https://stackoverflow.com/q/12342926/995714) — phuclv, May 15 '19 at 04:15

score 20 · Accepted Answer · edited Mar 14 '19 at 18:48

First, a paper you should consider reading, if you want to understand floating point foibles better: "What Every Computer Scientist Should Know About Floating Point Arithmetic," http://www.validlab.com/goldberg/paper.pdf

And now to some meat.

The following code is bare bones, and attempts to produce an IEEE-754 single precision float from an unsigned int in the range 0 < value < 2²⁴. That's the format you're most likely to encounter on modern hardware, and it's the format you seem to reference in your original question.

IEEE-754 single-precision floats are divided into three fields: A single sign bit, 8 bits of exponent, and 23 bits of significand (sometimes called a mantissa). IEEE-754 uses a hidden 1 significand, meaning that the significand is actually 24 bits total. The bits are packed left to right, with the sign bit in bit 31, exponent in bits 30 .. 23, and the significand in bits 22 .. 0. The following diagram from Wikipedia illustrates:

floating point format

The exponent has a bias of 127, meaning that the actual exponent associated with the floating point number is 127 less than the value stored in the exponent field. An exponent of 0 therefore would be encoded as 127.

(Note: The full Wikipedia article may be interesting to you. Ref: http://en.wikipedia.org/wiki/Single_precision_floating-point_format )

Therefore, the IEEE-754 number 0x40000000 is interpreted as follows:

Bit 31 = 0: Positive value
Bits 30 .. 23 = 0x80: Exponent = 128 - 127 = 1 (aka. 2¹)
Bits 22 .. 0 are all 0: Significand = 1.00000000_00000000_0000000. (Note I restored the hidden 1).

So the value is 1.0 x 2¹ = 2.0.

To convert an unsigned int in the limited range given above, then, to something in IEEE-754 format, you might use a function like the one below. It takes the following steps:

Aligns the leading 1 of the integer to the position of the hidden 1 in the floating point representation.
While aligning the integer, records the total number of shifts made.
Masks away the hidden 1.
Using the number of shifts made, computes the exponent and appends it to the number.
Using reinterpret_cast, converts the resulting bit-pattern to a float. This part is an ugly hack, because it uses a type-punned pointer. You could also do this by abusing a union. Some platforms provide an intrinsic operation (such as _itof) to make this reinterpretation less ugly.

There are much faster ways to do this; this one is meant to be pedagogically useful, if not super efficient:

float uint_to_float(unsigned int significand)
{
    // Only support 0 < significand < 1 << 24.
    if (significand == 0 || significand >= 1 << 24)
        return -1.0;  // or abort(); or whatever you'd like here.

    int shifts = 0;

    //  Align the leading 1 of the significand to the hidden-1 
    //  position.  Count the number of shifts required.
    while ((significand & (1 << 23)) == 0)
    {
        significand <<= 1;
        shifts++;
    }

    //  The number 1.0 has an exponent of 0, and would need to be
    //  shifted left 23 times.  The number 2.0, however, has an
    //  exponent of 1 and needs to be shifted left only 22 times.
    //  Therefore, the exponent should be (23 - shifts).  IEEE-754
    //  format requires a bias of 127, though, so the exponent field
    //  is given by the following expression:
    unsigned int exponent = 127 + 23 - shifts;

    //  Now merge significand and exponent.  Be sure to strip away
    //  the hidden 1 in the significand.
    unsigned int merged = (exponent << 23) | (significand & 0x7FFFFF);


    //  Reinterpret as a float and return.  This is an evil hack.
    return *reinterpret_cast< float* >( &merged );
}

You can make this process more efficient using functions that detect the leading 1 in a number. (These sometimes go by names like clz for "count leading zeros", or norm for "normalize".)

You can also extend this to signed numbers by recording the sign, taking the absolute value of the integer, performing the steps above, and then putting the sign into bit 31 of the number.

For integers >= 2²⁴, the entire integer does not fit into the significand field of the 32-bit float format. This is why you need to "round": You lose LSBs in order to make the value fit. Thus, multiple integers will end up mapping to the same floating point pattern. The exact mapping depends on the rounding mode (round toward -Inf, round toward +Inf, round toward zero, round toward nearest even). But the fact of the matter is you can't shove 24 bits into fewer than 24 bits without some loss.

You can see this in terms of the code above. It works by aligning the leading 1 to the hidden 1 position. If a value was >= 2²⁴, the code would need to shift right, not left, and that necessarily shifts LSBs away. Rounding modes just tell you how to handle the bits shifted away.

Hopefully, there's enough information there to help you reverse the process as well. :-) — Joe Z, Dec 01 '13 at 02:36
There definitely was :) Especially in our chat on the other question. You've helped me a great deal, thanks again Joe :) — Andrew T, Dec 01 '13 at 18:35
Hey Joe, I had one more question for you. The plus side is, I believe everything up to the point where the significand is bitwise-anded with 0x7FFFFF works! So thanks very much for your help so far :) However, I am getting this message when I'm trying to and the significand with 0x7FFFFF "Immediate 0x007FFFFF cannot be represented by 0-255 shifted left by 0-23 or duplicated in all, odd or even bytes" So do you think it is possible that I can strip the 23rd bit off another way? — Andrew T, Dec 02 '13 at 01:57
Ah, that's an ARM assembly limitation for constants in instructions. You'll have to LDR it from a constant pool, or use a different instruction. All you really need is to clear bit 23, so BIC might be a reasonable choice. (ie. `BIC ..., #0x00800000`). I haven't coded much ARM assembly lately, but I _think_ that's valid. — Joe Z, Dec 02 '13 at 02:03
I've always been a little confused when loading hex values in ARM! But I used: "ldr r6, =0x7FFFFF", "AND r0, r6", where r0 is the significand. I believe that would work...or at least I hope so. And I also believe you are correct for the bitclear as well. When I was stepping through the program, my instruction: "ldr r1, =1", "lsl r1, 23" also turned out to be 0x00800000 :) — Andrew T, Dec 02 '13 at 02:09
Addendum some years later: The type pun in the last step should instead be handled with a memcpy() in C++ prior to C++20, and all versions of C. In C++20, there's bit_cast<>. — Joe Z, Aug 18 '20 at 01:38

score 2 · Answer 2 · answered Nov 30 '13 at 17:35

Have you checked the IEEE 754 floating-point representation?

In 32-bit normalized form, it has (mantissa's) sign bit, 8-bit exponent (excess-127, I think) and 23-bit mantissa in "decimal" except that the "0." is dropped (always in that form) and the radix is 2, not 10. That is: the MSB value is 1/2, the next bit 1/4 and so on.

score 2 · Answer 3 · edited Mar 15 '19 at 06:50

Joe Z's answer is elegant but range of input values is highly limited. 32 bit float can store all integer values from the following range:

[-2²⁴...+2²⁴] = [-16777216...+16777216]

and some other values outside this range.

The whole range would be covered by this:

float int2float(int value)
{
    // handles all values from [-2^24...2^24]
    // outside this range only some integers may be represented exactly
    // this method will use truncation 'rounding mode' during conversion

    // we can safely reinterpret it as 0.0
    if (value == 0) return 0.0;

    if (value == (1U<<31)) // ie -2^31
    {
        // -(-2^31) = -2^31 so we'll not be able to handle it below - use const
        // value = 0xCF000000;
        return (float)INT_MIN;  // *((float*)&value); is undefined behaviour
    }

    int sign = 0;

    // handle negative values
    if (value < 0)
    {
        sign = 1U << 31;
        value = -value;
    }

    // although right shift of signed is undefined - all compilers (that I know) do
    // arithmetic shift (copies sign into MSB) is what I prefer here
    // hence using unsigned abs_value_copy for shift
    unsigned int abs_value_copy = value;

    // find leading one
    int bit_num = 31;
    int shift_count = 0;

    for(; bit_num > 0; bit_num--)
    {
        if (abs_value_copy & (1U<<bit_num))
        {
            if (bit_num >= 23)
            {
                // need to shift right
                shift_count = bit_num - 23;
                abs_value_copy >>= shift_count;
            }
            else
            {
                // need to shift left
                shift_count = 23 - bit_num;
                abs_value_copy <<= shift_count;
            }
            break;
        }
    }

    // exponent is biased by 127
    int exp = bit_num + 127;

    // clear leading 1 (bit #23) (it will implicitly be there but not stored)
    int coeff = abs_value_copy & ~(1<<23);

    // move exp to the right place
    exp <<= 23;

    union
    {
        int rint;
        float rfloat;
    }ret = { sign | exp | coeff };

    return ret.rfloat;
}

Of course there are other means to find abs value of int (branchless). Similarly couting leading zeros can also be done without a branch so treat this example as example ;-).

`return *((float*)&ret);` is undefined behaviour in C (a few compilers like MSVC do define the behaviour, but others will potentially fail). Use `memcpy` or a union to type-pun. — Peter Cordes, Mar 15 '19 at 00:00
I was hoping you'd fix your answer, and letting future readers know. BTW, using `unsigned int` (or better in this case `uint32_t`) to shift / OR FP bit-patterns would also be a good idea. A left shift of a signed integer that changes the sign bit is technically signed-overflow UB, I think. (You don't actually do that, and there are enough other assumptions of 2's complement and 32-bit `int` that it maybe doesn't matter.) — Peter Cordes, Mar 15 '19 at 05:23
Also, a right-shift of a signed-value is implementation-defined, not undefined. — S.S. Anne, Mar 08 '20 at 00:20
@PeterCordes You do realize that `return (float)INT_MIN;` is pointless because this is converting `int` to `float`? Also, it can be handled in the code if the value is `unsigned`. — S.S. Anne, Mar 17 '20 at 01:55
It's not pointless, it's returning the right FP value, exactly like `0.0` is using an FP constant to return the all-zero bit-pattern. Both require the compiler to know how to produce FP bit patterns at compile time. Maybe you'd be happier if it was written as `-2147483648.0f` instead of including possibly-runtime int->float conversion and having a circular dependency? Comments still show the actual bit-pattern. — Peter Cordes, Mar 17 '20 at 04:37

Converting Int to Float or Float to Int using Bitwise operations (software floating point)

3 Answers3

Linked

Related