13

I'm trying to convert an int into a custom float, in which the user specifies the number of bits reserved for the exponent and mantissa, but I don't understand how the conversion works. My function takes an int value and an int exp representing the number (value * 2^exp), e.g. value = 12, exp = 4 represents 192, but I don't understand the process I need to follow to convert these. I've been looking at this for days and playing with IEEE converter web apps, but I just don't understand what the normalization process is. I see that it's "move the binary point and adjust the exponent", but I have no idea what this means. Can anyone give me an example to go off of? I also don't understand what the exponent bias is. The only info I have is that you just add a number to your exponent, but I don't understand why. I've been searching Google for an example I can understand, but this just isn't making any sense to me.

Tommy K
  • 1,759
  • 3
  • 28
  • 51
  • It is the binary equivalent of 0.01 --> 1e-2. IOW: shift the mantissa right/left and add/subtract the count to the exponent. – wildplasser Mar 01 '15 at 23:36
  • 1
    If value is 12 and we have that in a binary value it's `00001100`. That needs to be shifted over to be `11000000 x 2^-4`, and then we forget about the leftmost bit (since it's "always" 1) and just say this is `[1]1000000 x 2^-4`. – U2EF1 Mar 01 '15 at 23:38
  • Can you clarify what you mean by "I don't understand the process I need to do to change these"? Do you mean you aren't sure how to change them when you do addition/multiplication? – eigenchris Mar 01 '15 at 23:42
  • @eigenchris Like how do I take decimal 12, and make it a normalized mantissa, then adjust the exp part accordingly – Tommy K Mar 01 '15 at 23:46
  • 1
    @U2EF1 so how do I know how many times it needs to be shifted over? Like if the user specifies 4 bits for the mantissa, and the value is 3, how do I know to shift 0011 over to 1000? Could I do something like get max_val = pow(2,)-1 then shift value(0011) right until value > max_val, and have a counter keep track of how many times I do this? – Tommy K Mar 01 '15 at 23:51

5 Answers

41

A floating point number is normalized when we force the integer part of its mantissa to be exactly 1 and allow its fraction part to be whatever we like.

For example, if we were to take the number 13.25, which is 1101.01 in binary, 1101 would be the integer part and 01 would be the fraction part.

I could represent 13.25 as 1101.01*(2^0), but this isn't normalized because the integer part is not 1. However, we are allowed to shift the mantissa to the right one digit if we increase the exponent by 1:

  1101.01*(2^0)
= 110.101*(2^1)
= 11.0101*(2^2)
= 1.10101*(2^3)

This representation 1.10101*(2^3) is the normalized form of 13.25.


That said, we know that normalized floating point numbers will always come in the form 1.fffffff * (2^exp)

For efficiency's sake, we don't bother storing the 1 integer part in the binary representation itself, we just pretend it's there. So if we were to give your custom-made float type 5 bits for the mantissa, we would know the bits 10100 would actually stand for 1.10100.

Here is an example with the standard 23-bit mantissa:

[image: worked example of a binary32 number showing the 23 stored mantissa bits and the implied leading 1]


As for the exponent bias, let's take a look at the standard 32-bit float format, which is broken into 3 parts: 1 sign bit, 8 exponent bits, and 23 mantissa bits:

s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm

The exponents 00000000 and 11111111 have special purposes (like representing Inf and NaN), so with 8 exponent bits, we could represent 254 different exponents, say 2^1 to 2^254, for example. But what if we want to represent 2^-3? How do we get negative exponents?

The format fixes this problem by automatically subtracting 127 from the exponent. Therefore:

  • 0000 0001 would be   1 - 127 = -126
  • 0010 1101 would be  45 - 127 =  -82
  • 0111 1111 would be 127 - 127 =    0
  • 1001 0010 would be 146 - 127 =   19

This changes the exponent range from 2^1 ... 2^254 to 2^-126 ... 2^+127 so we can represent negative exponents.

eigenchris
  • 5,791
  • 2
  • 21
  • 30
7

Tommy -- chux and eigenchris, along with the others, have provided excellent answers, but if I am reading your comments correctly, you still seem to be struggling with the nuts and bolts of "how would I take this info and use it to create a custom float representation where the user specifies the number of bits for the exponent?" Don't feel bad; it is as clear as mud the first dozen times you go through it. I think I can take a stab at clearing it up.

You are familiar with the IEEE754-Single-Precision-Floating-Point representation of:

IEEE-754 Single Precision Floating Point Representation of (13.25)

  0 1 0 0 0 0 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 |- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -|
 |s|      exp      |                  mantissa                   |

That is: a 1-bit sign bit, an 8-bit biased exponent (in excess-127 notation), and the remaining 23-bit mantissa.

When you allow the user to choose the number of bits in the exponent, you are going to have to rework the exponent notation to work with the new user-chosen limit.

What will that change?

  • Will it change the sign-bit handling -- No.

  • Will it change the mantissa handling -- No (you will still convert the mantissa/significand to "hidden bit" format).

So the only thing you need to focus on is exponent handling.

How would you approach this? Recall, the current 8-bit exponent is in what is called excess-127 notation (127, the largest value representable in 7 bits, allows any needed bias to be contained and expressed within the 8-bit field). If your user chooses 6 bits as the exponent size, then what? You will have to provide a similar method to ensure you have a fixed number to represent your new excess-## notation that will work within the user limit.

Take a 6-bit user limit; then a choice for the bias could be 31 (the largest value that can be represented in 5 bits). To that you apply the same logic (taking the 13.25 example above): the binary representation of the number is 1101.01, and you move the binary point 3 positions to the left to get 1.10101, which gives you an exponent of 3.

In your 6-bit exponent case you would add 3 + 31 to obtain your excess-31 notation for the exponent: 100010, then put the mantissa in "hidden bit" format (i.e. drop the leading 1 from 1.10101), resulting in your new custom Tommy Precision Representation:

IEEE-754 Tommy Precision Floating Point Representation of (13.25)

  0 1 0 0 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 |- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -|
 |s|    exp    |                    mantissa                     |

With a 1-bit sign bit, a 6-bit biased exponent (in excess-31 notation), and the remaining 25-bit mantissa.

The same rules apply in reverse to recover your floating-point number from the above notation (just using 31 instead of 127 to back the bias out of the exponent).

Hopefully this helps in some way. I don't see much else you can do if you are truly going to allow for a user-selected exponent size. Remember, the IEEE-754 standard wasn't something that was guessed at and a lot of good reasoning and trade-offs went into arriving at the 1-8-23 sign-exponent-mantissa layout. However, I think your exercise does a great job at requiring you to firmly understand the standard.

Not addressed in this discussion is the effect this would have on the range of numbers that could be represented in this Custom Precision Floating Point Representation. I haven't looked at it closely, but the primary limitation would seem to be a reduction in the MAX/MIN that could be represented.

David C. Rankin
  • 81,885
  • 6
  • 58
  • 85
6

"Normalization process" converts the inputs into a select range.

binary32 expects the significand (not mantissa) to be in the range 1.0 <= s < 2.0 unless the number has a minimum exponent.

Example:
value = 12, exp = 4 is the same as
value = 12/(2*2*2), exp = 4 + 3
value = 1.5, exp = 7

Since the significand always has a leading digit of 1 (unless the number has a minimum exponent), there is no need to store it. And rather than storing the exponent as 7, a bias of 127 is added to it.

value = 1.5 decimal --> 1.1000...000 binary --> 1000...000 stored (23 bits in all; the leading 1 is implied)
exp = 7 --> biased exp 7 + 127 --> 134 decimal --> 10000110 binary

The binary pattern stored is the concatenation of the "sign", the "biased exponent", and the "significand with its leading 1 bit implied":

0 10000110 1000...000 (1 + 8 + 23 = 32 bits)

When the biased exponent is 0 (the minimum value), the implied bit is 0 instead, so 0.0 and other very small (subnormal) numbers can be stored.

When the biased exponent is 255 (the maximum value), the data stored no longer represents finite numbers but "infinity" and Not-a-Number (NaN) values.

Check the referenced link for more details.

chux - Reinstate Monica
  • 143,097
  • 13
  • 135
  • 256
0

To answer a comment posted on 'how to do this in code': (Assuming it's an IEEE float)

A) Extract an unsigned 'exponent' and 'mantissa' from the IEEE float.

i) exp = 0x7F800000 & yourFloatVar;

//this takes bits b1 through b8 from the float (b0 is the sign bit; b9 through b31 are the mantissa)

ii) exp = exp >> 23; //shift right so the exponent field is right-aligned

iii) exp -= 127; //subtract the bias to recover the true exponent (127 is for 32-bit floats only)

iv) mantissa = 0x007FFFFF & yourFloatVar; //take last 23 bits from float

B) Normalizing

i)

if (mantissa != 0) //a zero mantissa has no leading 1 and would loop forever
{
    while( ((mantissa & 0xC0000000) != 0x80000000)
         &&((mantissa & 0xC0000000) != 0x40000000) )
    {
        mantissa = mantissa << 1; //shift the significand left...
        exp--;                    //...and compensate in the exponent
    }
    //when the loop exits, the float has been normalized
}

If the leading 2 bits aren't '01' or '10' (the normalized condition for a two's-complement fixed-point value), then shift the mantissa left and decrement the exponent.

I want to note that this isn't at all the most efficient algorithm for doing this; I just wanted to make the steps clear. Hope I didn't miss anything!

J. Doe
  • 21
  • 2
-1

To normalize a mantissa you place the binary point to the left of the leftmost non-zero digit,

for example

represent 10.11 base 2 in normalize form

= 0.1011 base 2 * 2 to the second power

The base of two is because you are working with binary numbers, and the power is positive 2 because you moved the binary point left two times. Remember that only 4 bits are used for the mantissa,

so the mantissa would be 1011

otboss
  • 621
  • 1
  • 7
  • 16
  • 1
    can you give a more concrete example on how this is done in code? Like I understand 3.1416 in binary would be 11.00100100001111... so I need to normalize it to 1.100100100001111... x 2^1 I get the abstract part but I dont understand how to actually implement this – Tommy K Mar 02 '15 at 00:17