0

I am trying to take hex values stored as int and convert them to floating-point numbers using the IEEE 32-bit rules. I am specifically struggling with getting the right values for the mantissa and exponent. The values are read from a file, where they are stored as hex. I want the result to have four significant figures. Below is my code.

float floatizeMe(unsigned int myNumba ) {
    //// myNumba comes in as 32 bits or 8 byte  

    unsigned int  sign = (myNumba & 0x007fffff) >>31;
    unsigned int  exponent = ((myNumba & 0x7f800000) >> 23)- 0x7F;
    unsigned int  mantissa = (myNumba & 0x007fffff) ;
    float  value = 0;
    float mantissa2; 

    cout << endl<< "mantissa is : " << dec << mantissa << endl;

    unsigned    int m1 = mantissa & 0x00400000 >> 23;
    unsigned    int m2 = mantissa & 0x00200000 >> 22;
    unsigned    int m3 = mantissa & 0x00080000 >> 21;
    unsigned    int m4 = mantissa & 0x00040000 >> 20;

    mantissa2 = m1 * (2 ^ -1) + m2*(2 ^ -2) + m3*(2 ^ -3) + m4*(2 ^ -4);

    cout << "\nsign is: " << dec << sign << endl;
    cout << "exponent is : " << dec << exponent << endl;
    cout << "mantissa  2 is : " << dec << mantissa2 << endl;

    // if above this number it is negative 
    if ( sign  == 1)
        sign = -1; 

    // if above this number it is positive 
    else {
        sign = 1;
    }

    value = (-1^sign) * (1+mantissa2) * (2 ^ exponent);
    cout << dec << "Float value is: " << value << "\n\n\n";

    return value;
}




  int main()
{   
    ifstream myfile("input.txt");
    if (myfile.is_open())
    {
        unsigned int a, b,b1; // Hex 
        float c, d, e; // Dec
        int choice; 

        unsigned int ex1 = 0;
        unsigned int ex2 = 1;
        myfile >> std::hex;
        myfile >> a >> b ;
        floatizeMe(a);
        myfile.close();
    }
    return 0;
}

Sam Arnold
  • 45
  • 1
  • 1
  • 10

6 Answers

4

I suspect you mean for the ^ in

mantissa2 = m1 * (2 ^ -1) + m2*(2 ^ -2) + m3*(2 ^ -3) + m4*(2 ^ -4);

to mean "to the power of". There is no such operator in C or C++. The ^ operator is the bit-wise XOR operator.

Jfevold
  • 422
  • 2
  • 11
  • Thank you that was one issue. Another was that I was grabbing the 23 bit vs 22 bit. Between that and your help I am getting the right mantissa. – Sam Arnold Apr 30 '16 at 23:10
  • 1
    @SamArnold I wouldn't be upset if you accepted my answer, then... :) – Jfevold Apr 30 '16 at 23:18
4

Assuming your CPU follows the IEEE standard, you can also use a union. Something like this:

  union
  {
    int num;
    float fnum;
  } my_union;

Then store the integer values into my_union.num and read them as float by getting my_union.fnum.

polfosol ఠ_ఠ
  • 1,840
  • 26
  • 41
  • No, you can't. This is a common misconception. Bunging things in a union doesn't absolve you of aliasing rules. Once you've stored an integer into the union, _you can only read an integer back_. If you want to read a float from it, you have to store a float there first. – Lightness Races in Orbit Nov 06 '18 at 10:40
  • @LightnessRacesinOrbit I can and [I just did](https://wandbox.org/permlink/po2iMVRWoVnCDvjE). What's the problem – polfosol ఠ_ఠ Nov 06 '18 at 10:51
  • It merely appears to work but your program has undefined behaviour. Have a read of https://stackoverflow.com/a/11373277/560648. _Just because your program seems to give expected results does not mean that it is correct._ – Lightness Races in Orbit Nov 06 '18 at 10:56
  • @LightnessRacesinOrbit I have never heard about this. But anyway, I have never encountered any problem in dealing with unions so far. Actually I have worked on a project where a big part of it was dependent on storing some type in a union and getting data in another form and we've never faced any problems! Nonetheless, if the unions should work as the way you say, then I think they look pretty useless. Why such restrictions? – polfosol ఠ_ఠ Nov 06 '18 at 11:04
  • 1
    They're only "useless" if we accept as a premise that this is their use. It is not. Again, though, you're not alone: it's a common misconception and yes it is found in big projects, sadly. – Lightness Races in Orbit Nov 06 '18 at 12:21
  • https://stackoverflow.com/questions/1856468/how-to-output-ieee-754-format-integer-as-a-float – Mohammad Kanan Apr 04 '19 at 21:22
2

We needed to convert IEEE-754 single and double precision numbers (using 32bit and 64bit encoding). We were using a C compiler (Vector CANoe/Canalyzer CAPL Script) with a restricted set of functions and ended up developing the function below (it can easily be tested using any on-line C compiler):

#include <stdio.h>
#include <math.h>

/* Decodes normal numbers only; zero, subnormals, infinities and NaN are not handled. */
double ConvertNumberToFloat(unsigned long long number, int isDoublePrecision)
{
    int mantissaShift = isDoublePrecision ? 52 : 23;
    /* unsigned long long is at least 64 bits wide; unsigned long may be only 32. */
    unsigned long long exponentMask = isDoublePrecision ? 0x7FF0000000000000ULL : 0x7f800000ULL;
    int bias = isDoublePrecision ? 1023 : 127;
    int signShift = isDoublePrecision ? 63 : 31;

    int sign = (number >> signShift) & 0x01;
    int exponent = (int)((number & exponentMask) >> mantissaShift) - bias;

    /* Sum the fraction bits, most significant first: 2^-1, 2^-2, ... */
    int power = -1;
    double total = 0.0;
    for ( int i = 0; i < mantissaShift; i++ )
    {
        int calc = (number >> (mantissaShift-i-1)) & 0x01;
        total += calc * pow(2.0, power);
        power--;
    }
    /* total + 1.0 restores the implicit leading 1 of a normal number. */
    double value = (sign ? -1 : 1) * pow(2.0, exponent) * (total + 1.0);

    return value;
}

int main()
{
    // Single Precision 
    unsigned int singleValue = 0x40490FDB; // 3.141592...
    float singlePrecision = (float)ConvertNumberToFloat(singleValue, 0);
    printf("IEEE754 Single (from 32bit 0x%08X): %.7f\n",singleValue,singlePrecision);

    // Double Precision
    unsigned long long doubleValue = 0x400921FB54442D18ULL; // 3.141592653589793... 
    double doublePrecision = ConvertNumberToFloat(doubleValue, 1);
    printf("IEEE754 Double (from 64bit 0x%016llX): %.16f\n",doubleValue,doublePrecision);
    return 0;
}
Ken
  • 581
  • 1
  • 6
  • 5
1

Just do the following (but of course make sure you have the right endianness when reading bytes into the integer in the first place):

float int_bits_to_float(int32_t ieee754_bits) {
    float flt;
    *((int*) &flt) = ieee754_bits;
    return flt;
}

Works for me... this of course assumes that float has 32 bits, and is in IEEE754 format, on your architecture (which is almost always the case).

Luke Hutchison
  • 8,186
  • 2
  • 45
  • 40
  • A very simple and easy solution. Thanks! – Mobi Zaman Sep 27 '22 at 06:03
  • Although, could you maybe explain the line "*((int*) &flt) = ieee754_bits;" ? – Mobi Zaman Sep 27 '22 at 06:08
  • @MobiZaman sure: `&flt` takes the address of the local float on the stack; that is then cast to an int pointer `(int*)`; the pointer is then dereferenced using the first asterisk, which as an lvalue means "write into this location". So the int bits are written into the float on the stack, then that float is returned. – Luke Hutchison Sep 28 '22 at 07:44
  • But if we are storing the value into a location pointed to by an integer-casted pointer, how is the value stored as a float instead of an integer? – Mobi Zaman Sep 29 '22 at 13:53
  • 1
    @MobiZaman because memory is untyped. You can store anything you want in memory. And you can point to the same memory location with multiple different pointer types. Look up how unions work in C/C++, for example. – Luke Hutchison Sep 30 '22 at 18:13
0

There are a number of very basic errors in your code.

The most visible is repeatedly using ^ for "power of". ^ is the XOR-operator, and for "power" you must use the function pow(base, exponent) in math.h.

Next, "I want to have four significant figures" (presumably for the mantissa), but you only extract four bits. Four bits can encode only 0..15, which is about a digit-and-a-half. To get four significant digits, you'd need at least log(10,000)/log(2) ≈ 13.288, or at least 14 bits (but preferably 17, so you get one full extra digit to get better rounding).

You extract the wrong bit for sign, and then you use it the wrong way. Yes, if it is 0 then sign = 1 and if 1 then sign = -1, but you use it in the final calculation as

value = (-1^sign) * ...

(again with a ^, although even pow does not make any sense here). You ought to have used sign * .. straight away.

exponent was declared an unsigned int, but that fails for negative values. It needs to be signed for pow(2, exponent) (corrected from your (2 ^ exponent)).

On the positive side, (1+mantissa2) is indeed correct.

With all of those points taken together, and ignoring the fact that you actually ask for only 4 significant digits, I get the following code. Note that I rearranged the initial bit shifting and extracting for convenience – I shift mantissa to the left, rather than the right, so I can test against 0 in its calculation.

(Ah, I missed this!) Using sign straight away does not work because it was declared as an unsigned int. Therefore, where you think you give it the value -1, it actually gets the value 4294967295 (more precisely: the value of UINT_MAX from limits.h).

The easiest way to get rid of this is not multiplying by sign but only test it, and negate value if it is set.

float floatizeMe (unsigned int myNumba )
{
    //// myNumba comes in as 32 bits, i.e. 4 bytes

    unsigned int  sign = myNumba >>31;
    signed int  exponent = ((myNumba >> 23) & 0xff) - 0x7F;
    unsigned int  mantissa = myNumba << 9;
    float  value = 0;
    float mantissa2;

    cout << endl << "input is : " << hex << myNumba << endl;
    cout << endl << "mantissa is : " << hex << mantissa << endl;

    value = 0.5f;
    mantissa2 = 0.0f;
    while (mantissa)
    {
        if (mantissa & 0x80000000)
            mantissa2 += value;
        mantissa <<= 1;
        value *= 0.5f;
    }

    cout << "\nsign is: " << sign << endl;
    cout << "exponent is : " << hex << exponent << endl;
    cout << "mantissa 2 is : " << mantissa2 << endl;

    /* REMOVE:
       if above this number it is negative 
    if ( sign  == 1)
        sign = -1; 

    // if above this number it is positive 
    else {
        sign = 1;
    } */

    /* value = sign * (1.0f + mantissa2) * (pow (2, exponent)); */
    value = (1.0f + mantissa2) * (pow (2, exponent));
    if (sign) value = -value;
    cout << dec << "Float value is: " << value << "\n\n\n";

    return value;
}

With the above, you get correct results for values such as 0x3e4ccccd (0.2000000030) and 0x40490FDB (3.1415927410).

All said and done, if your input is already in IEEE-754 format (albeit in hex), then simply reinterpreting those 32 bits as a float, rather than converting the integer's numeric value, ought to be enough.

Jongware
  • 22,200
  • 8
  • 54
  • 100
  • Thank you Rad Lexus. This is one of my first C++ projects, so I was not aware of things such as ^ meaning bitwise XOR. You fixed a lot of the code. Unfortunately something is still off with the sign. I switched the sign to shift over 32 bits and it now has the right answer but the wrong sign for negative numbers. But other than that, thank you very much. – Sam Arnold May 01 '16 at 00:31
  • @SamArnold: you are correct in that negative numbers did not work with my code, apologies for not testing that. **However**, your suggested fix of "shift over 32 bits" is *dead wrong* – you'll end up with it always being zero again! See my 'oops' addition above for a correct fix. – Jongware May 01 '16 at 00:48
-1

As well as being much simpler, this also avoids any rounding/precision errors.

float value = reinterpret_cast<float&>(myNumba);

If you still want to inspect the parts separately, use the library function std::frexp afterwards. Or, if you don't like the type punning, at least use std::ldexp to apply the exponent rather than your explicit maths, which is vulnerable to rounding/precision errors and overflow.

An alternative to both of these is to use a union type, as described in this answer.

OrangeDog
  • 36,653
  • 12
  • 122
  • 207
  • This has undefined behaviour. You cannot alias an `int` as a `float`. You could `std::copy` the component bytes legally though (provided your input is guaranteed to be in a valid form). – Lightness Races in Orbit Nov 06 '18 at 10:39