Converting float to an int (float2int) using only bitwise manipulation

Question

I am wondering if someone could set me in the right direction with a problem I am working on. I am trying to do what the following C function does using only ARM assembly and bit manipulation:

int float2int(float x) {
return (int) x;
}

I have already coded the reverse of this (int2float) without many issues. I'm just unsure of where to start with this new problem.

For example:

3 (int) = 0x40400000 (float) 
0011 = 0 10000000 10000000000000000000000

Where 0 is the Sign Bit, 10000000 is the exponent, and 10000000000000000000000 is the mantissa/fraction.
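
As a starting point, the field split described above can be sketched in C. This is only an illustration: it assumes `float` is a 32-bit IEEE-754 single, and the names `FloatFields`, `float_bits`, and `split_float` are made up for this example rather than taken from the thread.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the three fields of a 32-bit float. */
typedef struct {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 8 bits, still biased by 127 */
    uint32_t fraction;  /* 23 bits, implied leading 1 not included */
} FloatFields;

/* Copy the float's bytes into an integer; no numeric conversion happens. */
static uint32_t float_bits(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* Separate sign, biased exponent, and fraction with shifts and masks. */
static FloatFields split_float(float x) {
    uint32_t bits = float_bits(x);
    FloatFields f;
    f.sign     = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.fraction = bits & 0x7FFFFFu;
    return f;
}
```

For 3.0f the bit pattern is 0x40400000, which splits into sign 0, exponent 128, and fraction 0x400000. Each step maps to a shift or an AND in ARM assembly.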

Can someone simply point me in the right direction with this problem? Even a C pseudocode representation would be helpful. I know I need to extract the sign bit, extract the exponent and reverse the bias (127) and also extract the fraction but I just have no idea where to begin.

There is also the issue of what to do if the float cannot be represented as an integer (because it overflows or is a NaN).

Any help would be appreciated!

Here is the algorithm: http://stackoverflow.com/a/12343933/1707253 — turnt, Dec 02 '13 at 00:53
Do you have access to VFP or NEON? This can be done with VCVT in one instruction. — Rob Napier, Dec 02 '13 at 00:58
As a side note, it's fairly trivial to figure out how to implement this for your setup. Compile the above function with a C compiler; look at the assembler output. — Rob Napier, Dec 02 '13 at 01:00
@RobNapier: The compiler will likely generate a single instruction. — Siyuan Ren, Dec 02 '13 at 01:07
Yes it will (and does). Which if all he wants is "using only ARM assembly" (rather than requiring the actual algorithm) is perfect. (Though I believe it will only be one instruction if you have access to a VFP or NEON.) — Rob Napier, Dec 02 '13 at 01:10
Hmmm, no access to VFP or NEON. I am already aware that I can compile the C function with -S (ARM assembly output); this usually produces some strange results, however, and will probably call ARM's built-in floating-point library routines. I am trying to do this using only bit-by-bit manipulation. So far I am starting by separating the sign bit, exponent, and mantissa from the input float into different registers. — 0000101010, Dec 02 '13 at 01:15
Convert to `double`, add `2^53 + 2^52`, and grab the low 32 bits. (This will give you some slightly funky rounding which you can almost-fix by adding 0.5 beforehand.) — tmyklebu, Dec 02 '13 at 06:16

chux - Reinstate Monica · Accepted Answer · 2013-12-02T01:11:34.410

// Assume int can hold all the precision of a float.
int float2int(float x) {
  int Sign = f_SignRawBit(x);
  unsigned Mantissa = f_RawMantissaBits(x);  // 0 - 0x7FFFFF
  int Expo = f_RawExpoBits(x); // 0 - 255
  // Form correct exponent and mantissa
  if (Expo == EXPO_MAX) {
    Handle_NAN_INF();
  }
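  else if (Expo == EXPO_MIN) {
    // Subnormal: no implied bit. BIAS has to be negative (-127) for this to work.
    Expo += BIAS + 1 - MantissaOffset /* 23 */;
  }
  else {
    // Normal number: with BIAS == -127 this leaves Expo - 150.
    Expo += BIAS - MantissaOffset /* 23 */;
    Mantissa |= ImpliedBit;
  }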
  while (Expo > 0) {
    Expo--;
    // Add code to detect overflow
    Mantissa *= 2;
  }
  while (Expo < 0) {
    Expo++;
    // Add code to note last shifted out bit
    // Add code to note if any non-zero bit shifted out
    Mantissa /= 2;
  }

  // Add rounding code depending on `last shifted out bit` and `non-zero bit shifted out`.  Not needed when rounding toward 0, which is what the (int) cast in C does.

  // Add code to detect over/under flow in the following
  if (Sign) {
    return -Mantissa;
  }
  return Mantissa;
}
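
A self-contained variant of the outline above might look like the following. It is a sketch under several assumptions, not the answer's own code: `float` is taken to be a 32-bit IEEE-754 single, the result truncates toward zero like the C cast, NaN returns 0, and out-of-range values saturate. The last two behaviours and the name `float2int_bits` are choices made for this example. The two shift loops are collapsed into a single variable shift.

```c
#include <stdint.h>
#include <string.h>

/* Truncating float-to-int32 conversion using only integer operations. */
static int32_t float2int_bits(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);           /* raw bit pattern of x */

    uint32_t sign = bits >> 31;
    int32_t  expo = (int32_t)((bits >> 23) & 0xFFu);
    uint32_t mant = bits & 0x7FFFFFu;

    if (expo == 255) {                        /* NaN or infinity */
        if (mant != 0) return 0;              /* NaN: 0 is an arbitrary choice */
        return sign ? INT32_MIN : INT32_MAX;
    }
    if (expo < 127) return 0;                 /* |x| < 1, incl. zero and subnormals */

    mant |= 0x800000u;                        /* restore the implied leading 1 */
    int32_t shift = expo - 150;               /* remove bias (127) and 23 fraction bits */

    if (shift > 7) {                          /* |x| >= 2^31: does not fit, saturate */
        return sign ? INT32_MIN : INT32_MAX;
    }
    /* Here shift is in -23..7, so both shift counts are valid. */
    uint32_t mag = shift >= 0 ? mant << shift : mant >> -shift;
    return sign ? -(int32_t)mag : (int32_t)mag;
}
```

With these choices, float2int_bits(3.0f) gives 3 and float2int_bits(-234.75f) gives -234. In ARM assembly the variable shifts become LSL and LSR with a register operand.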

This is helpful, thanks. I'll try to apply some of this algorithm to my ARM assembly interpretation. — 0000101010, Dec 02 '13 at 01:16
@user3055632 Up-vote the posts you find useful. Accept one that meets your needs. — chux - Reinstate Monica, Dec 02 '13 at 03:57

score 1 · Answer 2 · answered Dec 02 '13 at 01:11

Start with a number whose mantissa is your number (00000011 in your example) and whose exponent is 01111111 (127, which is how 0 is stored in excess-127). Count the bits from the LSb up to, but not including, the most significant set bit. For each bit counted, add 1 to the exponent.

In your example there is only one bit between the LSb and the most significant set bit, so 1 is added to the exponent, giving 128 (10000000).

Shift your mantissa (your original number) left until the left-most set bit is lost. The shift must be performed in a variable capable of holding at least 23 bits. In your example, the original mantissa is 00000000000000000000011. Shifting it left until the left-most '1' is lost gives 10000000000000000000000.

About the sign: if the original number is in two's complement, simply take the MSb, and that will be your sign. In your example, 0 (positive).

So your result will be:
Sign: 0
Exponent: 10000000
Mantissa: 10000000000000000000000

Another example: convert the short int number -234 into a float. In two's complement, -234 is stored as 1111111100010110 (16 bits).

From here it's easy to get the sign: 1 (the MSb)

We must work with the absolute magnitude, so we complement the number to get the positive (magnitude) version. We can do it by xoring it with 1111111111111111, then adding 1. This gives us 0000000011101010 (234)

Initial mantissa (using 23 bits): 00000000000000011101010
Initial exponent: 01111111 (127)

Count the bits from the LSb up to the left-most set bit, without including it. There are 7 bits. We add this to our exponent: 127+7=134 = 10000110

The mantissa is shifted left until the left-most set bit is gone. This gives us: 11010100000000000000000

Our number will be: 1 10000110 11010100000000000000000
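
The int-to-float steps described in this answer can be sketched in C for a 16-bit input, as in the -234 example. This is an illustration only: the function name `int16_to_float_bits` is invented here, the result is assumed to be an IEEE-754 single-precision bit pattern, and the special case for zero is an addition the answer does not cover.

```c
#include <stdint.h>

/* Build the 32-bit float bit pattern for a 16-bit signed integer. */
static uint32_t int16_to_float_bits(int16_t v) {
    if (v == 0) return 0;                     /* zero has no leading 1 to find */

    uint32_t sign = v < 0;                    /* MSb of the two's complement value */
    uint32_t mag  = sign ? (uint32_t)(-(int32_t)v) : (uint32_t)v;

    /* Count the bits below the most significant set bit. */
    uint32_t count = 0;
    while ((mag >> (count + 1)) != 0) count++;

    uint32_t exponent = 127 + count;          /* excess-127 exponent */

    /* Move the leading 1 up to bit 23, then mask it off (it is implied). */
    uint32_t fraction = (mag << (23 - count)) & 0x7FFFFFu;

    return (sign << 31) | (exponent << 23) | fraction;
}
```

For -234 this yields 0xC36A0000, which is 1 10000110 11010100000000000000000, and for 3 it yields 0x40400000. Because a 16-bit magnitude always fits in the 24-bit mantissa, no rounding is needed in this sketch.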

Sorry if I confused you but it looks like you are describing how to convert an int to a float. I have already done this and now I am trying to convert the float value back to an int! Thanks for the response though — 0000101010, Dec 02 '13 at 01:30

score 1 · Answer 3 · answered Oct 30 '14 at 00:31

Here is some basic C++ code to do the conversion.

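#include <stdint.h>
#include <string.h>

int32_t Float32ToInt24( const float & x )
{
    uint32_t bits;
    memcpy( &bits, &x, sizeof bits );  // copy the float's bit pattern without type punning

    uint8_t  negative = ((bits >> 31) & 0x1        )            ; // 0x80000000
    uint8_t  exponent = ((bits >> 23) &  0xFF      )            ; // 0x7F800000
    uint32_t mantissa = ((bits >>  0) &    0x7FFFFF) | 0x800000 ; // 0x007FFFFF plus implicit bit

    if( exponent < 127 )   // |x| < 1, including zero and subnormals
        return 0;
    if( exponent > 150 )   // |x| >= 2^24, infinity or NaN: out of range for this function
        return negative ? -0xFFFFFF : 0xFFFFFF;  // clamping is a choice made here

    // The shift count is now guaranteed to be in 0..23.
    int32_t i = mantissa >> (23 - (exponent - 127));
    if( negative )
        return -i;
    else
        return  i;
}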

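Note: Floats with a magnitude of 2^24 or more are NOT converted properly by this function, because it only ever shifts the 24-bit mantissa to the right. Above 2^24 a float no longer has enough precision to represent every integer, i.e. adding the two floats 16777216.0 + 1.0 will have no effect!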
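
The precision limit in the note can be checked with a small C sketch. It assumes 32-bit IEEE-754 floats with the default round-to-nearest mode. The helper names `next_float_up` and `add_one_is_lost` are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Next representable float above a positive finite x: add 1 to its bit pattern. */
static float next_float_up(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits += 1;
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Returns 1 if adding 1.0f to 2^24 leaves the value unchanged. */
static int add_one_is_lost(void) {
    volatile float big = 16777216.0f;   /* 2^24; volatile forces float storage */
    volatile float sum = big + 1.0f;    /* exact result 2^24 + 1 is not representable */
    return sum == big;
}
```

Here next_float_up(16777216.0f) is 16777218.0f, so neighbouring floats are 2 apart from 2^24 upward and the sum rounds back to 16777216.0f.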
