1

Working on a class assignment, I'm trying to cast an integer to a float only using bit manipulations (limited to any integer/unsigned operations incl. ||, &&. also if, while). My code is working for most values, but some values are not generating the results I'm looking for.

For example, if x is 0x807fffff, I get 0xceff0001, but the correct result should be 0xceff0000. I think I'm missing something with my mantissa and rounding, but can't quite pin it down. I've looked at some other threads on SO as well converting-int-to-float and how-to-manually

unsigned dl22(int x) {


    int tmin = 0x1 << 31;
    int tmax = ~tmin;

    unsigned signBit = 0;
    unsigned exponent;
    unsigned mantissa;
    int bias = 127;

    if (x == 0) {
        return 0;
    }

    if (x == tmin) {
        return 0xcf << 24;
    }

    if (x < 0) {
        signBit = x & tmin;
        x = (~x + 1);
    }


    exponent = bias + 31;

    while ( ( x & tmin) == 0 ) {
        exponent--;
        x <<= 1;
    }

    exponent <<= 23;
    int mantissaMask = ~(tmin >> 8);
    mantissa = (x >> 8) & mantissaMask;

    return (signBit | exponent | mantissa);
}

EDIT/UPDATE Found a viable solution - see below

Community
  • 1
  • 1
J Moonham
  • 35
  • 5
  • You have an odd definition of "using only bitwise operations". I see relational operations and arithmetic operations in your code, too. (And simple and compound assignment operations as well, but I don't suppose you meant to exclude assignments.) – John Bollinger Feb 02 '17 at 22:38
  • We're allowed to relational comparitors and integer and unsigned multiplication, but no casting (also allowed to use if/while loops) – J Moonham Feb 02 '17 at 22:44
  • 1
    Note: `0x1 << 31` is undefined behavior although it may work for you. – chux - Reinstate Monica Feb 02 '17 at 23:47
  • Interesting @chux, do you have any more details on why its undefined behavior, or the implications of it? – J Moonham Feb 03 '17 at 00:29
  • 1
    In C, `int` overflow is UB. 1 shifted 31 times left on a 32-bit `int` is like 2 to the 31st power. A value outside the range of `int`. A compiler is not oblige to generate reliable code. Could use `int tmin = INT_MIN;`. – chux - Reinstate Monica Feb 03 '17 at 00:44
  • To be clear, that's `int` and other *signed* integer overflow that's UB. Unsigned integer operations have well-defined overflow behavior. – John Bollinger Feb 03 '17 at 00:46
  • Note that you can compute `tmax` without overflow as `~0U >> 1` (assigned to an `int`). You can compute `tmin` from that via the `~` operator. This avoids UB and therefore is better form, but it's entirely possible that in practice, it yields exactly the same result (i.e. that UB isn't in practice causing you trouble). – John Bollinger Feb 03 '17 at 01:16
  • I tested your code [at ideone](http://ideone.com/3wTNvh), and it produced your expected result. It is plausible that your own results suffer from the undefined behavior in your code -- and that UB should be and can be fixed -- but I'm inclined to suspect that the problem you've asked about is somewhere other than in the function you've presented. I have also removed my answer focusing on rounding. That is still an issue for you, but it does not give rise to the particular problem highlighted by your example. – John Bollinger Feb 03 '17 at 01:53
  • @JohnBollinger, thanks, but I'm still not quite there yet. I think the solution you shared the first time might be what I need to figure out. I tried this code `int tmax = ~0U >> 1` and `int tmin = ~tmax`. Unfortunately, I'm still not getting expected results. For example, here is one of my latest test results **ERROR: Test (-2147483647[0x80000001]) failed...Gives -822083585[0xceffffff]. Should be -822083584[0xcf000000]**. I'm going to explore more into the rounding on the mantissa, because this seems to be prevalent with large numbers above |2^23|. I'm open to any other ideas. – J Moonham Feb 03 '17 at 19:23
  • @JMoonham, I have edited and undeleted my answer. I'm inclined to agree that your discrepancies arise from incorrect (with respect to expectation) rounding, but I can't say much more without knowing which rounding mode you're targeting. If you're after the default mode, however, then I do explain what that is. – John Bollinger Feb 03 '17 at 22:57

2 Answers2

0

Your code produces the expected output for me on the example you presented. As discussed in comments, however, from C's perspective it does exhibit undefined behavior -- not just in the computation of tmin, but also, for the same reason, in the loop wherein you compute the exponent. To whatever extent this code produces results that vary from environment to environment, that will follow either from the undefined behavior or from your assumption about the size of [unsigned] int being incorrect for the C implementation in use.

Nevertheless, if we assume (unsafely)

  1. that shifts of ints operate as if the left operand were reinterpreted as an unsigned int with the same bit pattern, operated upon, and the resulting bit pattern reinterpreted as an int, and
  2. that int and unsigned int are at least 32 bits wide,

then your code seems correct, modulo rounding.

In the event that the absolute value of the input int has more than 24 significant binary digits (i.e. it is at least 224), however, some precision will be lost in the conversion. In that case the correct result will depend on the FP rounding mode you intend to implement. An incorrectly rounded result will be off by 1 unit in the last place; how many results that affects depends on the rounding mode.

Simply truncating / shifting off the extra bits as you do yields round toward zero mode. That's one of the standard rounding modes, but not the default. The default rounding mode is to round to the nearest representable number, with ties being resolved in favor of the result having least-significant bit 0 (round to even); there are also three other standard modes. To implement any mode other than round-toward-zero, you'll need to capture the 8 least-significant bits of the significand after scaling and before shifting them off. These, together with other details depending on the chosen rounding mode, will determine how to apply the correct rounding.

About half of the 32-bit two's complement numbers will be rounded differently when converted in round-to-zero mode than when converted in any one of the other modes; which numbers exhibit a discrepancy depends on which rounding mode you consider.

John Bollinger
  • 160,171
  • 8
  • 81
  • 157
  • I think this gets right to the point, thanks again for all the guidance! I realize now I forgot to include this code I am trying to replicate which us an unsigned to float:: `float u2f(unsigned u) { union { unsigned u; float f; } a; a.u = u; return a.f; }` – J Moonham Feb 03 '17 at 23:16
0

I didn't originally mention that I am trying to imitate a U2F union statement:

float u2f(unsigned u) {
  union {
    unsigned u;
    float f;
  } a;
  a.u = u;
  return a.f;
}

Thanks to guidance provided in the postieee-754-bit-manipulation-rounding-error I was able to manage the rounding issues by putting the following after my while statement. This clarified the rounding that was occurring.

lsb = (x >> 8) & 1;
roundBit = (x >> 7) & 1;
stickyBitFlag = !!(x & 0x7F);

exponent <<= 23;

int mantissaMask = ~(tmin >> 8);
mantissa = (x >> 8);
mantissa &= mantissaMask;

roundBit = (roundBit & stickyBitFlag) | (roundBit & lsb);

return (signBit | exponent | mantissa) + roundBit;
Community
  • 1
  • 1
J Moonham
  • 35
  • 5