Rounding point issues when converting to float bitwise

Question

I am working on a homework assignment where we are supposed to convert an int to a float via bitwise operations. The following code works, except when the value needs rounding. My function seems to always round down (it truncates), but in some cases it should round up.

For example, 0x80000001 should be represented as 0xcf000000 (exponent 31, mantissa 0), but my function returns 0xceffffff (exponent 30, mantissa 0xffffff).

I am not sure how to fix these rounding issues. What steps should I take to make this work?

unsigned float_i2f(int x) {
  if(x==0) return 0;
  int sign = 0;
  if(x<0) {
    sign = 1<<31;
    x = -x;
  }
  unsigned y = x;
  unsigned exp = 31;
  while ((y & 0x80000000) == 0)
  {
    exp--;
    y <<= 1;
  }
  unsigned mantissa = y >> 8;

  return sign | ((exp+127) << 23) | (mantissa & 0x7fffff);
}

Possible duplicate of this, but the question is not properly answered.

1 << 31 is undefined behaviour. I would recommend very strongly to avoid shift operations with signed operands if at all possible. — gnasher729, Sep 08 '14 at 21:39
x = -x is undefined behaviour if x = INT_MIN and INT_MIN != - INT_MAX. — gnasher729, Sep 08 '14 at 21:40
1<<31 is not undefined behaviour, it is simply 1 with 31 zeros following, or in other words, INT_MAX. You might be confusing it with 1<<32, which is in fact undefined. — jamiees2, Sep 08 '14 at 21:41
That's true, I have added a special if to check for that case. However, the question is how to fix rounding errors, which that does not fix. — jamiees2, Sep 08 '14 at 21:43
James, 1 << n should produce the value 2^n. If that value is not representable as an int the behaviour is undefined. 2^31 cannot be represented in an int, therefore undefined behaviour. And really, INT_MAX isn't 2^31. It's usually 2^31 - 1. And since I added this as a comment and not an answer, why are you complaining that it isn't an answer? — gnasher729, Sep 08 '14 at 21:47
I am sorry, I meant to say INT_MIN. I'm also sorry for the harsh response, I just spent 4 hours on this problem and am at my wits' end. Thanks for a calm response though. — jamiees2, Sep 08 '14 at 21:51
To avoid the undefined behavior just declare `sign` unsigned like the other bit fields. — starblue, Sep 09 '14 at 05:40

score 3 · Accepted Answer · answered Sep 08 '14 at 21:45

You are ignoring the lowest 8 bits of y when you calculate mantissa, so the result is always truncated.

The usual rule is called "round to nearest, ties to even": If the lowest 8 bits of y are > 0x80, increase mantissa by 1. If the lowest 8 bits of y are = 0x80 and bit 8 of y is 1 (counting from bit 0, so this is the lowest bit of mantissa), increase mantissa by 1. In either case, if mantissa becomes >= 0x1000000, shift mantissa to the right by one and increase the exponent by one.
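
The rule above could be applied to the function from the question roughly as follows. This is only a sketch, assuming 32-bit int and unsigned; the name float_i2f_rounded and the variable dropped are illustrative and not from the original post. It also makes sign unsigned and negates in unsigned arithmetic, as suggested in the comments, to stay clear of undefined behaviour.

```c
/* Sketch: convert an int to the bit pattern of an IEEE 754 single,
 * rounding to nearest with ties to even.
 * Assumes int and unsigned are 32 bits wide. */
unsigned float_i2f_rounded(int x) {
    if (x == 0) return 0;

    unsigned sign = 0;
    unsigned y = (unsigned)x;
    if (x < 0) {
        sign = 0x80000000u;
        y = 0u - y;              /* unsigned negation, defined even for INT_MIN */
    }

    /* Normalise: shift left until the top bit is set. */
    unsigned exp = 31;
    while ((y & 0x80000000u) == 0) {
        exp--;
        y <<= 1;
    }

    unsigned mantissa = y >> 8;  /* 24 bits, including the implicit leading 1 */
    unsigned dropped = y & 0xffu; /* the 8 bits that do not fit */

    /* Round up if above the halfway point, or exactly halfway
     * and the kept mantissa is odd (ties to even). */
    if (dropped > 0x80u || (dropped == 0x80u && (mantissa & 1u))) {
        mantissa++;
    }

    /* Rounding may carry out of the 24-bit mantissa. */
    if (mantissa >= 0x1000000u) {
        mantissa >>= 1;
        exp++;
    }

    return sign | ((exp + 127) << 23) | (mantissa & 0x7fffffu);
}
```

With this version, the input from the question (the int -2147483647, bit pattern 0x80000001) yields 0xcf000000: the mantissa 0xffffff rounds up to 0x1000000, which is then shifted right and the exponent becomes 31.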

gnasher729

@gnasher729 I think you want to say "the lowest 8 bit of y are = 0x80 and bit 9 is 1" in second case – Anton Malmygin Jul 14 '18 at 17:17
