How does conversion from integer to floating-point work?

Question

Each programming language has its own way to convert an integer to a float, translating one bit pattern (01010...) into another. If you look at the generated assembly code, it uses coprocessor instructions, which hide the actual computation from the user.

But how does it really work? How are the mantissa and exponent calculated, algorithmically?

A cast is a syntactic construct, like `(float)`. There is no such thing as “auto-cast”. You mean “conversion”. — Pascal Cuoq, May 13 '14 at 04:34
Yes but I meant what procedure is performed to convert an integer to a float — M4rk, May 13 '14 at 12:27
I don't see why so many people downvoted my question if 4 people have upvoted the answer. — M4rk, Nov 16 '14 at 20:28
You don't see how a question can be considered bad and an answer to the same question good? How about a hypothetical question that is so vague as to make it difficult to answer, shows no research, and poor grammar? Such a question could still receive a clear, explanatory, authoritative answer and still be a bad question. — Pascal Cuoq, Nov 16 '14 at 21:08

Jester · Accepted Answer · 2014-05-12T20:45:26.583

If you know the floating point format, you should have been able to work out the algorithm yourself.

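If the input is 0, the result is all 0 bits.
If the input is negative, set the sign bit to 1 and negate the input (floating point stores a sign and a magnitude, not two's complement).
Find the highest bit set. Add the bias to its index; that's gonna be your exponent.
Shift the value so that the highest set bit lands just above the mantissa field (bit 23 for single precision), then clear that bit; what remains is the mantissa.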
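
As a rough illustration, the steps above could be written in C++ along the following lines. This is only a sketch for a 32-bit int and IEEE 754 single precision; the name int_to_float_bits is made up for this example, and bits that do not fit in the 23-bit mantissa are simply dropped rather than rounded.

```cpp
#include <cstdint>

// Hypothetical helper: returns the IEEE 754 single-precision bit pattern
// for x. Low bits that do not fit in the mantissa are truncated.
std::uint32_t int_to_float_bits(std::int32_t x) {
    if (x == 0) return 0;                           // zero is all 0 bits

    std::uint32_t sign = 0;
    std::uint32_t mag = static_cast<std::uint32_t>(x);
    if (x < 0) {
        sign = 0x80000000u;                         // set the sign bit
        mag = 0u - mag;                             // negate; safe on unsigned, even for INT32_MIN
    }

    int msb = 31;                                   // index of the highest set bit
    while (((mag >> msb) & 1u) == 0) --msb;

    // Move the highest set bit to bit 23 (left or right as needed).
    std::uint32_t shifted = (msb > 23) ? (mag >> (msb - 23))
                                       : (mag << (23 - msb));
    std::uint32_t mantissa = shifted & 0x007FFFFFu; // drop the implicit leading 1
    std::uint32_t exponent = static_cast<std::uint32_t>(msb + 127) << 23; // add the bias

    return sign | exponent | mantissa;
}
```

For example, 10 has its highest bit at index 3, so the exponent field is 130 and the result is 0x41200000, which is the bit pattern of 10.0f.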

Since this question has been tagged assembly, here is a sample implementation for x86:

int_to_float:
    xor eax, eax
    mov edx, [esp+4]
    test edx, edx
    jz .done
    jns .pos
    or eax, 0x80000000 ; set sign bit
    neg edx
.pos:
    bsr ecx, edx
    ; shift the highest bit set into bit #23
    sub ecx, 23
    ror edx, cl         ; works for cl < 0 too
    and edx, 0x007fffff ; chop off highest bit
    or eax, edx         ; mantissa
    add ecx, 127 + 23   ; bias
    shl ecx, 23
    or eax, ecx         ; exponent
.done:
    ret

Note: this returns the float in eax, while the calling convention usually mandates st0. I just wanted to avoid FPU code totally.
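
One more thing worth knowing about the routine above: for inputs with more than 24 significant bits, the rotate-and-mask step discards the low bits, so the result is truncated. A conversion done in hardware under the default IEEE 754 rounding mode rounds to nearest, with ties going to the even mantissa. A possible way to add that, sketched in C++ rather than assembly (the function name is invented for this example, and this is not part of the original routine):

```cpp
#include <cstdint>

// Hypothetical variant with round-to-nearest-even for 32-bit int inputs.
std::uint32_t int_to_float_bits_rne(std::int32_t x) {
    if (x == 0) return 0;

    std::uint32_t sign = (x < 0) ? 0x80000000u : 0u;
    std::uint32_t mag = static_cast<std::uint32_t>(x);
    if (x < 0) mag = 0u - mag;                      // magnitude, safe for INT32_MIN

    int msb = 31;                                   // index of the highest set bit
    while (((mag >> msb) & 1u) == 0) --msb;

    std::uint32_t exponent = static_cast<std::uint32_t>(msb + 127);
    std::uint32_t sig;                              // 24 bits, leading 1 included
    if (msb <= 23) {
        sig = mag << (23 - msb);                    // exact, nothing to round
    } else {
        int drop = msb - 23;                        // number of bits that do not fit
        sig = mag >> drop;
        std::uint32_t rem = mag & ((1u << drop) - 1u);
        std::uint32_t half = 1u << (drop - 1);
        // Round up if above the halfway point, or exactly halfway and sig is odd.
        if (rem > half || (rem == half && (sig & 1u))) ++sig;
        if (sig == (1u << 24)) {                    // rounding carried into a new bit
            sig >>= 1;
            ++exponent;
        }
    }
    return sign | (exponent << 23) | (sig & 0x007FFFFFu);
}
```

With this variant, 2147483647 rounds up to 2^31 (0x4F000000) instead of being truncated to the largest float below it.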

score 3 · Answer 2 · answered May 12 '14 at 20:20

When converting an integer to a floating-point number, it is simply shifted until the mantissa is within the right range, i.e. 1 ≤ m < 2, and the exponent is the number of positions it was shifted.

The binary number 1010 (decimal 10), for example, is shifted until it is 1.010, and the exponent is 3, as that is how many bits it was shifted.

The first digit of the mantissa, the 1 before the binary point, is not stored in the number, as it is always one. (The value zero is treated as a separate case.)

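The exponent (for a double-precision number) is stored with an offset of 1023, so the exponent 3 is stored as 1026. Written as 12 bits, with the sign bit (0 for a positive number) in front of the 11-bit exponent field, these are 001111111111 and 010000000010.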

That makes the representation of 1010 as a double precision floating point number:

010000000010 010 0000000000000000000000000000000000000000000000000

All those zeroes after 010 are there to fill up the rest of the 52-bit mantissa.
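
To check a representation like this on a real machine, one option is to copy the bits of a double into an integer and pick the fields apart. The sketch below assumes IEEE 754 doubles; the struct and function names are made up for illustration.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical container for the three fields of an IEEE 754 double.
struct DoubleFields {
    unsigned sign;            // 1 bit
    unsigned exponent;        // 11 bits, still biased by 1023
    std::uint64_t mantissa;   // 52 bits, implicit leading 1 not included
};

DoubleFields split_double(double d) {
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);            // copy the raw bit pattern
    DoubleFields f;
    f.sign = static_cast<unsigned>(bits >> 63);
    f.exponent = static_cast<unsigned>((bits >> 52) & 0x7FFu);
    f.mantissa = bits & 0xFFFFFFFFFFFFFull;         // low 52 bits
    return f;
}
```

For 10.0 this yields sign 0, exponent 1026 and a mantissa whose top three bits are 010, matching the bit string above.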

You can read more about the floating point format here:
Wikipedia: Double-precision floating-point format

tmyklebu · Answer 3 · 2014-05-12T21:31:38.047

For 32-bit ints, 64-bit int64_ts (from <cstdint>), and IEEE 64-bit doubles, the following trick works (apart from violating aliasing rules and whatnot):

double convert(int x) {
  double tricky = 0x1.8p52;              // 2^52 + 2^51
  int64_t hack = (int64_t &)tricky + x;  // add x to the raw bit pattern
  return (double &)hack - 0x1.8p52;
}

Here I take tricky = 2^52 + 2^51. Doubles in the range [2^52, 2^53) are spaced exactly 1 apart, so the smallest representable change in this value is 1, meaning the significand is measured in units of 1. The significand is stored in the low-order 52 bits of a double. I won't overflow or underflow the significand by adding x to it (since x is 32-bit), so hack is the binary representation of 2^52 + 2^51 + x as a double. Subtracting off 2^52 + 2^51 gives me x, but as a double.

(What follows, I think, is sorta close to x86-64 assembly code. I don't see why it wouldn't do the right thing, but I haven't tested it. Or even assembled it.)

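movsxd rax, dword ptr [x]   ; sign-extend the 32-bit int to 64 bits
add rax, [tricky]           ; add x to the bit pattern of the constant
mov [hack], rax
fld qword ptr [hack]        ; load those bits as a double
fsub qword ptr [tricky]     ; subtract the constant again
fstp qword ptr [answer]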
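
The same idea can probably be written without the aliasing violation by copying the bits with std::memcpy. The sketch below makes some assumptions: 64-bit IEEE doubles, a made-up function name, and the constant 2^52 + 2^51 spelled out in decimal, chosen because adjacent doubles are exactly 1 apart at that magnitude.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical aliasing-safe variant of the bit trick.
double convert_memcpy(std::int32_t x) {
    const double magic = 6755399441055744.0;        // 2^52 + 2^51
    static_assert(sizeof(double) == sizeof(std::int64_t),
                  "this trick assumes 64-bit doubles");

    std::int64_t bits;
    std::memcpy(&bits, &magic, sizeof bits);        // raw bit pattern of the constant
    bits += x;                                      // one mantissa step is worth exactly 1

    double shifted;
    std::memcpy(&shifted, &bits, sizeof shifted);   // now holds 2^52 + 2^51 + x
    return shifted - magic;                         // exact: leaves x as a double
}
```

Because x is at most 31 bits in magnitude, the addition never carries out of the 52-bit mantissa field, so the exponent bits are left untouched.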

does this hack relate to this http://stackoverflow.com/questions/17035464/a-fast-method-to-round-a-double-to-a-32-bit-int-explained? — phuclv, Nov 07 '14 at 10:41
