
The following code should quantize a positive (single-precision) floating-point number to a 32-bit integer. As the positive range contains only 2^31 - 1 (discrete) levels, the code multiplies the sample by this value and rounds the result to an integer:

mov eax, 0x7FFFFFFF   // eax = 2^31 - 1
cvtsi2ss xmm1, eax    // convert eax to float --> xmm1
movss xmm0, [sample]  // where 'sample' is of type float
mulss xmm0, xmm1      // Get sample's quantum into xmm0
cvtss2si eax, xmm0    // Round quantum to the nearest integer --> eax

The problem is that for a sample value of 1.0f, the end result (eax value) is 0x80000000 = 2^31, which is out of range. The expected result would be 1.0 x (2^31 - 1) = 2^31 - 1 = 0x7FFFFFFF.

Moreover, this value is actually the 2's complement representation of -2^31 (note the minus sign).

What am I missing here?

{ MSVC2010 is being used for the testing. }

Bliss
  • 0x7FFFFFFF when converted to float is rounded to 2147483648 to begin with – harold Feb 16 '16 at 22:59
  • Yeah, you should try double precision. – Jester Feb 16 '16 at 23:00
  • @harold, so basically, you're saying that `xmm1` holds the float representation of `2^31` and *not* of `2^31 - 1`? – Bliss Feb 16 '16 at 23:02
  • Single precision is 32 bit, with just 24 bits of mantissa, so yeah, you can't expect to be able to stuff a 32 bit integer into it without possible loss of precision. – Jester Feb 16 '16 at 23:05
  • Just an addendum. If using MSVC 2010, you should have been able to compile/assemble this and step through with the debugger and review the registers including XMM1 after cvtsi2ss and each subsequent instruction. You may have discovered what floating point value was stuffed in xmm1. – Michael Petch Feb 16 '16 at 23:14
  • you should load the constant value directly to xmm1 instead of converting it using `cvtsi2ss xmm1, eax`. That takes a lot of time – phuclv May 04 '16 at 09:03

1 Answer

You move 2^31-1 to EAX and convert it from a 32-bit integer to a single (32-bit) scalar float.

mov eax, 0x7FFFFFFF   // eax = 2^31 - 1
cvtsi2ss xmm1, eax    // convert eax to float --> xmm1

The problem is that there isn't enough mantissa in an IEEE754 32-bit float to represent 2^31-1 accurately. It actually gets rounded up to 2.147483648E9. There is an online binary converter that can better describe how this occurred. The conversion of integer 2^31-1 to a single scalar float 2.147483648E9 is demonstrated here.
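
The rounding can also be checked without the online converter. The following is only a minimal sketch in plain C, assuming the compiler's int-to-float conversion uses the default round-to-nearest mode (as cvtsi2ss does by default); the helper name int32_via_float is made up for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: convert a 32-bit integer to single precision,
   then widen to double so the rounded value can be compared exactly. */
static double int32_via_float(int32_t n)
{
    float f = (float)n;   /* rounds to the nearest representable float */
    return (double)f;     /* float -> double is always exact */
}
```

With this, int32_via_float(0x7FFFFFFF) compares equal to 2147483648.0, while 0x7FFFFF80 (2^31 - 128, the largest float below 2^31) comes back unchanged.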


Exactly representing every integer from 0 to 2^31-1 takes 31 bits. A 32-bit IEEE float (with a 23 + 1 implicit bit mantissa) can exactly represent every integer with magnitude up to 2^24. Outside that range, only some integers are exactly representable; powers of 2 are among them.
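
To see where the 24-bit limit starts to bite, here is a small sketch in plain C. The helper fits_in_float is a hypothetical name, and the code assumes round-to-nearest conversion.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if n survives int32 -> float unchanged.
   The comparison is done in 64 bits because the float may round up to
   2^31, which does not fit in an int32_t. */
static bool fits_in_float(int32_t n)
{
    float f = (float)n;
    return (int64_t)f == (int64_t)n;
}
```

2^24 (16777216) fits, 2^24 + 1 does not (it rounds back to 2^24), and 2^24 + 2 fits again because the spacing between floats in that range is 2. 2^31 - 1 does not fit.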

It's provable (with information theory) that it's impossible to design a 31 bit encoding that can exactly represent all the integers from 0 to 2^31-1 and also be able to represent any other values. The integers use up all the coding space. If such a thing were possible, you could use the technique repeatedly to compress all the world's data into one bit.


The 0x80000000 result is how cvtss2si and cvtsd2si signal overflow. From the Intel insn ref manual (see the wiki for links):

If a converted result is larger than the maximum signed doubleword integer, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value (80000000H) is returned.

It's nothing to do with integer wraparound, or the float value being one past the exact result.
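
A quick way to convince yourself of this is to feed cvtss2si several out-of-range inputs. The sketch below is x86/x86-64 only and assumes the invalid exception is masked (the usual default); it uses the SSE intrinsics _mm_set_ss and _mm_cvtss_si32 from xmmintrin.h, and the wrapper name cvtss2si_32 is made up for illustration.

```c
#include <stdint.h>
#include <xmmintrin.h>   /* SSE intrinsics, x86/x86-64 only */

/* Hypothetical wrapper around cvtss2si with a 32-bit destination. */
static int32_t cvtss2si_32(float x)
{
    return _mm_cvtss_si32(_mm_set_ss(x));
}
```

Every input that is too large returns the same 0x80000000 (INT32_MIN), whether it is 2^31, 3.0e9f or 1.0e20f. If this were wraparound, 3.0e9 would come out as 3000000000 - 2^32 = -1294967296 instead.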

Note that with a 64bit integer register, cvtss2si rax, xmm1 can produce results up to 0x7fffff8000000000, with larger floats producing the 0x8000000000000000 "indefinite value". This is contrary to the text description in Intel's manual, where they forgot to update the max-value paragraph for 64bit operand-size to match what cvtsd2si says. The largest integer you can round-trip to single-precision float without producing an overflow is 0x7fffffbfffffffff.
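
The 64-bit boundary values can be checked the same way as the 32-bit ones. This is just a sketch in plain C, assuming round-to-nearest-even conversion; int64_via_float is a hypothetical helper name.

```c
#include <stdint.h>

/* Hypothetical helper: convert a 64-bit integer to single precision,
   then widen to double (exact) so the result can be compared. */
static double int64_via_float(int64_t n)
{
    float f = (float)n;
    return (double)f;
}
```

0x7fffffbfffffffff rounds down to 2^63 - 2^39 (0x7fffff8000000000, the largest float below 2^63), whereas the next integer, 0x7fffffc000000000, is exactly halfway and rounds up to 2^63, which cvtss2si would then report as overflow.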


If you use a double scalar there is enough mantissa to accurately represent 2^31-1. The conversion of integer 2^31-1 to a double scalar float 2.147483647E9 is demonstrated here.

As Jester pointed out, using double (64-bit) scalar floats would rectify your problem. That code could look something like:

double sample = 1.0;

__asm
{
    mov eax, 0x7FFFFFFF   // eax = 2^31 - 1
    cvtsi2sd xmm1, eax    // convert eax to double float --> xmm1
    movsd xmm0, [sample]  // where 'sample' is of type double float
    mulsd xmm0, xmm1      // Get sample's quantum into xmm0
    cvtsd2si eax, xmm0    // Round quantum to the nearest integer --> eax
}

If you wanted to keep sample as a 32-bit float instead of the double in my example, you could replace movsd xmm0, [sample] with cvtss2sd xmm0, [sample]
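
For comparison, roughly the same computation can be written in plain C without inline assembly. Treat this as a sketch rather than a drop-in replacement: quantize31 is a made-up name, the input is assumed to lie in 0.0 to 1.0, and the final step rounds halves up, whereas cvtsd2si rounds halves to even under the default rounding mode.

```c
#include <stdint.h>

/* Hypothetical helper mirroring the double-precision sequence.
   Assumes 0.0f <= sample <= 1.0f. */
static int32_t quantize31(float sample)
{
    const double scale = 2147483647.0;   /* 2^31 - 1, exact in double */
    double q = (double)sample * scale;   /* like cvtss2sd + mulsd */
    return (int32_t)(q + 0.5);           /* round half up, then truncate */
}
```

Here quantize31(1.0f) gives 0x7FFFFFFF as the question expected, because the scale factor is no longer rounded up to 2^31.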

Given that this answer is based upon the input of multiple contributors, I've marked this as community wiki, so feel free to edit.

Community
Michael Petch
  • Hello Michael, and thanks for the elaborate, most useful answer (which was marked as *the answer*). If I understand correctly, my original code would be suitable for encoding-depths of up to 25-bits per sample (i.e. amplitude is 24 bits), but for 26-32, the conversion to a single-precision float would produce an (intrinsic) error. – Bliss Feb 17 '16 at 09:18
  • @Bliss : Thanks for accepting the answer. Since it was a group effort, and because I marked this as community wiki no one gets any reputation (including myself) but it will now at least be marked as closed/answered for people searching in the future. – Michael Petch Feb 17 '16 at 09:20
  • Shouldn't that last `cvtsd2si` instruction in your code be replaced with `cvtss2sd`? – Bliss Feb 17 '16 at 10:22
  • @Bliss no because the value in `xmm0` is intended as a double (it was created with `mulsd` and so on) and you wanted the result as an int – harold Feb 17 '16 at 10:52
  • **Question**: Using my original code, is only the conversion of 1.0 problematic (causes overflow), or of other values as well (e.g. with magnitude higher than 2^24)? I'm asking because, if it's only 1.0, then the suggested double-precision code would consume double the CPU cycles, just "for the sake" of a single end-value (1.0). **Second Question:** In such a case, maybe this corner case could be specially treated by checking for an overflow etc. (keeping the single-precision calculations)? – Bliss Feb 17 '16 at 15:53
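
On the special-casing idea in the last comment: since the single-precision scale factor is really 2^31, the product sample * 2^31 is exact and reaches 2^31 only when sample is 1.0f or larger. The largest float below 1.0f maps to 2^31 - 128, which still fits. One possible way to keep single precision is therefore to clamp that one case. This is only a sketch and not something the answer proposes: quantize31_float is a hypothetical name, the input is assumed non-negative, the scale is deliberately 2^31 rather than 2^31 - 1, and the cast truncates instead of rounding.

```c
#include <stdint.h>

/* Hypothetical single-precision variant with the overflow case clamped.
   Assumes sample >= 0.0f. */
static int32_t quantize31_float(float sample)
{
    if (sample >= 1.0f)
        return INT32_MAX;                       /* clamp instead of overflowing */
    return (int32_t)(sample * 2147483648.0f);   /* exact product, truncated */
}
```

Note that for results above 2^24 the low bits are still limited by the 24-bit mantissa of the input, which matches the precision concern raised in the earlier comment.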