sqrt of uint64_t vs. int64_t

Question

I noticed that calculating the integer part of square root of uint64_t is much more complicated than of int64_t. Please, does anybody have an explanation for this? Why is it seemingly much more difficult to deal with one extra bit?

The following:

int64_t sqrt_int(int64_t a) {
    return sqrt(a);
}

compiles with clang 5.0 and -mfpmath=sse -msse3 -Wall -O3 to

sqrt_int(long):                           # @sqrt_int(long)
        cvtsi2sd        xmm0, rdi
        sqrtsd  xmm0, xmm0
        cvttsd2si       rax, xmm0
        ret

But the following:

uint64_t sqrt_int(uint64_t a) {
    return sqrt(a);
}

compiles to:

.LCPI0_0:
        .long   1127219200              # 0x43300000
        .long   1160773632              # 0x45300000
        .long   0                       # 0x0
        .long   0                       # 0x0
.LCPI0_1:
        .quad   4841369599423283200     # double 4503599627370496
        .quad   4985484787499139072     # double 1.9342813113834067E+25
.LCPI0_2:
        .quad   4890909195324358656     # double 9.2233720368547758E+18
sqrt_int(unsigned long):                           # @sqrt_int(unsigned long)
        movq    xmm0, rdi
        punpckldq       xmm0, xmmword ptr [rip + .LCPI0_0] # xmm0 = xmm0[0],mem[0],xmm0[1],mem[1]
        subpd   xmm0, xmmword ptr [rip + .LCPI0_1]
        haddpd  xmm0, xmm0
        sqrtsd  xmm0, xmm0
        movsd   xmm1, qword ptr [rip + .LCPI0_2] # xmm1 = mem[0],zero
        movapd  xmm2, xmm0
        subsd   xmm2, xmm1
        cvttsd2si       rax, xmm2
        movabs  rcx, -9223372036854775808
        xor     rcx, rax
        cvttsd2si       rax, xmm0
        ucomisd xmm0, xmm1
        cmovae  rax, rcx
        ret

As Zalman points out, it is converting the arg to `double` (because that's what your C source does), so it's not strictly the integer part of the sqrt. Not all integers above 2^53 can be exactly represented as `double`, so this can involve rounding. https://en.wikipedia.org/wiki/Double-precision_floating-point_format — Peter Cordes, Dec 06 '17 at 23:57

Zalman Stern · Accepted Answer · 2017-12-06T21:47:57.340

First off, you need to be clear that this code is converting 64-bit integers (signed or unsigned) to double precision floating point, taking the square root, and then casting the result back to a signed or unsigned integer.

The answer to your question is because Intel provided 64-bit signed integer to double precision floating-point conversion (and the opposite) in the instruction set you are compiling for, but did not do so for the unsigned case. They added the unsigned conversion instruction in AVX-512, but it does not exist prior to that. So for the signed case, the conversion to double precision and the conversion back are one instruction each. For the unsigned case, the compiler has to synthesize the conversion operation from available instructions.

You can get information on which instructions are available in which versions of SSE2/AVX/AVX-512, etc. here: https://software.intel.com/sites/landingpage/IntrinsicsGuide/

You can see discussion of the method used to synthesize the conversion here: Are there unsigned equivalents of the x87 FILD and SSE CVTSI2SD instructions?

score 4 · Answer 2 · answered Dec 07 '17 at 11:51

In addition to Zalman's excellent answer: The result of the sqrt is always less than INT64_MAX because the input of sqrt is in the uint64_t range. Therefore a single cvttsd2si is sufficient to convert the double back to uint64_t. In other words: For all input values a the function

uint64_t sqrt_int(uint64_t a) {
    return sqrt(a);
}

has exactly the same behaviour as the modified code

uint64_t sqrt_int(uint64_t a) {
    return (int64_t)sqrt(a);
}

The latter function compiles to

.LCPI0_0:
  .long 1127219200 # 0x43300000
  .long 1160773632 # 0x45300000
  .long 0 # 0x0
  .long 0 # 0x0
.LCPI0_1:
  .quad 4841369599423283200 # double 4503599627370496
  .quad 4985484787499139072 # double 1.9342813113834067E+25
sqrt_int(unsigned long): # @sqrt_int(unsigned long)
  movq xmm0, rdi
  punpckldq xmm0, xmmword ptr [rip + .LCPI0_0] # xmm0 = xmm0[0],mem[0],xmm0[1],mem[1]
  subpd xmm0, xmmword ptr [rip + .LCPI0_1]
  haddpd xmm0, xmm0
  sqrtsd xmm0, xmm0
  cvttsd2si rax, xmm0
  ret

which is much less instruction than the original code.

sqrt of uint64_t vs. int64_t

2 Answers2

Linked