Square root in assembly, how to shift and change bits

Question

I want to write a fast integer square root algorithm in assembly, it takes unsigned 32-bit. I've been reading through this, and got an idea. Here's my pseudocode:

res <- 0
for i from 15 downto 0 do:
   change the ith bit of result to 1
   if res^2 > x then:
      change the ith bit of res back to 0
return res

I've done up to here so far:

sqrt:
  movl $0, %eax
  movl $15, %edx
  jmp .L8
.L9

.L8
  cmpq cmpq $0, %edx
  jge .L9

I'm stuck at the for loop operations, changing the ith bit and shifting. I also don't wanna use division or sqrt instructions. I know I should probably use shr, but I have no idea where to start or how to do it. How can I do the operations in the for loop? Where do I start?

What exactly is the problem with [shr and similar](http://x86.renejeschke.de/html/file_module_x86_id_285.html)? Setting and clearing bits is done as usual: with OR and AND masks. — Margaret Bloom, Sep 29 '16 at 23:43
My problem is that if I get an integer (setting x to for example 20), how am I supposed to turn that into binary and go through the bits using shift? — Andrew, Sep 29 '16 at 23:48
Numbers are numbers, binary is only used when you want to think of or write down a number. you don't need to convert anything. Google for AND, OR, SHIFT you'll find tons of material. — Margaret Bloom, Sep 29 '16 at 23:56
You probably won't beat the hardware double-precision square root instruction (on [Intel Skylake](http://stackoverflow.com/a/39683107/224132): 15-16 cycle latency with a throughput of one per 4-6 cycles). A `double` can exactly represent every integer, so the only trick is getting an *unsigned* integer converted to a `double`. This is easy in 64-bit mode: just zero extend it to 64-bit and use `CVTSI2SD xmm0, rax`, since every unsigned 32-bit integer fits in a signed 64-bit integer. In 32-bit code, [it takes some work](http://stackoverflow.com/q/11406654/224132), but it's not bad. — Peter Cordes, Sep 30 '16 at 11:15
For 64-bit integers, x87 80-bit floats can exactly represent every *signed* integer. I think you might have a problem with unsigned 64-bit, though. I think the mantissa size is too small for every integer to have an exact representation in 80-bit (long double). (Also note that even in 32-bit mode, you can use [x87 FILD](http://www.felixcloutier.com/x86/FILD.html) to convert to/from 64-bit integers, and x87 FSQRT is not much slower than SSE2 SQRTSD). — Peter Cordes, Sep 30 '16 at 11:31
Not that you shouldn't finish your project; it's hopefully a fun way to learn some asm. Just that a loop won't be faster in practice (especially because branch mispredicts are inevitable unless you do something branchless.) — Peter Cordes, Sep 30 '16 at 11:34

score 3 · Accepted Answer · answered Sep 30 '16 at 10:29

(Intel syntax, convert to AT&T on your own)

    mov   ebx,<number> ; *number* to find sqrt of
    mov   ecx,0x8000   ; bitmask (starting with b15 bit set)
    ;^^^ 0x8000 = decimal 32768 = binary 1000 0000 0000 0000
    xor   eax,eax      ; result <- 0
sqrt_loop:
    xor   eax,ecx      ; set bit in eax
    push  eax          ; store result (will be destroyed by mul)
    mul   eax          ; edx:eax <- eax*eax (ignoring edx next)
    cmp   eax,ebx      ; compare with *number*
    pop   eax          ; restore result
    jbe   keep_bit     ; res^2 <= *number* -> bit stays set
    xor   eax,ecx      ; unset bit in eax
keep_bit:
    shr   ecx,1        ; next bit
    jnz   sqrt_loop    ; loop till all bits are tried

(I didn't tried+debug it, so there may be some bug. But I think together with your pseudo algorithm and your rewrite to AT&T with debugging this should be enough to get you started)

As Margaret pointed out, number is number, it's the value. So 0x8000 is already encoded in CPU wires as b15 set to 1 and other bits set to 0. All the conversion stuff happens when you want to convert the value from/into string, but as long as you are calculating with values, it's there in the register in all forms at the same time. It just depends, how you look at the register. Using hexa/decimal/binary in source is that, writing STRING representation of number, which gets turned into the value itself by assembler.

The binary representation is special, as the CPU can address particular bits (with and/xor/or, rotations, bit test/set, etc), as it has those values in sort of "wires" and it's native representation for it. It's like when human is "cheating" when calculating "10*3456", writing just additional 0 at the end to get result, because in decimal format that's how 10* is special. For CPU the same happens with bit manipulation and all kind of power of 2 math. But the decimal tricks are not possible, those has the CPU to calculate in proper way, multiplying by 10 for real.

Anyway, when you have only the bit number, and you want to get the bitmask itself, like how to get 0x8000 from 15:

mov   ecx,15  ; i-th bit
mov   eax,1   ; set b0 (lowest bit)
shl   eax,cl  ; shift all bits (all zeroed + b0 set) cl-many times left
; eax now contains 0x8000 = b15 set, other bits zeroed

So if you would stick with your way of algorithm, you would have to recalculate the for counter to bit mask every time (or use some bit set/reset instructions, which I don't know from head, as almost never needed them).

But if you study my code, you will see there's direct shortcut to work over the bitmask itself, without counting the "i-th bit" part, making the code shorter and faster (although I probably killed it by that push/pop, maybe using some one more register like esi to store the value would be better ... then again this demonstrates how stack can be used, and also how flags are not affected by certain instructions, so you can use cmp results in postponed way, if you are careful to not modify the required flag).

Also worth mentioning: `bts` and `btr` are the fastest way to set/clear the i-th bit of a register. (They also set CF from the bit that was there before, but that's not a problem). Also, push/pop inside the inner loop is pretty silly. You could use a single MOV to an extra register, since you're nowhere near running out. (And you should use `imul r32, r32` when you don't want the upper half, so you don't have to destroy EDX. It's faster than `mul r32` because it only has to produce one register output.) — Peter Cordes, Sep 30 '16 at 11:40
This is a great answer, thank you so much for taking the time. — Andrew, Oct 06 '16 at 05:36

Square root in assembly, how to shift and change bits

1 Answers1