Your code is also broken for negative divisors: divide(5,-2) will give zero instead of -2. This is explained purely by the calling convention. The zero-extension instead of sign-extension bug (see @paxdiablo's answer) only matters for negative dividends.
You told the compiler your function takes int args, and int is a 32-bit type in the x86-64 System V calling convention.
You're assuming your inputs are sign-extended to 64-bit, but the calling convention doesn't require that, so the compiler won't waste code-size on 10-byte mov r64, imm64 when it can use 5-byte mov r32, imm32 instead.
For more details, see these Q&As. (the 2nd is basically a duplicate of the first):
Thus your compiler will emit code like this for your main:
mov edi, 5 ; RDI = 0x0000000000000005
mov esi, -2 ; RSI = 0x00000000FFFFFFFE
call _divide
I checked on the Godbolt compiler explorer, and that's what gcc and clang really do (see footnote 1), even for un-optimized code.
For divide(5,-2), your code will result in
- RDX=0, RAX=5. i.e. dividend = 0x0000000000000000:0000000000000005, which is correct. (zero- and sign-extension are the same operation for non-negative inputs).
- divisor = 0x00000000FFFFFFFE = +4294967294, which is large and positive.
64-bit idiv calculates 5 / 4294967294, producing quotient=RAX=0, remainder=RDX=5.
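As a rough illustration, the arithmetic above can be modeled in C. This is only a sketch: buggy_divide_model is a hypothetical name, and it assumes the asm zeroes RDX and then uses the full 64-bit RDI and RSI as operands, as described above.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical model of the buggy asm (not the asker's actual code).
// Assumption: RDX is zeroed and the full 64-bit RDI / RSI are used as operands.
static int64_t buggy_divide_model(int32_t dividend, int32_t divisor) {
    assert(divisor != 0);               // idiv would fault on a zero divisor
    uint64_t rdi = (uint32_t)dividend;  // writing EDI zero-extends into RDI
    uint64_t rsi = (uint32_t)divisor;   // writing ESI zero-extends into RSI
    // With RDX = 0 the 128-bit dividend is just the zero-extended value, and a
    // divisor whose upper 32 bits are zero is positive as a signed 64-bit number.
    return (int64_t)rdi / (int64_t)rsi;
}
```

Under these assumptions, buggy_divide_model(5, -2) computes 5 / 4294967294 = 0, and buggy_divide_model(-554, 2) computes 4294966742 / 2 = 2147483371 rather than -277.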
If you only fixed the type-width / operand-size mismatch bug, you'd still have problems with negative dividends, as @paxdiablo's answer explains. Both fixes are necessary for divide(-554,2) to actually work.
So how should you have written it?
You could change the prototype to int64_t or long (which is 64-bit in x86-64 System V), and use cqo to set up for signed division. (See: When and why do we sign extend and use cdq with mul/div?)
Or you could sign-extend your 32-bit inputs to 64-bit, with movsxd rax, edi / movsxd rcx, esi. But that would be silly. Just use 32-bit operand-size since that's what you told the compiler to pass.
That's good because 64-bit division is much slower than 32-bit division. (https://agner.org/optimize/, and C++ code for testing the Collatz conjecture faster than hand-written assembly - why?).
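To see that the sign-extending alternative gives the same answer as plain 32-bit division (just with more work), here is a minimal C model. The name divide_via_64bit is hypothetical, and the comments map each step to the instructions mentioned above only loosely.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical model of the movsxd + cqo + 64-bit idiv alternative.
// Not handled: INT32_MIN / -1, whose quotient does not fit in 32 bits.
static int32_t divide_via_64bit(int32_t dividend, int32_t divisor) {
    assert(divisor != 0);
    int64_t rax = (int64_t)dividend;  // like movsxd rax, edi
    int64_t rcx = (int64_t)divisor;   // like movsxd rcx, esi
    int64_t quotient = rax / rcx;     // like cqo / idiv rcx
    return (int32_t)quotient;         // an int-returning function's caller only reads EAX
}
```

For inputs where the quotient fits in 32 bits, this matches dividend / divisor on int directly, e.g. divide_via_64bit(5, -2) is -2 and divide_via_64bit(-554, 2) is -277.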
This is what I'd do:
global _divide
; inputs: int32_t dividend in EDI, int32_t divisor in ESI
; output: int32_t quotient in EAX, int32_t remainder in EDX
; (C callers won't be able to access the remainder, unfortunately)
_divide:
mov eax, edi
cdq ; sign-extend the dividend into edx:eax
idiv esi ; no need to copy to ecx/rcx first
ret
No need to push RBP; we're not calling any other functions so realigning the stack doesn't matter, and we're not modifying RBP for use as a frame pointer.
We're allowed to clobber RDX without saving/restoring it: it's a call-clobbered register in x86-64 System V, and Windows x64. (Same as in most 32-bit calling conventions). This makes sense because it's used implicitly by some common instructions like idiv.
That's what gcc and clang do emit (with optimization enabled of course) if you write it in C.
int divide(int dividend, int divisor) {
return dividend / divisor;
}
(See the Godbolt link above, where I included it with __attribute__((noinline)) so I could still see main actually setting up function args. I could have just named it something else instead.)
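Since a C caller can't read the remainder the asm version leaves in EDX, one way to get both results in plain C is to return them in a small struct. This is a sketch with hypothetical names (quot_rem, divide_with_remainder). Optimizing compilers can usually compute / and % of the same operands with a single idiv, but I haven't checked the code-gen for this exact function.

```c
#include <assert.h>

// Hypothetical helper type and function, for illustration only.
struct quot_rem {
    int quot;
    int rem;
};

static struct quot_rem divide_with_remainder(int dividend, int divisor) {
    assert(divisor != 0);         // division by zero is undefined in C (and faults in idiv)
    struct quot_rem r;
    r.quot = dividend / divisor;  // truncates toward zero, like idiv's quotient
    r.rem = dividend % divisor;   // takes the sign of the dividend, like idiv's remainder
    return r;
}
```

For example, divide_with_remainder(5, -2) gives quot = -2 and rem = 1, and divide_with_remainder(-7, 2) gives quot = -3 and rem = -1.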
As usual, looking at compiler output to see the difference between your code and what the compiler did can clue you in to something you did wrong. (Or give you a better starting point for optimization. In this case the compilers don't have any missed optimizations, though.) See How to remove "noise" from GCC/clang assembly output?.
You can change the types to long (which is 64-bit in x86-64 System V, unlike in Windows x64) if you want to see code-gen for 64-bit integers. And also see how the caller changes, e.g.
mov edi, 5
mov rsi, -2
call _divide
Footnote 1: Interestingly clang -O3's asm output has mov esi, -2, but clang -O0 writes it as mov esi, 4294967294.
Those both assemble to the same instruction, of course, zeroing the upper 32 bits of RSI, because that's how AMD designed AMD64, rather than for example implicitly sign-extending into the full register, which would have been a valid design choice but probably not quite as cheap as zero-extending.
And BTW, Godbolt has compilers targeting Linux, but that's the same calling convention. The only difference is that OS X decorates function names with a leading _ but Linux doesn't.