Trying to learn x86 Assembly, getting odd results in simple test program

Question

I'm attempting to learn the basics of x64 assembly language from the 2nd Edition of Modern x86 Assembly Language Programming.

The fifth example program in chapter 2 introduces how to perform simple integer division using various width parameters, and provides a simple example test routine to showcase the result. The problem I'm getting is that I seem to be returning an erroneous result, and cannot for the life of me understand why.

The provided C++ code for the routine is as follows:

void UnsignedIntegerDiv(void)
{
    uint8_t a = 12;
    uint16_t b = 17;
    uint32_t c = 71000000;
    uint64_t d = 90000000000;
    uint8_t e = 101;
    uint16_t f = 37;
    uint32_t g = 25;
    uint64_t h = 5;
    uint64_t quo1, rem1;
    uint64_t quo2, rem2;

    quo1 = (a + b + c + d) / (e + f + g + h);
    rem1 = (a + b + c + d) % (e + f + g + h);
    UnsignedIntegerDiv_(a, b, c, d, e, f, g, h, &quo2, &rem2);

    cout << "\nResults for UnsignedIntegerDiv\n";
    cout << "a = " << (unsigned)a << ", b = " << b << ", c = " << c << ' ';
    cout << "d = " << d << ", e = " << (unsigned)e << ", f = " << f << ' ';
    cout << "g = " << g << ", h = " << h << '\n';
    cout << "quo1 = " << quo1 << ", rem1 = " << rem1 << '\n';
    cout << "quo2 = " << quo2 << ", rem2 = " << rem2 << '\n';
}

and the corresponding assembly routine is:

UnsignedIntegerDiv_ proc

; Calculate a + b + c + d
        movzx rax,cl                        ;rax = zero_extend(a)
        movzx rdx,dx                        ;rdx = zero_extend(b)
        add rax,rdx                         ;rax = a + b
        mov r8d,r8d                         ;r8 = zero_extend(c)
        add r8,r9                           ;r8 = c + d
        add rax,r8                          ;rax = a + b + c + d
        xor rdx,rdx                         ;rdx:rax = a + b + c + d

; Calculate e + f + g + h
        movzx r8,byte ptr [rsp+40]          ;r8 = zero_extend(e)
        movzx r9,word ptr [rsp+48]          ;r9 = zero_extend(f)
        add r8,r9                           ;r8 = e + f
        mov r10d,[rsp+56]                   ;r10 = zero_extend(g)
        add r10,[rsp+64]                    ;r10 = g + h;
        add r8,r10                          ;r8 = e + f + g + h
        jnz DivOK                           ;jump if divisor is not zero

        xor eax,eax                         ;set error return code
        jmp done

; Calculate (a + b + c + d) / (e + f + g + h)

DivOK:  div r8                              ;unsigned divide rdx:rax / r8
        mov rcx,[rsp+72]
        mov [rcx],rax                       ;save quotient
        mov rcx,[rsp+80]
        mov [rcx],rdx                       ;save remainder

        mov eax,1                           ;set success return code

Done:   ret

Whilst the code seems to make sense, I'm getting differing output from the C++ implementation of the routine compared to the assembly implementation (the C++ implementation is giving the expected result for quo1, rem1 - it's quo2, rem2 that are incorrect):

Results for UnsignedIntegerDiv
a = 12, b = 17, c = 71000000 d = 90000000000, e = 101, f = 37 g = 25, h = 5
quo1 = 536136904, rem1 = 157
quo2 = 0, rem2 = 90071000029

Which has left me scratching my head as to why. My guess is that there's a typo that I'm missing somewhere (although I get the same result from the downloadable code from the books GitHub page) leading to an overflow or truncation somewhere within the ASM portion.

Any help from the community would be greatly appreciated as I know that I don't know enough to fully troubleshoot this on my own.

What's the prototype for `UnsignedIntegerDiv_`? That determines what implicit conversions the compiler performs on the C args. — Peter Cordes, May 19 '20 at 20:50
Good Spot. The prototype for 'h' was set to uint32_t rather than uint64_t ... I've been staring at the code for far too long, and managed to completely miss it. Thanks. — Mark Edwards, May 19 '20 at 20:53
One debugging technique that could have worked is to single-step with a debugger, *including the compiler-generated code*. Starting with your own code, but single-step and watch register values. When you see `add r10,[rsp+64]` produce garbage in the upper half of `r10`, then you'd check that stack memory and see that only the low 4 bytes were set appropriately, that would be a big hint in the right direction to check why the compiler isn't clearing the high 4 bytes of what you wanted to be a 64-bit arg. — Peter Cordes, May 19 '20 at 20:57
Also, code review: prefer 32-bit operand-size whenever possible to make the machine code smaller. e.g. let implicit zero-extension to 64-bit do its job with `movzx eax, cl` or `xor edx,edx`, like you do for `g`. Also prefer `mov` / `movzx` to a different register [so modern CPUs can optimize internally with mov-elimination](https://stackoverflow.com/questions/44169342/can-x86s-mov-really-be-free-why-cant-i-reproduce-this-at-all), saving a uop for the execution ports and adding 0 instead of 1 cycle latency to that critical path dependency chain. e.g. `mov ecx, r8d` instead of `mov r8d,r8d` — Peter Cordes, May 19 '20 at 21:03
That's pretty minor, though; overall nicely done. No wasted instructions, and doing the sum of 4 values as 2 pairs -> 1 final is good for latency / OoO exec (instruction level parallelism). Branch layout: make non-error the fall-through fast path with no taken branches. `jz` to a notOk block after the main `ret` that returns zero. Normally that block is never reached at all so if it's not in the same i-cache line it never needs to be touched by the CPU. Also, zeroing a register is cheaper than setting it to 1, so you could make `0` = no problems as the return code your caller ignores... — Peter Cordes, May 19 '20 at 21:09
Thanks Peter. Some very useful explanations there. I'm still at an early learning stage, so some of this is a little beyond my ken at the moment. But I'm getting there :) — Mark Edwards, May 19 '20 at 21:12
Sure, the reasons why are pretty deep, but rule of thumb: lay out your branches so the common case has no taken branches, when convenient. Having multiple `ret` blocks is totally fine (and called "tail duplication"). And a more minor code-size rule of thumb: prefer 32-bit operand size (and the "legacy" eax..edi registers, not r8d..r15d) when it doesn't cost any extra instructions, letting the CPU do implicit zero-extension to 64-bit. Most instructions other than division (or multiply on some CPUs) are the same speed for 32 and 64-bit, so it's just code size. But 64-bit div is slower. :/ — Peter Cordes, May 19 '20 at 21:16

score 1 · Accepted Answer · answered May 19 '20 at 20:55

Looks like I've been suffering code blindness :)

The prototype for the UnsignedIntegerDiv_ function had an incorrect datatype for one of the parameters (h in the definition below):

extern "C" int UnsignedIntegerDiv_(uint8_t a, uint16_t b, uint32_t c, uint64_t d, uint8_t e, uint16_t f, uint32_t g, uint32_t h, uint64_t* quo, uint64_t* rem);

instead of

extern "C" int UnsignedIntegerDiv_(uint8_t a, uint16_t b, uint32_t c, uint64_t d, uint8_t e, uint16_t f, uint32_t g, uint64_t h, uint64_t* quo, uint64_t* rem);

Thanks to @Peter Cordes for pointing me in the right direction.

Trying to learn x86 Assembly, getting odd results in simple test program

1 Answers1