I am trying to test an addition function written in 64-bit assembly with TDM-GCC on Windows. I searched for resources on this a while back and came across code similar to this (I made some changes so that it compiles with TDM-GCC).

typedef struct
{
    int size;
    __uint64_t uints[130];
} BigInteger;

void add(BigInteger *X, BigInteger *Y);   // X += Y
    // %rcx holds address of X, and %rdx holds address of Y apparently.

    // calc.s - assembly file
    .globl add
add:
    movq    8(%rdx), %rax
    addq    %rax, 8(%rcx)
    movq    16(%rdx), %rax
    adcq    %rax, 16(%rcx)
    movq    24(%rdx), %rax
    adcq    %rax, 24(%rcx)
    ...     ...

This first assembly code works. The downside is that even small numbers take just as long as numbers of the largest size. So instead I made it check the sizes of X and Y and added a loop bounded by that size, so that it doesn't have to add the whole array when X and Y are not big.
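For checking the assembly against a known-good result, the size-bounded addition described above can be modeled in portable C. This is only a sketch under assumptions: the name `add_limbs_ref`, the raw-array interface, and the limb count `n` are hypothetical and not part of the original code, and limbs are assumed to be stored least significant first, as the `addq`/`adcq` sequence implies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model: x[0..n) += y[0..n), least significant
 * limb first. Returns the carry out of the top limb (0 or 1). */
static unsigned add_limbs_ref(uint64_t *x, const uint64_t *y, size_t n)
{
    assert(x != NULL && y != NULL);
    unsigned carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t sum = x[i] + y[i];     /* wraps modulo 2^64 */
        unsigned c1 = sum < x[i];       /* 1 if x[i] + y[i] overflowed */
        uint64_t res = sum + carry;     /* add the incoming carry */
        unsigned c2 = res < sum;        /* 1 if adding the carry overflowed */
        x[i] = res;
        carry = c1 | c2;                /* at most one of c1, c2 can be set */
    }
    return carry;
}
```

Running the assembly routine and this model on the same inputs and comparing the limbs is one way to catch a lost carry.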

    ...
// %r13 holds X addr, %r14 holds Y addr.
    addq    $8, %r13   // I have tried  incq %r13
    addq    $8, %r14   // I have tried  incq %r14
    movq    (%r14), %rax
    addq    %rax, (%r13)
    decl    %ecx
    cmpl    $0, %ecx
    je      .add4
.add3:
    addq    $8, %r13   // I have tried  incq %r13
    addq    $8, %r14   // I have tried  incq %r14
    movq    (%r14), %rax
    adcq    %rax, (%r13)
    loop    .add3
.add4:
    ...

But I was naive to think that adding 8 bytes to the addresses of X and Y (%r13, %r14) with ADDQ would let me iterate through the array. The problem is that ADDQ itself writes the carry flag (a pointer increment that doesn't overflow clears it), so the carry from the previous ADCQ is lost and the add-with-carry loop (.add3) breaks down. I tried using

    incq %r13

thinking incq would work like incrementing a pointer in C++, which knows how many bytes to move by. But it only increments the register value by 1, not by the 8 I expected.


So my question is: is there a way in assembly to increment a register by a number larger than 1, or to do an addition without touching the carry flag at all? (Not add or adc, because both of them write the carry flag.)

Peter Cordes
    Depending on your cpu model you might have access to the `ADOX` instruction specifically designed for this purpose. Also note you can of course just scale your index by 8 in the effective address, no need to increment by 8. – Jester Oct 15 '18 at 12:33
    If you care about performance on Intel CPUs, don't use the slow `loop` instruction. [Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?](https://stackoverflow.com/q/35742570). It's only fast on AMD, but very slow on Intel. On Sandybridge and later, you can use `dec ecx / jnz` without suffering from partial-flag stalls in an ADC loop. [Problems with ADC/SBB and INC/DEC in tight loops on some CPUs](https://stackoverflow.com/q/32084204) But that's a problem on Nehalem and earlier Intel P6-family CPUs. – Peter Cordes Oct 15 '18 at 14:14
    Also, prefer memory-source `adc` and a separate `mov` store, if that works for your loop. I think that saves a fused-domain uop on Intel CPUs, [because of a microarchitectural quirk](/questions/17395557/observing-stale-instruction-fetching-on-x86-with-self-modifying-code#comment68191840_18388700) that required memory-destination ADC to have an extra uop. Especially on Broadwell and later where register-destination `adc` is a single uop ([except for the `adc eax, imm32` special encoding which is still 2 uops](/posts/comments/92389556)). – Peter Cordes Oct 15 '18 at 14:17

1 Answer

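Use load effective address (`lea`) with register + offset. `lea` does not modify any flags, so the carry flag survives the pointer increment. In the AT&T syntax used in the question, the equivalent is `leaq 8(%rax), %rax`.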

        lea     rax,[rax+8]    ;add 8 to rax
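
As a minimal sketch of how this could fit into the asker's loop, the function below wraps an `adc` loop in GCC inline assembly (AT&T syntax) and advances both pointers with `leaq`. The name `add_limbs_lea` and the raw-array interface are hypothetical, and the loop counter uses `dec`/`jnz` as suggested in the comments rather than `loop`; `dec` writes ZF but leaves CF unchanged. On compilers or targets other than GCC-compatible x86-64, the sketch falls back to a plain C loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: x[0..n) += y[0..n), n >= 1, least significant
 * limb first. The carry out of the top limb is discarded. */
static void add_limbs_lea(uint64_t *x, const uint64_t *y, size_t n)
{
    assert(n > 0);
#if defined(__GNUC__) && defined(__x86_64__)
    uint64_t tmp;
    __asm__ volatile(
        "clc\n\t"                   /* start with CF = 0 */
        "1:\n\t"
        "movq (%[y]), %[tmp]\n\t"   /* mov does not touch flags */
        "adcq %[tmp], (%[x])\n\t"   /* consumes and produces CF */
        "leaq 8(%[x]), %[x]\n\t"    /* pointer += 8, flags untouched */
        "leaq 8(%[y]), %[y]\n\t"
        "decq %[n]\n\t"             /* sets ZF, leaves CF alone */
        "jnz 1b\n\t"
        : [x] "+r"(x), [y] "+r"(y), [n] "+r"(n), [tmp] "=&r"(tmp)
        :
        : "cc", "memory");
#else
    /* Portable fallback with the same result. */
    unsigned carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t sum = x[i] + y[i];
        unsigned c1 = sum < x[i];
        uint64_t res = sum + carry;
        unsigned c2 = res < sum;
        x[i] = res;
        carry = c1 | c2;
    }
#endif
}
```

The first iteration uses `adc` with CF cleared by `clc`, which behaves like a plain `add`, so the separate first step before `.add3` in the question is not needed in this form.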
rcgldr