2

Intel introduced ADCX, AODX to support big integer arithmetic but it is hard to generate ADCX/AODX using only one function: _addcarryx_u64

Suppose that for big integer addition of A and B where the integer is represented by an array of uint64_t denoted by pa and pb respectively. Here's a possible implementation (pa and pb are kept to be of same length n and the overflow of the result is not the concern` for simplicity).

#include <immintrin.h>
using limb_type = uint64_t;

void add(const limb_type * __restrict__ pa, const limb_type * __restrict__ pb, limb_type * __restrict__ pr, unsigned n) {
    unsigned char carry = 0;
    unsigned i;
    for(i=0;i<n;i++) {
        carry = _addcarryx_u64(carry, pa[i], pb[i], &pr[i]);
    }
}

Godbolt shows:

        mov     rcx, QWORD PTR [rdi+rax]  # tmp106, MEM[base: pa_16(D), index: ivtmp.11_26, offset: 0B]
        add     r8b, -1   # carry: to set CF Flag for subsequent adc
        adc     rcx, QWORD PTR [rsi+rax]  # tmp106, MEM[base: pb_15(D), index: ivtmp.11_26, offset: 0B]
        mov     QWORD PTR [rdx+rax], rcx  #* ivtmp.11, tmp106
        setc    r8b     #, carry

        add     rax, 8    # ivtmp.11,
        cmp     r9, rax   # _20, ivtmp.11
        jne     .L3       #,

The problem is that the loop increment resets CF flag which in the next iteration is set by instruction number 2. Thus there are redundant instructions that are being executed: 2nd instruction (ADD) and 5th instruction SETC.

Solution: The need to save and set the CF flag can be eliminated if somehow it is possible to save the carry to OF flag and use ADOX instead of ADC as generated by GCC/CLANG.

Since inline ASM is not recommended by GCC and I want to consider it as my last resort but is there a way to do this?

Note that this is a performance critical loop and that's why I want to eliminate 2 instructions.

madhur4127
  • 276
  • 2
  • 13
  • 1
    AFAIK, gcc and clang are both terrible at optimizing `_addcarryx_u64` and there's nothing you can do to get them to emit a decent asm loop for you. You probably *do* still want to write this loop in inline asm (make sure you use a `"memory"` clobber and don't modify input-only operands). `adox` is not helpful at all here vs. `adc`, and what you actually want is an `inc` or `dec/jnz` loop with pointer increments using LEA to avoid disturbing CF. That's fine on modern CPUs post P6-family: [Problems with ADC/SBB and INC/DEC in tight loops on some CPUs](https://stackoverflow.com/q/32084204) – Peter Cordes May 03 '20 at 13:45
  • 1
    `cmp` writes all flags (including OF and CF); that's why `adox` / `adcx` are not helpful. You only need `adc` for a single dependency chain. Some things leave CF unmodified, but any way of looping that touches flags at all destroys OF. – Peter Cordes May 03 '20 at 13:48
  • 1
    @fuz: perhaps they meant to say that Intel introduced the ADX extension (ADCX + ADOX). It exists to allow ILP in biginteger *multiplication* where you can actually have multiple chains of carry-propagation: https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf. Or if you were doing two separate biginteger additions in parallel, interleaving operations on two chains. – Peter Cordes May 03 '20 at 13:51
  • There are several existing answers that cover the current GCC situation; I linked them as duplicates, especially [\_addcarry\_u64 and \_addcarryx\_u64 with MSVC and ICC](https://stackoverflow.com/a/45796549) has mailing list links. Inline asm is only recommended as a last resort when you can't get decent asm from intrinsics, and you can spend the time to make sure you got the constraints right and don't have any lurking bugs. But this is one of the few cases where inline asm is justified: compilers (except maybe ICC) are garbage at any pure C / intrinsics way of writing it. – Peter Cordes May 03 '20 at 14:22
  • @PeterCordes, yes there is discussion on big interfere addition but my original intention as to force the compiler to generate ADOX instructions, if ADOX is generated as I wanted then maybe optimizer can generate `inc/dec` with `lea` and `jnz`. But there is no discussion on the topic that the title of the question suggests. The example was kind of a motivating one but can be removed. – madhur4127 May 04 '20 at 03:58
  • You can't get GCC to generate ADOX because it wouldn't be any help. `dec/jnz` would destroy OF but not CF, making it possible to use with ADC or ADCX, but not ADOX. So if that's what you were asking, I can reopen your question and post that as an answer. The fact that GCC makes a mess is unrelated to using only `adc` instead of `adox`. (You could try with `-march=skylake` or something so GCC knows that `inc` is efficient.) – Peter Cordes May 04 '20 at 04:02
  • @PeterCordes, I have read in one of the links that you posted that GCC won't support ADX because GCC doesn't understand OF and CF carry chains. Its a limitation. Anything that causes a significant perf gain even if it is not optimal like asm can be an accepted answer. – madhur4127 May 04 '20 at 04:11
  • 1
    I am also intrigued by unrolling the loop like this: [Godbolt link](https://godbolt.org/z/tCDVy9). Sadly neither GCC nor ICC doesn't optimize the unrolled version like clang does. – madhur4127 May 04 '20 at 04:15
  • Yup, looks like clang does a good job within the loop body, so a bit of manual unrolling will amortize the badness of save/restore CF across loop iterations. But GCC is still garbage; probably nothing you can do in the source to get GCC to emit better code. I know that's not what you want to hear, but I seriously doubt there's any way other than inline asm to get a non-bad loop out of GCC. And if there is, it would also be a valid answer to one of the other questions and could / should be posted there. (So no need to reopen this except to explain that OF gets destroyed by `dec`.) – Peter Cordes May 04 '20 at 04:34
  • The downside to using inline asm is it precludes any chance that the compiler might be able to optimize away parts of the loop. If (say) `n` were 1, there might be ways to shortcut your existing code. But once you add asm in gcc, it's always going to execute every single one of the instructions inside that template every time. – David Wohlferd May 04 '20 at 05:55

0 Answers0