
I'm testing Intel ADX add-with-carry (adcx) and add-with-overflow (adox) to pipeline additions on large integers, and I'd like to see what the expected code generation should look like. Based on _addcarry_u64 and _addcarryx_u64 with MSVC and ICC, I thought this would be a suitable test case:

#include <stdint.h>
#include <x86intrin.h>
#include "immintrin.h"

int main(int argc, char* argv[])
{
    #define MAX_ARRAY 100
    uint8_t c1 = 0, c2 = 0;
    uint64_t a[MAX_ARRAY]={0}, b[MAX_ARRAY]={0}, res[MAX_ARRAY]={0};
    for(unsigned int i=0; i< MAX_ARRAY; i++){ 
        c1 = _addcarryx_u64(c1, res[i], a[i], (unsigned long long int*)&res[i]);
        c2 = _addcarryx_u64(c2, res[i], b[i], (unsigned long long int*)&res[i]);
    }
    return 0;
}

When I examine the code generated by GCC 6.1 using -O3 and -madx, it shows serialized adc instructions. -O1 and -O2 produce similar results:

main:
        subq    $688, %rsp
        xorl    %edi, %edi
        xorl    %esi, %esi
        leaq    -120(%rsp), %rdx
        xorl    %ecx, %ecx
        leaq    680(%rsp), %r8
.L2:
        movq    (%rdx), %rax
        addb    $-1, %sil
        adcq    %rcx, %rax
        setc    %sil
        addb    $-1, %dil
        adcq    %rcx, %rax
        setc    %dil
        movq    %rax, (%rdx)
        addq    $8, %rdx
        cmpq    %r8, %rdx
        jne     .L2
        xorl    %eax, %eax
        addq    $688, %rsp
        ret

So I'm guessing the test case is not quite hitting the mark, or I am doing or using something incorrectly.

If I am parsing Intel's docs on _addcarryx_u64 correctly, I believe this C code should generate the pipelined sequence, so I'm guessing I am doing something wrong:

Description

Add unsigned 64-bit integers a and b with unsigned 8-bit carry-in c_in (carry or overflow flag), and store the unsigned 64-bit result in out, and the carry-out in dst (carry or overflow flag).
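As a sanity check on that description, here is a portable C model of the computation the intrinsic performs (the helper name is mine, for illustration only; the real intrinsic maps to a single adc or adcx/adox instruction):

```c
#include <stdint.h>

/* Portable model of _addcarry_u64(c_in, a, b, out):
   *out = a + b + c_in (mod 2^64); the return value is the carry-out. */
static unsigned char add64_model(unsigned char c_in,
                                 uint64_t a, uint64_t b, uint64_t *out)
{
    uint64_t s = a + (c_in & 1);
    unsigned char c = (s < a);     /* carry out of adding the carry-in */
    *out = s + b;
    return c | (*out < s);         /* carry out of adding b */
}
```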

How can I generate the pipelined add-with-carry/add-with-overflow (adcx/adox)?


I've actually got a 5th generation Core i7 ready for testing (notice the adx cpu flag):

$ cat /proc/cpuinfo | grep adx
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush
dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc
arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni
pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1
sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm
3dnowprefetch ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase
tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt
...
jww
  • I think these intrinsics are mostly there because MSVC does not allow inline assembly in 64-bit mode. With GCC you're expected to use inline assembly in this case. In fact, the best way to use `adc` (which has been around for decades) with GCC is inline assembly. It's nice to have inline assembly as an option, but it's too bad it's such a PITA to use in GCC. – Z boson Sep 06 '16 at 07:49
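A minimal sketch of that inline-asm approach: a hand-written adc chain for a fixed 4-limb addition (my own helper, assuming GCC/Clang on x86-64; a portable fallback is included so it runs elsewhere):

```c
#include <stdint.h>
#include <stddef.h>

/* res = a + b over 4 little-endian 64-bit limbs, using one add/adc chain. */
static void add4(uint64_t res[4], const uint64_t a[4], const uint64_t b[4])
{
#if defined(__x86_64__) && defined(__GNUC__)
    uint64_t r0 = a[0], r1 = a[1], r2 = a[2], r3 = a[3];
    __asm__("addq %4, %0\n\t"      /* first limb sets CF */
            "adcq %5, %1\n\t"      /* remaining limbs propagate CF */
            "adcq %6, %2\n\t"
            "adcq %7, %3"
            : "+r"(r0), "+r"(r1), "+r"(r2), "+r"(r3)
            : "rm"(b[0]), "rm"(b[1]), "rm"(b[2]), "rm"(b[3])
            : "cc");
    res[0] = r0; res[1] = r1; res[2] = r2; res[3] = r3;
#else
    unsigned char c = 0;
    for (size_t i = 0; i < 4; i++) {
        uint64_t s = a[i] + c;
        c = (s < c);               /* carry from adding the previous carry */
        res[i] = s + b[i];
        c |= (res[i] < s);         /* carry from adding the limb */
    }
#endif
}
```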

2 Answers


This does look like a good test case. It compiles to correct working code, right? It's useful for a compiler to support the intrinsic in that sense, even if it doesn't yet generate optimal code from it: it lets people start using the intrinsic, and is necessary for compatibility.

Next year or whenever the compiler's backend support for adcx/adox is done, the same code will compile to faster binaries with no source modification.

I assume that's what's going on for gcc.


clang 3.8.1's implementation is more literal, but it ends up doing a terrible job: flag-saving with sahf and push/pop of eax. See it on Godbolt.

I think there's even a bug in the asm source output, since mov eax, ch won't assemble. (Unlike gcc, clang/LLVM uses a built-in assembler and doesn't actually go through a text representation of asm on the way from LLVM IR to machine code). The disassembly of the machine code shows mov eax,ebp there. I think that's also a bug, because bpl (or the rest of the register) doesn't have a useful value at that point. Probably it wanted mov al, ch or movzx eax, ch.

Peter Cordes
  • Update: clang3.9 and 4.0 crash on that source, clang5.0 compiles it reasonably. (Using only adcx, but with enough unrolling to enable ILP by saving/restoring the carry for each chain separately.) – Peter Cordes Oct 24 '17 at 22:39

When GCC is eventually fixed to generate much better inline code for _addcarryx_..., be careful with your code, because the loop control contains a comparison (which modifies the C and O flags like a sub instruction) and an increment (which modifies the C and O flags like an add instruction).

  for(unsigned int i=0; i< MAX_ARRAY; i++){ 
        c1 = _addcarryx_u64(c1, res[i], a[i], (unsigned long long int*)&res[i]);
        c2 = _addcarryx_u64(c2, res[i], b[i], (unsigned long long int*)&res[i]);
    }

For that reason, c1 and c2 in your code will always be handled poorly (saved and restored in temporary registers at each loop iteration), and the code generated by gcc will still look like the assembly you provided, for good reasons.

From a run-time point of view, res[i] is an immediate dependency between the two _addcarryx instructions: the two instructions are not really independent, and will not benefit from any instruction-level parallelism available in the processor.

I understand the code is only an example, but it may not be the best example to use once gcc is modified.

The addition of 3 numbers in large-integer arithmetic is a tough problem; vectorization helps, and then you had better use addcarryx to handle the loop control in parallel (increment and comparison+branch on the same variable, yet another tough problem).
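For example, res[] = a[] + b[] + d[] over arrays naturally splits into two carry chains, which is exactly the shape adcx (using CF) and adox (using OF) are designed to run in parallel. A portable C sketch of that structure (illustrative names, not actual compiler output):

```c
#include <stdint.h>
#include <stddef.h>

#define LIMBS 4

/* Two interleaved carry chains, the software shape of the adcx/adox idiom:
   chain 1 (CF in hardware) computes a + b, chain 2 (OF) adds d on top. */
static void add3(uint64_t res[LIMBS], const uint64_t a[LIMBS],
                 const uint64_t b[LIMBS], const uint64_t d[LIMBS])
{
    unsigned char c1 = 0, c2 = 0;
    for (size_t i = 0; i < LIMBS; i++) {
        uint64_t t = a[i] + c1;
        c1 = (t < a[i]);
        uint64_t s = t + b[i];     /* chain 1: a + b */
        c1 |= (s < t);
        uint64_t u = s + c2;
        c2 = (u < s);
        res[i] = u + d[i];         /* chain 2: (a + b) + d */
        c2 |= (res[i] < u);
    }
}
```

Unlike the question's loop, the two chains here share no memory operand within an iteration, so hardware (or adcx/adox) can overlap them.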

Pierre
  • clang5.0 unrolls the loop enough to be useful. (https://godbolt.org/g/2NTfVs) It's actually an interesting test to have the 2nd carry chain dependent on the first. But note that it's only a one-way dependency: the `res[] += a[]` chain can run ahead of the `res[] += b[]` chain, which is what clang does. (Then reuses those 4 `res[]` values while they're still in registers.) – Peter Cordes Oct 24 '17 at 22:45
  • Good point that this needs loop unrolling to avoid saving/restoring carry every iteration (unless you loop without flags, using `lea` and `jrcxz`, or `loop`, [but those are unfortunately not as efficient except on AMD](https://stackoverflow.com/questions/35742570/why-is-the-loop-instruction-slow-couldnt-intel-have-implemented-it-efficiently) – Peter Cordes Oct 24 '17 at 22:47
  • Thanks for the link to Godbolt. Looking at the code generated by different compilers, adcx is used as if it were adc, and adox is not used. You are right: with unrolling of a few iterations, the 2 dependency chains could be interleaved, and pushf/popf could be used to save/restore both flags around the loop control. – Pierre Nov 03 '17 at 16:04
  • `popf` is very slow (9 uops, throughput = 18c on Haswell for example). IDK why it's that slow in ring3 where it can't update IF, just DF and condition codes. But LAHF / SAHF is single-uop, and a much better choice for saving/restoring CF. The way compilers use `setc` / `cmp` is probably at least as good, though. – Peter Cordes Nov 03 '17 at 16:46
  • The OF bit is at position 11 in the EFLAGS register, not reachable by LAHF/SAHF. Then it must be something like SETC al, SETO bl to save the 2 flags. What is the shortest way to restore both flags? SHL bl, 7; OR al, bl; RCR al, 1 (set OF from bit 7 ^ 0 and CF from bit 0) – Pierre Nov 03 '17 at 17:53
  • Oh, yeah if compilers wanted to actually interleave ADCX/ADOX. I forgot what the real question was and was just thinking about interleaving two ADC dep chains like clang does now. Good idea with `rcr al,1`, but unfortunately it's 3 uops, and the whole save/restore procedure costs more than replacing `dec` / `jnz` with `lea` / `jrcxz` to loop without modifying flags. Obviously unrolling becomes valuable when loop overhead increases because of preserving flags one way or another. – Peter Cordes Nov 03 '17 at 18:06