Trying to implement 128 bit add in amd64 with inline assembly with multiple alternative constraints

Question

Trying to get usable 128-bit operations in GCC on amd64, I implemented some inline functions, such as add_128_128_128. I wanted to let the compiler decide which registers to use as inputs and outputs, for maximum flexibility, so I used multiple alternative constraints.

inline __uint128_t add_128_128_128(__uint128_t a, __uint128_t b) {
        uint64_t a_hi = a >> 64;
        uint64_t a_lo = a;
        uint64_t b_hi = b >> 64;
        uint64_t b_lo = b;
        uint64_t retval_hi;
        uint64_t retval_lo;

        asm (
                "\n"
                "       add     %2, %0\n"
                "       adc     %3, %1\n"
                : "=r,r,r,r" (retval_lo)
                , "=r,r,r,r" (retval_hi)
                : "r,0,r,0" (a_lo)
                , "0,r,0,r" (b_lo)
                , "r,1,1,r" (a_hi)
                , "1,r,r,1" (b_hi)
        );

        return ((__uint128_t)retval_hi) << 64 | retval_lo;
}

Now, the generated assembler output is:

_Z11add_128_128oo:
        movq    %rdx, %rax
        movq    %rcx, %rdx
        add     %rdi, %rax
        adc     %rax, %rdx
        ret

What puzzles me is how to get the adc instruction fixed. From thinking about this, I came to the tentative conclusion that even the matching constraints get "new" operand numbers, which would explain %rax showing up as %3 == %0 == %rax. So, is there a way to tell GCC to count only the "r" constraints? (I know that I can get this inline assembly to work by giving up on multiple alternative constraints.)

BTW: Is there any useful documentation of GCC's inline assembly? The official manual has zero examples for the interesting parts, so I would not call it useful in this context, and searching with Google turned up nothing better. All the how-tos cover only the trivial basics and completely omit more advanced topics like multiple alternative constraints.

A better guide to GCC inline assembly, with a focus on x86[-64], can be found [here](http://locklessinc.com/articles/gcc_asm/). — Brett Hale, Sep 20 '13 at 08:13
For 128-bit integers specifically, use `unsigned __int128` instead of inline asm. [Is there a 128 bit integer in gcc?](https://stackoverflow.com/q/16088282) — Peter Cordes, Dec 06 '22 at 16:57

FrankH. · Accepted Answer · 2013-09-20T08:42:57.180

The first thing that comes to mind is:

inline __uint128_t add_128_128_128(__uint128_t a, __uint128_t b) {
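    asm("add %1, %%rax\n\t"   /* low halves: rax += b_lo, sets carry */
        "adc %2, %%rdx"       /* high halves: rdx += b_hi + carry */
        : "+A"(a)             /* a lives in RDX:RAX */
        : "r"((uint64_t)b), "r"((uint64_t)(b >> 64))
        : "cc");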
    return a;
}

That works because GCC can treat RDX:RAX as a double-sized register pair with the "A" constraint. This is sub-optimal, though, particularly for inlining, because it doesn't take into account that the two operands are interchangeable, and by always returning in RDX:RAX it also restricts the register choices.

To get that commutativity in, you can use the % constraint modifier:

inline __uint128_t add_128_128_128(__uint128_t a, __uint128_t b) {
    uint64_t a_lo = a, a_hi = a >> 64, b_lo = b, b_hi = b >> 64;
    uint64_t r_lo, r_hi;
    asm("add %3, %0\n\t"
        "adc %5, %1"
        : "=r"(r_lo), "=r"(r_hi)
        : "%0" (a_lo), "r"(b_lo), "%1"(a_hi), "r"(b_hi)
        : "cc");
    return ((__uint128_t)r_hi) << 64 | r_lo;
}

The % indicates to GCC that this operand and the next one are interchangeable.
This creates the following code (non-inlined):

Disassembly of section .text:

0000000000000000 <add_128_128_128>:
   0:   48 89 f8                mov    %rdi,%rax
   3:   48 01 d0                add    %rdx,%rax
   6:   48 11 ce                adc    %rcx,%rsi
   9:   48 89 f2                mov    %rsi,%rdx
   c:   c3                      retq

which looks pretty much like what you wanted?
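On the operand-numbering puzzle from the question: every operand, including those with matching constraints, occupies its own %N position, so the template cannot count only the "r" operands. A minimal sketch, assuming GCC's symbolic operand names (`%[name]`) and a single fixed alternative, avoids counting positions altogether. The function name, the operand names and the self-check are hypothetical, and the non-x86 fallback is only there so that the snippet compiles elsewhere:

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;

/* 128-bit add with named asm operands (sketch; x86-64 GCC/Clang assumed). */
static inline u128 add_128_named(u128 a, u128 b) {
#if defined(__x86_64__)
    uint64_t lo = (uint64_t)a, hi = (uint64_t)(a >> 64);
    __asm__("add %[b_lo], %[lo]\n\t"   /* lo += b_lo, sets the carry flag */
            "adc %[b_hi], %[hi]"       /* hi += b_hi + carry */
            /* "+&r": lo is written before b_hi is read, so it is marked
               early-clobber to keep it out of the input registers. */
            : [lo] "+&r"(lo), [hi] "+r"(hi)
            : [b_lo] "r"((uint64_t)b), [b_hi] "r"((uint64_t)(b >> 64))
            : "cc");
    return ((u128)hi << 64) | lo;
#else
    return a + b;   /* portable fallback for other targets */
#endif
}

/* Hypothetical self-check against the compiler's own 128-bit add. */
static void check_add_128_named(void) {
    u128 a = ((u128)1 << 64) - 1;                     /* low half all ones */
    assert(add_128_named(a, 1) == ((u128)1 << 64));   /* carry reaches high half */
    assert(add_128_named(a, a) == a + a);
}
```

With names instead of numbers, reordering or adding operands no longer silently changes which register a `%N` refers to.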

Unfortunately, even as of [gcc-4.8.1](http://gcc.gnu.org/onlinedocs/gcc-4.8.1/gcc/Modifiers.html#Modifiers), the use of more than one commutative pair may fail or yield an ICE. — Brett Hale, Sep 19 '13 at 10:13
I'm curious to find an example where the above simple code will fail ... but then, in newer compilers, you can always use builtin `int128_t` support (gcc 4.x, Intel ICC >= 10, current MSVC all support it), and just use `__uint128_t x = a + b;` ... — FrankH., Sep 20 '13 at 09:05
I would be surprised if this is true in practice - the asm section of the manual is often out of date, like the prohibition on the `"+m"` constraint, which was only corrected in the documentation recently. — Brett Hale, Sep 20 '13 at 10:21

Brett Hale · Answer 2 · 2013-09-21T13:36:55.330

Have a look at the longlong.h header included in projects like GMP and GCC. You will find macros like:

#define add_ssaaaa(sh, sl, ah, al, bh, bl) \
  __asm__ ("addq %5,%q1\n\tadcq %3,%q0"                                 \
           : "=r" (sh), "=&r" (sl)                                      \
           : "0"  ((UDItype)(ah)), "rme" ((UDItype)(bh)),               \
             "%1" ((UDItype)(al)), "rme" ((UDItype)(bl)))

which should be easy enough to turn into an inline function with the __uint128_t type. You might want to add something like: __attribute__ ((__always_inline__)) to force inlining, regardless of the compiler flags.
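A rough sketch of that conversion, assuming x86-64 and GCC-style extended asm. The function name `add_128_ssaaaa` and the self-check are hypothetical, `uint64_t` stands in for `UDItype`, and a "cc" clobber is added, which the macro itself does not list:

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;

/* add_ssaaaa-style operands wrapped in a function (sketch, not the GMP code). */
static inline __attribute__((__always_inline__))
u128 add_128_ssaaaa(u128 a, u128 b) {
#if defined(__x86_64__)
    uint64_t sh, sl;
    __asm__("addq %5,%q1\n\tadcq %3,%q0"
            : "=r"(sh), "=&r"(sl)          /* %0 = high sum, %1 = low sum */
            : "0"((uint64_t)(a >> 64)),    /* %2: a_hi, tied to %0 */
              "rme"((uint64_t)(b >> 64)),  /* %3: b_hi */
              "%1"((uint64_t)a),           /* %4: a_lo, tied to %1, commutative */
              "rme"((uint64_t)b)           /* %5: b_lo */
            : "cc");
    return ((u128)sh << 64) | sl;
#else
    return a + b;   /* portable fallback for other targets */
#endif
}

/* Hypothetical self-check against the compiler's own 128-bit add. */
static void check_add_128_ssaaaa(void) {
    u128 a = ((u128)1 << 64) - 1;                      /* low half all ones */
    assert(add_128_ssaaaa(a, 1) == ((u128)1 << 64));   /* carry reaches high half */
    assert(add_128_ssaaaa(a, a) == a + a);
}
```

Note that only the low halves form a commutative pair here, which stays within the one pair per asm that the GCC manual says it can handle.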

Furthermore, have you looked at the code generated for the expression: a + b? I would expect it to yield the add/adc instruction pair you want, which was part of the motivation for this extended type.

Here's what a u128 x u64 -> u128 function call yields (gcc-4.8.1):

    imulq   %rdx, %rsi
    movq    %rdx, %rax
    mulq    %rdi
    addq    %rsi, %rdx
    ret

And u128 x u128 -> u128:

    imulq   %rdx, %rsi
    movq    %rdi, %rax
    imulq   %rdi, %rcx
    mulq    %rdx
    addq    %rcx, %rsi
    addq    %rsi, %rdx
    ret
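For reference, the C source behind listings like these would presumably be nothing more than the `*` operator on the extended type. A minimal sketch, with hypothetical function names and a hypothetical self-check (the exact instructions emitted will depend on compiler version and flags):

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;

/* u128 x u64 -> u128, truncated to 128 bits */
u128 mul_128_64(u128 a, uint64_t b) {
    return a * b;
}

/* u128 x u128 -> u128, truncated to 128 bits */
u128 mul_128_128(u128 a, u128 b) {
    return a * b;
}

/* Hypothetical self-check of the wrap-around behaviour. */
static void check_mul_128(void) {
    /* (2^64 + 2) * 3 = 3 * 2^64 + 6 */
    assert(mul_128_64(((u128)1 << 64) + 2, 3) == (((u128)3 << 64) + 6));
    /* 2^64 * 2^64 = 2^128, which wraps to 0 */
    assert(mul_128_128((u128)1 << 64, (u128)1 << 64) == 0);
}
```

Compiling such a file with `gcc -O2 -S` is one way to inspect the generated code.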

Yes, the direct use of 128bit ints, `c = a + b`, creates pretty much the same code (`add` followed by `adc`). — FrankH., Sep 20 '13 at 09:07
Actually, I have, and a+b is almost the only thing gcc does efficiently. This question is not primarily about a+b but more about inline assembly. Have you looked at a*b with a u128 and b u64? I have, and at that moment I decided NOT to use GCC's arithmetic support for u128 but to do it by hand. Thanks for your answer anyway. About the inline tip: if the user compiles with -fno-inline, he normally has a reason to do so, and I don't want to interfere with that. — Bodo Thiesen, Sep 20 '13 at 18:49
@BodoThiesen - I fail to see what's so bad about the compiler-generated code. — Brett Hale, Sep 21 '13 at 13:42

score 0 · Answer 3 · answered Dec 08 '13 at 22:51

Not helpful for GCC, but someone using Clang might be happy about this finding: http://clang.llvm.org/docs/LanguageExtensions.html

This allows you to implement what you want without needing to know the target's assembly language. I couldn't find anything like this for GCC though :(
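The answer does not say which extension is meant. Presumably it is Clang's multiprecision arithmetic builtins, so here is a sketch under that assumption using `__builtin_addcll`. The function name and the self-check are hypothetical, `unsigned long long` is assumed to be 64 bits wide, and the fallback branch uses `__builtin_add_overflow`, which newer GCC and Clang versions provide:

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;

#ifndef __has_builtin
#define __has_builtin(x) 0   /* older compilers: take the fallback path */
#endif

/* 128-bit add built from two 64-bit add-with-carry steps, no asm. */
static inline u128 add_128_builtin(u128 a, u128 b) {
    unsigned long long a_lo = (unsigned long long)a;
    unsigned long long a_hi = (unsigned long long)(a >> 64);
    unsigned long long b_lo = (unsigned long long)b;
    unsigned long long b_hi = (unsigned long long)(b >> 64);
    unsigned long long lo, hi, carry;
#if __has_builtin(__builtin_addcll)
    lo = __builtin_addcll(a_lo, b_lo, 0, &carry);      /* carry out of low half */
    hi = __builtin_addcll(a_hi, b_hi, carry, &carry);  /* final carry is dropped */
#else
    carry = __builtin_add_overflow(a_lo, b_lo, &lo);   /* 1 if the low add wrapped */
    hi = a_hi + b_hi + carry;
#endif
    return ((u128)hi << 64) | lo;
}

/* Hypothetical self-check against the compiler's own 128-bit add. */
static void check_add_128_builtin(void) {
    u128 a = ((u128)1 << 64) - 1;                       /* low half all ones */
    assert(add_128_builtin(a, 1) == ((u128)1 << 64));   /* carry reaches high half */
    assert(add_128_builtin(a, a) == a + a);
}
```

Whether this compiles down to a plain add/adc pair depends on the compiler and optimization level, so it is worth checking the generated assembly.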
