Intrinsics for 128-bit multiplication and division

Question

On x86_64 I know that the mul and div opcodes support 128-bit integers by putting the lower 64 bits in rax and the upper 64 bits in rdx. I was looking for an intrinsic that does this in the Intel Intrinsics Guide and could not find one. I am writing a big-number library where the word size is 64 bits. Right now I am doing division by a single word like this:

int ubi_div_i64(ubigint_t* a, ubi_i64_t b, ubi_i64_t* rem)
{
    if(b == 0)
        return UBI_MATH_ERR;

    ubi_i64_t r = 0;

    for(size_t i = a->used; i-- > 0;)
    {

        ubi_i64_t out;
        __asm__("\t"
                "div %[d] \n\t"
                : "=a"(out), "=d"(r)
                : "a"(a->data[i]), "d"(r), [d]"r"(b)
                : "cc");
        a->data[i] = out;


        //ubi_i128_t top = (r << 64) + a->data[i];
        //r = top % b;
        //a->data[i] = top / b;
    }
    if(rem)
        *rem = r;

    return ubi_strip_leading_zeros(a);
}

It would be nice if I could use something from the x86intrin.h header instead of inline asm.

Since asm is already compiler specific, you might as well just use the `__int128` type which will automatically do what you want. — Jester, Sep 12 '15 at 15:58
Take a look at _mulx_u64. Looks like a perfect fit for your use, although it generates the mulx instruction which is present only on newer x86 processors. — , Sep 13 '15 at 00:57
Given the choice between architecture specific intrinsics and architecture specific assembly; the latter is better documented, better supported, more widely understood and easier to maintain (no need to guess what the compiler actually did). — Brendan, Sep 28 '15 at 11:07

score 2 · Answer 1 · answered Sep 12 '15 at 15:59


GCC has the `__int128` and `unsigned __int128` types (also available as `__int128_t` and `__uint128_t`).

Arithmetic with them should use the right assembly instructions when they exist; I've used them in the past to get the upper 64 bits of a product, although I've never used them for division. If it's not using the right ones, submit a bug report / feature request as appropriate.
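As a rough sketch of the "upper 64 bits of a product" case (the helper name `mul_64x64` is made up for illustration, and this assumes a 64-bit GCC or Clang target where `unsigned __int128` is available):

```c
#include <stdint.h>

/* Hypothetical helper: full 64x64 -> 128-bit product, split into two words.
   Returns the low 64 bits and stores the high 64 bits in *hi. */
static uint64_t mul_64x64(uint64_t a, uint64_t b, uint64_t *hi)
{
    /* Widen one operand first so the multiply is done in 128 bits. */
    unsigned __int128 p = (unsigned __int128)a * b;
    *hi = (uint64_t)(p >> 64);
    return (uint64_t)p;
}
```

On x86_64 this would normally be expected to compile down to a single mul (or mulx when BMI2 is enabled), but check the generated assembly for your compiler and flags.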


I disassembled the code after building it with -O3. I was surprised that gcc was calling a function instead of inlining the division when using the 128-bit type. It just seemed kind of slow. – chasep255 Sep 12 '15 at 16:16

@chasep255 GCC doesn't use the "expanded" form of DIV/IDIV because it would be non-conforming. This is true both with 128-bit dividends on 64-bit x86 targets and 64-bit dividends on 32-bit x86 targets. The problem is that DIV will cause divide overflow exceptions in cases where the standard says the result should be truncated. For example `(unsigned long long) (((unsigned __int128) 1 << 64) / 1)` should evaluate to 0, but would cause a divide overflow exception if evaluated with DIV. – Ross Ridge Sep 12 '15 at 17:18
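For reference, a minimal sketch of the single-word division step written with `unsigned __int128` instead of inline asm. The helper name `div_128by64` is hypothetical. It assumes `b != 0` and `hi < b`, which is the situation in a limb-by-limb loop where `hi` is the running remainder, so the quotient always fits in 64 bits. As discussed in the comments, GCC may still emit a call to a library routine here rather than a bare div.

```c
#include <stdint.h>

/* Hypothetical helper: divides the 128-bit value (hi:lo) by b.
   Assumes b != 0 and hi < b, so the quotient fits in 64 bits.
   Returns the quotient and stores the remainder in *rem. */
static uint64_t div_128by64(uint64_t hi, uint64_t lo, uint64_t b, uint64_t *rem)
{
    /* Cast before shifting: shifting a 64-bit value by 64 is undefined. */
    unsigned __int128 top = ((unsigned __int128)hi << 64) | lo;
    *rem = (uint64_t)(top % b);
    return (uint64_t)(top / b);
}
```

Without the `hi < b` precondition the cast to `uint64_t` simply truncates the quotient, which is exactly the case where a raw div would fault instead.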

score 1 · Answer 2 · edited May 23 '17 at 11:44

Last I looked into it, the intrinsics were in a state of flux. The main reason for the intrinsics in this case appears to be that MSVC in 64-bit mode does not allow inline assembly.

With MSVC (and I think ICC) you can use _umul128 for mul and _mulx_u64 for mulx. These don't work in GCC, at least not in GCC 4.9 (_umul128 is much older than GCC 4.9). I don't know if GCC plans to support these, since you can get mul and mulx indirectly through __int128 (depending on your compile options) or directly through inline assembly.

__int128 works fine until you need a larger type and a 128-bit carry. Then you need adc, adcx, or adox, and these are even more of a problem with intrinsics. Intel's documentation disagrees with MSVC, and the compilers don't seem to produce adox yet with these intrinsics. See this question: _addcarry_u64 and _addcarry_u64 with MSVC and ICC.
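To illustrate what the carry chain has to do, here is a minimal sketch that avoids both inline asm and the `_addcarry_u64` family by using the GCC/Clang builtin `__builtin_add_overflow`. This is a swapped-in technique, not something the answer itself used, and `add_limbs` is a hypothetical helper. Whether the compiler turns the loop into an adc chain depends on the compiler version and options, so treat it as a starting point only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: dst += src over n 64-bit limbs, least significant
   limb first. Returns the carry out of the top limb (0 or 1). */
static unsigned add_limbs(uint64_t *dst, const uint64_t *src, size_t n)
{
    unsigned carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t;
        /* Each builtin returns nonzero when the 64-bit addition wrapped. */
        unsigned c1 = __builtin_add_overflow(dst[i], src[i], &t);
        unsigned c2 = __builtin_add_overflow(t, (uint64_t)carry, &dst[i]);
        carry = c1 | c2; /* at most one of the two additions can wrap */
    }
    return carry;
}
```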

Inline assembly is probably the best solution with GCC (and probably even ICC).
