15

Modern CPU's can perform extended multiplication between two native-size words and store the low and high result in separate registers. Similarly, when performing division, they store the quotient and the remainder in two different registers instead of discarding the unwanted part.

Is there some sort of portable gcc intrinsic which would take the following signature:

void extmul(size_t a, size_t b, size_t *lo, size_t *hi);

Or something like that, and for division:

void extdiv(size_t a, size_t b, size_t *q, size_t *r);

I know I could do it myself with inline assembly and shoehorn portability into it by throwing #ifdef's in the code, or I could emulate the multiplication part using partial sums (which would be significantly slower) but I would like to avoid that for readability. Surely there exists some built-in function to do this?

Thomas
  • 3,321
  • 1
  • 21
  • 44
  • But when optimizing with `-O3` GCC will probably emit the right processor-specific extended multiplication or division (perhaps even using both modulus & remainder, if your program use both), so I would not bother about that in your code. – Basile Starynkevitch Nov 02 '12 at 00:37
  • @BasileStarynkevitch But you'd need 128-bit integer-support - which GCC does not have for x64. There's no way you can "trick" GCC into emitting a 64x64 -> 128-bit multiply or a 128-bit/64-bit -> 64-bit divide using just 64-bit integers in C. – Mysticial Nov 02 '12 at 00:41
  • If you need 128-bit, gcc might have some extended type like `__int128` or something that could get you the effects you want without cpu-specific code. For 64-bit (on 32-bit machines), just using `long long` works fine. – R.. GitHub STOP HELPING ICE Nov 02 '12 at 00:43
  • @R.. Unfortunately, [`__int128` only appears to be supported if the hardware actually has 128-bit integers.](http://gcc.gnu.org/onlinedocs/gcc/_005f_005fint128.html) – Mysticial Nov 02 '12 at 00:46
  • Won't [`std::div`](http://en.cppreference.com/w/cpp/numeric/math/div) take advantage of that hardware instruction anyway? – Kerrek SB Nov 02 '12 at 00:47
  • 1
    FWIW, MSVC has an intrinsic for the extended multiply that you're looking for (64x64 ->128). In GCC x64, I just use an inline assembly macro. – Mysticial Nov 02 '12 at 00:48
  • @KerrekSB std::div sounds promising (it's C++ but that's acceptable). I would expect it to make full use of hardware instructions as well. But.. of course.. there is no std::mul. – Thomas Nov 02 '12 at 00:49
  • 1
    @Mystical The "integer mode" in the link you gave does not correspond to the native integer width, it is rather the "machine mode". Most 64 bit targets do actually support QImode, which is 128 bit. Especially on x86_64 128bit ints work just fine. – Gunther Piez Nov 02 '12 at 01:23

2 Answers2

22

For gcc since version 4.6 you can use __int128. This works on most 64 bit hardware. For instance

To get the 128 bit result of a 64x64 bit multiplication just use

void extmul(size_t a, size_t b, size_t *lo, size_t *hi) {
    __int128 result = (__int128)a * (__int128)b;
    *lo = (size_t)result;
    *hi = result >> 64;
}

On x86_64 gcc is smart enough to compile this to

   0:   48 89 f8                mov    %rdi,%rax
   3:   49 89 d0                mov    %rdx,%r8
   6:   48 f7 e6                mul    %rsi
   9:   49 89 00                mov    %rax,(%r8)
   c:   48 89 11                mov    %rdx,(%rcx)
   f:   c3                      retq   

No native 128 bit support or similar required, and after inlining only the mul instruction remains.

Edit: On a 32 bit arch this works in a similar way, you need to replace __int128_t by uint64_t and the shift width by 32. The optimization will work on even older gccs.

Gunther Piez
  • 29,760
  • 6
  • 71
  • 103
  • So is there really no way to make this work in general for any word size (not just 64-bit) without resorting to a bunch of #ifdef's? If not, I will accept this answer.. – Thomas Nov 02 '12 at 01:01
  • This will work on a 32 bit arch in a similar way if you you need the two 32 bit parts of the 64 bit result. You need to make the shift width 32 bit - I am sure there is an appropriate macro somewhere defined in the stdlib – Gunther Piez Nov 02 '12 at 01:03
  • 2
    @Thomas What is probably missing though is something like `dsize_t` which would be a double width `size_t` type - like the `__int128` or `uint64_t` on 64/32 bit arches. – Gunther Piez Nov 02 '12 at 01:11
  • Wow, I'm surprised. Because I swear I tried `__int128` before (on x64). And it always said that it wasn't supported. Maybe I was using an older GCC or I was missing a header. – Mysticial Nov 02 '12 at 02:40
  • You need a recent GCC, e.g. 4.7, to get good enough `__int128`; and you may need additional compiling flags, perhaps `-std=gnu11` – Basile Starynkevitch Nov 02 '12 at 07:28
  • @BasileStarynkevitch I am quite sure I used it for this exact purpose on an older gcc, maybe even 4.5, but I don't know exactly. No additional flags are required though, maybe because there is no collision with the C namespace. – Gunther Piez Nov 02 '12 at 08:13
  • 1
    @BasileStarynkevitch Just tested it, gcc-4.5 does not know 128bit ints, gcc-4.6 does. – Gunther Piez Nov 02 '12 at 08:20
  • 1
    For completeness: Intel ICC version 13.0 recognizes `__int128` (and/or `__int128_t`) as well, ICC 12.1 and below doesn't know it. – FrankH. Nov 09 '12 at 11:29
  • For ARM Cortex M3, `unsigned q32mul(unsigned a, unsigned b) { return ((unsigned long long)a * b) >> 32; }` becomes `umull r0, r1, r1, r0; mov r0, r1`, but the `mov` is unnecessary (`umull r1, r0, r1, r0` is preferable). – Aksel Dec 20 '13 at 21:32
  • Do you need to cast both? You could save repetition and allow promotion rules to promote the other operand of the multiply to 128 bit implicitly, right? The trick here is that the compiler realizes that neither side of the multiply actually needs to be 128 bits at the start, so it does the intended 64bitx64bit=128bit. – doug65536 Jun 07 '19 at 02:03
9

For those wondering about the other half of the question (division), gcc does not provide an intrinsic for that because the processor division instructions don't conform to the standard.

This is true both with 128-bit dividends on 64-bit x86 targets and 64-bit dividends on 32-bit x86 targets. The problem is that DIV will cause divide overflow exceptions in cases where the standard says the result should be truncated. For example (unsigned long long) (((unsigned _int128) 1 << 64) / 1) should evaluate to 0, but would cause divide overflow exception if evaluated with DIV.

(Thanks to @ross-ridge for this info)

kleptog
  • 631
  • 6
  • 9
  • 2
    This is a reason for *not optimizing 64/32 into idiv*, not a reason for *not providing an intrinsic*. There are many `__builtin` functions of GCC that invokes undefined behavior with some arguments. – user202729 Jan 11 '20 at 02:35
  • It's more because the 64 bit mul instruction returns the results in _two_ registers, one for high part and one for the low part, the arguments are still 64 bit (it's similar for 32bit CPUs with 64 bit result). For division the dividend and divisor would have to be passed in as two registers, and there is no instruction like that (there is partial one for dividing 128 bit with 64 bit, at least on x86_64). – Hubert Kario Jul 25 '22 at 19:26