MSVC's instrinsics __emulu and _umul128 in GCC/CLang

Question

In MSVC there exist instrinsics __emulu() and _umul128(). First does u32*u32->u64 multiplication and second u64*u64->u128 multiplication.

Do same intrinsics exist for CLang/GCC?

Closest I found are _mulx_u32() and _mulx_u64() mentioned in Intel's Guide. But they produce mulx instruction which needs BMI2 support. While MSVC's intrinsics produce regular mul instruction. Also _mulx_u32() is not available in -m64 mode, while __emulu() and _umul128() both exist in 32 and 64 bit mode of MSVC.

You may try online 32-bit code and 64-bit code.

Of cause for 32-bit one may do return uint64_t(a) * uint64_t(b); (see it online) hoping that compiler will guess correctly and optimize to using u32*u32->u64 multiplication instead of u64*u64->u64. But is there a way to be sure about this? Not to rely on compiler's guess that both arguments are 32-bit (i.e. higher part of uint64_t is zeroed)? To have some intrinsics like __emulu() that make you sure about code.

There is __int128 in GCC/CLang (see code online) but again we have to rely on compiler's guess that we actually multiply 64-bit numbers (i.e. higher part of int128 is zeroed). Is there a way to be sure without compiler guessing, if there exist some intrinsics for that?

BTW, both uint64_t (for 32-bit) and __int128 (for 64-bit) produce correct mul instruction instead of mulx in GCC/CLang. But again we have to rely that compiler guesses correctly that higher part of uint64_t and __int128 is zeroed.

Of cause I can look into assembler code that GCC/Clang have optimized and guessed correctly, but looking at assembler once doesn't guarantee that same will happen always in all circumstances. And I don't know of a way in C++ to statically assert that compiler did correct guess about assembler instructions.

https://github.com/yuikns/intrin/blob/master/intrin_x86.h#L769 ? Took me less then 3 mins to find. `Do same intrinsics exist for CLang/GCC?` Did you read compilers documentation to check for yourself? ex. here: https://gcc.gnu.org/onlinedocs/gcc-11.1.0/gcc/x86-Built-in-Functions.html#x86-Built-in-Functions `Is there a way to be sure without compiler guessing through using some intrinsic?` No, the name "intrinsic" already means it's something compiler dependent. — KamilCuk, May 24 '21 at 12:41
@KamilCuk Thanks! Looks like a nice solution through assembler. Would be great if you post this assembler code for both `__emulu()` and `_umul128()` as an answer. I can accept it. — Arty, May 24 '21 at 12:43
@KamilCuk Also I don't see 128-bit version there. Do you know assembler code for 128-bit? — Arty, May 24 '21 at 12:44
@KamilCuk So if you know how to code this solution for 32 and 64 bit in GCC/Clang assembler, please post this asm as an answer to my question. — Arty, May 24 '21 at 12:46

score 2 · Accepted Answer · answered May 24 '21 at 12:54

2

You already have the answer. Use uint64_t and __uint128_t. No intrinsics needed. This is available with modern GCC and Clang for all 64-bit targets. See Is there a 128 bit integer in gcc?

#include <stdint.h>
typedef __uint128_t uint128_t;

// 32*32=64 multiplication
f(uint32_t a, uint32_t b) {
   uint64_t ab = (uint64_t)a * b;
}

//64*64=128 multiplication
f(uint64_t a, uint64_t b) {
    uint128_t ab = (uint128_t)a * b;
}

Note that the cast must be on the operands, or at least on one operand. Casting the result would not work since it would do the multiplication with the shorter type and extend the result.

But is there a way to be sure about this? Not to rely on compiler's guess

You get exactly the same guarantee as with compiler intrinsics: that the value of the result is correct. There are never any guarantees about optimization. Just because you used intrinsics doesn't guarantee that the compiler will emit the “obvious” assembly instruction. The only way to have this guarantee is to use inline assembly, and for a simple operation like this it's likely to hurt performance because it would restrict the ways in which the compiler optimizes register usage.

answered May 24 '21 at 12:54

Gilles 'SO- stop being evil'

104,111
38
209
254

As far as I know there is no restriction in standard that says that `uint64_t(a) * uint64_t(b)` should always do short multiplication in case when both `a` and `b` are 32-bit, i.e. when higher part of `uint64_t(a)` is zeroed. Compiler may easily do long arithmetics full multiplication. And I have very sensitive and performance hungry code in my library. That's why I need to be sure. Do you know GCC/CLang assembler code for doing strictly what I want and really be sure? Even if this will disallow other compiler's optimizations. – Arty May 24 '21 at 12:58
1

@Arty If your code is performance-sensitive, you need to benchmark your code (not some toy example) on your target processor (not a different model). Memory accesses cost a lot more than multiplication, so register allocation and code size of inner loops tend to matter more than arithmetic. – Gilles 'SO- stop being evil' May 24 '21 at 13:03
If compiler accidentally decides to do long arithmetics multiplication (multiple instructions) instead of single instruction `mul` multiplication then it will be worse in all cases no matter how user uses my library function, because long arithmetics will contain same `mul` for lower-word computation hence this long version will always be an over-set of short version hence always taking more time. And no, final program is not memory-bound, it's expected to use much more arithmetics computations rather than random memory access. – Arty May 24 '21 at 13:14
And my function is a library function, I don't have access to final program and don't have access even to final CPU architecture, hence I can't control and optimize final program, it will just use my library function and my responsibility is to make it as performant as possible and make sure that it always performant independently on compiler's guess. Only thing I know that final program is computation-bound and not memory-bound. – Arty May 24 '21 at 13:14
@arty If you do not know the target architecture, how do you know it would support `__int128`? Even with GCC, its support is limited. – Daniel Langr May 24 '21 at 13:18
@Arty For a library function you can't get a final answer. Unless you restrict the library to be used only with dynamic loading in environments without a JIT, it's possible that the library function will be inlined differently in different places. – Gilles 'SO- stop being evil' May 24 '21 at 13:22
@DanielLangr I know something, like that I should provide two solutions - one for 32-bit architecture and one for 64-bit. In case of 64-bit architecture I think I should be sure that `__int128` exists in GCC/CLang, not? Also as I told in my question `__int128` is just an example of solving my task, but what I actually wanted is to have some strict intrinsic doing just `64*64->128` multiplication. Exactly what __emulu() and _umul128() do. – Arty May 24 '21 at 13:24
@Gilles'SO-stopbeingevil' But if my 32-bit version will have signature `uint64_t Mul32x32to64(uint32_t a, uint32_t b)` does it matter how it would be inlined or not? If I write assembler code inside it how it may behave incorrectly in case of different styles of inlining? – Arty May 24 '21 at 13:27
@Arty So, basically, you want some "microlibrary" that would provide such multiplication with a focus on maximizing performance on different targets? I guess you might then need a lot of conditional-compilation branching based on the combination of a compiler and target processor architecture. – Daniel Langr May 24 '21 at 13:29
@Arty Whether it's inlined or not doesn't matter for correctness, but it definitely matters for performance. – Gilles 'SO- stop being evil' May 24 '21 at 13:30
@DanielLangr What about MSVC? If I use intrinsics __emulu() and _umul128() doesn't it automically solve the problem of best optimizations for all targets? If it does then I wanted to have same intrinsics for CLang/GCC. – Arty May 24 '21 at 13:31
@arty First, `_umul128` seems to be available only for x64, so there are no other targets. Second, how do you know that `_umul128` will be translated into a single multiplication instruction? I can't see such a guarantee in the docs. I think that if you want a single instruction for sure, you need to write it in the assembly. – Daniel Langr May 24 '21 at 13:34
@DanielLangr I wanted to use _umul128() for 64-bit library version and __emulu() for 32-bit. How it can at all happen that in 64-bit CPU _umul128() is translated to more than one instruction? Only if there is no 64x64->128 single instruction available on this CPU, in this case for me is alright to have multi-instruction version. – Arty May 24 '21 at 13:36
I still don't understand why do you think that `_umul128` with MSVC is better than `__uint128_t * __uint128_t` (suggested in this answer)? What additional guarantees gives you `_umul128`? – Daniel Langr May 24 '21 at 13:44
@DanielLangr _umul128 has two 64-bit inputs, while `__uint128_t * __uint128_t` has two 128 bit inputs. So _umul128 guarantees that it will not do long-arithmetics multiplication, there is no point of implementing such function. In other words _umul128 as having two 64-bit inputs KNOWS that higher parts are zeroed, while `__uint128_t * __uint128_t` as far as I understand directly knows that inputs are 128 bits (meaning that higher parts may not be 0). Only because compiler traces down all arguments like `__uint128(uint64_t(x))` it may know but only indirectly that higher 128-bit part is zero. – Arty May 24 '21 at 13:57
@DanielLangr As far as I know from standard `__uint128_t(uint64_t(x)) * __uint128_t(uint64_t(y))` is not forcing compiler to use only 64x64->128 instead of 128x128->128 multiplication. That is only types optimization stage that makes compiler think that it should do only truncated 64x64->128 multiplication. It may or may not do such optimization. – Arty May 24 '21 at 13:59
@Arty I see (finally ;), thanks for the explanation. I then think you really need to use assembler if you don't want to rely on compiler optimization abilities. – Daniel Langr May 24 '21 at 14:00
@Arty No, `_umul128` does not guarantee that it won't do 128-bit multiplications. It probably won't, but only to the same extent that `(__uint128_t)a * b` won't with GCC or Clang either. You seem to assume that the compiler will not look at the nature of the arguments when it decides how to compile the `*` operator. This is not how compilers work. – Gilles 'SO- stop being evil' May 24 '21 at 14:03
@Gilles'SO-stopbeingevil' I thought that if something is not restricted in C++ standard then you can't rely on it to be done in 100% cases. And I think in standard it is not said that `uint128(uint64(x)) * uint128(uint64(y))` should be always done through 64x64->128 optimization. I know that many good things are done by compilers. But would be great if I can be sure about some things, for example if 64x64->128 optimization can be restricted at least by some magical `static_assert()` then I would be happy. If my code doesn't compile if such optimization is not done, I'll be quite happy. – Arty May 24 '21 at 14:11
@Arty Nothing in either the C++ standard or the MSVC documentation states that `_umul128` will do such an optimization either. You have the same guarantees either way. If you think you know better than the compiler, use inline assembly. But it's unlikely to result in faster code. Compiler writers know better than programmers how to make code run fast. – Gilles 'SO- stop being evil' May 24 '21 at 14:56
@Arty: GNU C didn't bother to include an intrinsic because GCC (and other GNU C compilers like clang and ICC) don't suck, and are able to optimize `a * (uint32_t)b` into a single `mul` instruction on 32-bit targets. (Unlike MSVC, which is dumb but has an intrinsic.) Demo on [Getting the high part of 64 bit integer multiplication](https://stackoverflow.com/a/50958815) – Peter Cordes May 24 '21 at 23:52
@Arty: Another example with asm output of MSVC sucking at `a * (uint64_t)b` without the intrinsic, but clang doing much better. (To build a 64x64 => 128-bit multiply for 32-bit code: [\_umul128 on Windows 32 bits](https://stackoverflow.com/q/46870373) - GCC's missed optimizations are in not turning that C into `add`/`adc` the way clang does; it does still use widening mul.) – Peter Cordes May 24 '21 at 23:57
@Arty I forgot to mention another downside of assembly: it removes any possibility for the compiler to decide to use vector instructions. – Gilles 'SO- stop being evil' May 25 '21 at 09:03
@Gilles'SO-stopbeingevil' I wonder if there is any possibility to `static_assert()` saying "break compilation if generated code used more than one mul instruction"? – Arty May 25 '21 at 09:11
@Arty No. Not only because you can't assert on the generated code, but also because “generated code used more than one mul instruction” is not well-defined. For example, this may be true in some places where the code is inlined and false in others. Or the generated code might not use multiplication at all because one of the operands was a compile-time constant. Or the multiplication may be coalesced with something else in a vector instruction. – Gilles 'SO- stop being evil' May 25 '21 at 09:45

MSVC's instrinsics __emulu and _umul128 in GCC/CLang

1 Answers1