_umul128 on Windows 32 bits

Question

In Visual C++, _umul128 is undefined when targeting 32-bit Windows.

How can two unsigned 64-bit integers be multiplied when targeting Win32? The solution only needs to work on Visual C++ 2017 targeting 32-bit Windows.

https://msdn.microsoft.com/en-us/library/windows/desktop/aa383711(v=vs.85).aspx — Michael Petch, Oct 22 '17 at 04:15
Do you need to do a lot of this? It could possibly be worth using SSE2 or AVX2, depending on what you want to do. Although probably not, because there's no SIMD add-with-carry. But still, see https://stackoverflow.com/questions/17863411/sse-multiplication-of-2-64-bit-integers for getting a 64-bit result. It would take significantly more instructions to get a 128b result, though. https://stackoverflow.com/questions/6738283/can-xmm-registers-be-used-to-do-any-128-bit-integer-math — Peter Cordes, Oct 22 '17 at 04:17
[SIMD signed with unsigned multiplication for 64-bit * 64-bit to 128-bit](https://stackoverflow.com/questions/28807341/simd-signed-with-unsigned-multiplication-for-64-bit-64-bit-to-128-bit), possible speedup for 32-bit code vs. using `adc` + `mul` (or BMI2 `mulx`) as a building blocks for large arrays of numbers if you have AVX2 or AVX512, but for one at a time scalar is probably better. — Peter Cordes, Oct 22 '17 at 04:24
Also see [SSE multiplication of 2 64-bit integers](https://stackoverflow.com/q/17863411/608639) and [Is it possible to use SSE and SSE2 to make a 128-bit wide integer?](https://stackoverflow.com/q/12200698/608639) — jww, Jan 09 '19 at 03:41

Peter Cordes · Answer 1 · 2017-10-25T06:37:39.683

This answer has a version of the xmrrig function from the other answer optimized for MSVC 32-bit mode. The original is fine with other compilers, especially clang.

I looked at MSVC's output for @Augusto's function, and it's really bad. Using __emulu for 32x32 => 64b multiplication improved it significantly (because MSVC is dumb and doesn't optimize uint64_t * uint64_t = uint64_t for the case where the inputs are known to really only be 32-bit, with the upper half zero). Other compilers (gcc and clang) generate a single mul instruction instead of calling a helper function. There are other problems with MSVC's code-gen for this that I don't know how to fix by tweaking the source, though. I think if you want good performance on that compiler, you'll have to use inline asm (or a separately-compiled asm function).

If you need more flexible arbitrary-precision (larger numbers), see GMPlib's low-level functions which have asm implementations, instead of trying to build a 256b multiply out of this __umul128. But if you need this exactly, then it's worth trying. Sticking with C++ enables constant-propagation and CSE optimizations that you wouldn't get with asm.

clang compiles this with no major problems, actually using adc for all the add-with-carry (except one which it saves with a setc instruction). MSVC branches on the carry checks, and just makes nasty code. GCC doesn't do a very good job either, with some branching on carry. (Because gcc doesn't know how to turn carry = sum < a into an adc, gcc bug 79173.)

IDK if MSVC or gcc support any add-with-carry intrinsics for 64-bit integers in 32-bit mode. _addcarry_u64 generates poor code with gcc anyway (in 64-bit mode), but ICC might do ok. IDK about MSVC.

If you want an asm implementation, I'd suggest using clang 5.0's output from this function. You could probably find some optimizations by hand, but it's certainly better than MSVCs. But of course most of the arguments in https://gcc.gnu.org/wiki/DontUseInlineAsm apply: blocking constant-propagation is a major problem if you ever multiply by something that inlining turns into a constant, or by a number with the upper half known to be zero.

Source + asm output for MSVC 32-bit and clang5.0 32-bit on Godbolt

nice code with clang. Kinda bad code with MSVC, but better than before. Kinda bad with gcc also (no change vs. other answer).

#include <stdint.h>

#ifdef _MSC_VER
#  include <intrin.h>
#else
// MSVC doesn't optimize 32x32 => 64b multiplication without its intrinsic
// But good compilers can just use this to get a single mul instruction
static inline
uint64_t __emulu(uint32_t x, uint32_t y) {
     return x * (uint64_t)y;
}
#endif

// This is still pretty ugly with MSVC, branching on the carry
//  and using XMM store / integer reload to zero a register!
// But at least it inlines 4 mul instructions
//  instead of calling a generic 64x64 => 64b multiply helper function
uint64_t __umul128(uint64_t multiplier, uint64_t multiplicand, 
    uint64_t *product_hi) 
{
    // multiplier   = ab = a * 2^32 + b
    // multiplicand = cd = c * 2^32 + d
    // ab * cd = a * c * 2^64 + (a * d + b * c) * 2^32 + b * d
    uint64_t a = multiplier >> 32;
    uint64_t b = (uint32_t)multiplier; // & 0xFFFFFFFF;
    uint64_t c = multiplicand >> 32;
    uint64_t d = (uint32_t)multiplicand; // & 0xFFFFFFFF;

    //uint64_t ac = __emulu(a, c);
    uint64_t ad = __emulu(a, d);
    //uint64_t bc = __emulu(b, c);
    uint64_t bd = __emulu(b, d);

    uint64_t adbc = ad + __emulu(b , c);
    uint64_t adbc_carry = (adbc < ad); // ? 1 : 0;
    // MSVC gets confused by the ternary and makes worse code than using a boolean in an integer context for 1 : 0

    // multiplier * multiplicand = product_hi * 2^64 + product_lo
    uint64_t product_lo = bd + (adbc << 32);
    uint64_t product_lo_carry = (product_lo < bd); // ? 1 : 0;
    *product_hi = __emulu(a , c) + (adbc >> 32) + (adbc_carry << 32) + product_lo_carry;

    return product_lo;
}

Make sure you only use this in 32-bit code. In 64-bit code, it fails to optimize to a single 64-bit mul instruction (which produces both 64-bit halves of the full result). Compilers that implement the GNU C++ extensions (clang, gcc, ICC) can use unsigned __int128 and get good code. e.g. a * (unsigned __int128)b produces a 128b result. (Example on Godbolt).

For `adbc_carry = (adbc < ad)` you might try `adbc_carry = !!(adbc < ad)`. The MSVC compiler recognizes the pattern and does the boolean to int conversion. I read about it on a Microsoft blog. — jww, Aug 21 '18 at 18:28
Useful answer, upvoted! There exists also [_addcarry_u64()](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=addcarry) intrinsic which adds with carry and returns carry bit, don't know how clever is MSVC in guessing to use it automatically, but using it explicitly may improve speed a bit. Also don't know if this intrinsic is really present on all 32-bit machines, but at least linked URL doesn't say any special requirement to CPU. — Arty, May 19 '21 at 03:32
@Arty: MSVC, clang, and GCC unfortunately don't define `_addcary_64` for 32-bit targets. (I checked on https://gcc.godbolt.org/z/9jMGbYKnr). No obvious reason they couldn't; adc/adc is easy to emit. But in general 64-bit scalar intrinsics are only defined for x86-64 target machines, which makes sense for `_lzcnt_u64` and so on where there's no super obvious emulation. I could have tried to use `_addcarry_u32`, but since I want two `uint64_t` halves that would be harder to use. And compilers other than ICC tend to do a bad job with `_addcarry` anyway, not keeping carry in CF. — Peter Cordes, May 19 '21 at 04:04

score 2 · Accepted Answer · answered Oct 25 '17 at 02:43

2

I found the following code (from xmrrig), which seems to do the job just fine:

static inline uint64_t __umul128(uint64_t multiplier, uint64_t multiplicand, 
    uint64_t *product_hi) 
{
    // multiplier   = ab = a * 2^32 + b
    // multiplicand = cd = c * 2^32 + d
    // ab * cd = a * c * 2^64 + (a * d + b * c) * 2^32 + b * d
    uint64_t a = multiplier >> 32;
    uint64_t b = multiplier & 0xFFFFFFFF;
    uint64_t c = multiplicand >> 32;
    uint64_t d = multiplicand & 0xFFFFFFFF;

    //uint64_t ac = a * c;
    uint64_t ad = a * d;
    //uint64_t bc = b * c;
    uint64_t bd = b * d;

    uint64_t adbc = ad + (b * c);
    uint64_t adbc_carry = adbc < ad ? 1 : 0;

    // multiplier * multiplicand = product_hi * 2^64 + product_lo
    uint64_t product_lo = bd + (adbc << 32);
    uint64_t product_lo_carry = product_lo < bd ? 1 : 0;
    *product_hi = (a * c) + (adbc >> 32) + (adbc_carry << 32) + product_lo_carry;

    return product_lo;
}

answered Oct 25 '17 at 02:43

Augusto

151
6

1

Unfortunately this compiles really poorly on MSVC. It uses a function-call to `__allmul` for uint64*uint64 => 64-bit multiplication even when the upper half of both inputs is a compile-time 0. i.e. when it could have used one `mul` instruction to do a 32x32 => 64b full-multiply. https://godbolt.org/g/53Qxyo. If there's an intrinsic for 32x32 => 64b multiply, it would make it a lot more efficient. clang compiles it fairly nicely; there's still a lot of work to do, but it uses the `adc` instruction for all the add-with-carry. MSVC uses some `adc`, but branches for others :/ – Peter Cordes Oct 25 '17 at 04:01
1

Also, make sure you *don't* use this in 64-bit code where the intrinsic is available. (I guess using the same name is a good way to ensure compile-time error if the compiler provides it). It compiles in 64-bit mode, but to crap instead of a single `mul` instruction. – Peter Cordes Oct 25 '17 at 04:02
If you need this to actually be fast in 32-bit code, maybe consider using http://gmplib.org/. They have a 32-bit asm implementation for multiply, but I don't see a special case for 2 limbs * 2 limbs = 4 limbs. IDK how fast the general-case function is when you pass it those sizes. https://gmplib.org/repo/gmp/file/tip/mpn/x86/mul_basecase.asm – Peter Cordes Oct 25 '17 at 04:04
1

@Peter Cordes You can fix the MSVC `__allmul` problem by changing the type of `a`-`d` to `uint32_t` and casting to `uint64_t` at the point of the multiply. MSVC uses `adc` for `uint64_t` arithmetic. I think the only way to make to generate `adc` for the explicit carries is with the `_addcarry_u##` intrinsics, as seen [here](/a/29229641/5670773). `_addcarry_u64` is not available in 32-bit code so you may have to roll your own 64-bit adds with two `uint32_t`s. – benrg Oct 11 '19 at 18:12
in MSVC you should use __emulu(a,b) instead a * b, for better compiler code generation – jenkas Jan 02 '22 at 19:48

_umul128 on Windows 32 bits

2 Answers2

Linked