
I'm writing some performance sensitive code, where multiplication of unsigned 64-bit integers (ulong) is a bottleneck.

.NET Core 3.0 brings access to hardware intrinsics with the System.Runtime.Intrinsics namespace, which is fantastic.

I'm currently using a portable implementation that returns a tuple of the high and low bits of the 128-bit result:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static (ulong Hi, ulong Lo) Multiply64(ulong x, ulong y)
{
    // The low 64 bits are just the wrapping product
    ulong lo = x * y;

    // Split each operand into 32-bit halves
    ulong x0 = (uint)x;
    ulong x1 = x >> 32;

    ulong y0 = (uint)y;
    ulong y1 = y >> 32;

    ulong p11 = x1 * y1;
    ulong p01 = x0 * y1;
    ulong p10 = x1 * y0;
    ulong p00 = x0 * y0;

    // 64-bit product + two 32-bit values
    ulong middle = p10 + (p00 >> 32) + (uint)p01;

    // 64-bit product + two 32-bit values
    ulong hi = p11 + (middle >> 32) + (p01 >> 32);

    return (hi, lo);
}
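
As a quick sanity check (illustrative only; `BigInteger` is obviously far too slow for the hot path), I verify it against `System.Numerics.BigInteger`:

using System.Diagnostics;
using System.Numerics;

ulong x = 0xDEADBEEFDEADBEEFUL, y = 0xFEEDFACEFEEDFACEUL;
var (hi, lo) = Multiply64(x, y);

// Reconstruct the 128-bit result and compare with an arbitrary-precision reference
BigInteger expected = (BigInteger)x * y;
BigInteger actual = ((BigInteger)hi << 64) | lo;
Debug.Assert(actual == expected);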

I want to make this faster using intrinsics. I'm clear on how to use BMI2 when available (this is ~50% faster than the portable version):

// MULX: returns the high 64 bits; the low 64 bits are written through the pointer
ulong lo;
ulong hi = System.Runtime.Intrinsics.X86.Bmi2.X64.MultiplyNoFlags(x, y, &lo);
return (hi, lo);
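
In practice I guard this with `IsSupported` and fall back to the portable version when BMI2 isn't available (a sketch; `Multiply64Fast` is just an illustrative name):

using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics.X86;

[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static unsafe (ulong Hi, ulong Lo) Multiply64Fast(ulong x, ulong y)
{
    if (Bmi2.X64.IsSupported)
    {
        // MULX: returns the high 64 bits, writes the low 64 bits through the pointer
        ulong lo;
        ulong hi = Bmi2.X64.MultiplyNoFlags(x, y, &lo);
        return (hi, lo);
    }

    // Portable 32x32-bit decomposition shown above
    return Multiply64(x, y);
}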

I'm totally unclear on how to use the other intrinsics that are available; they all seem to rely on the `Vector128<T>` type, and none of the multiply overloads seem to deal with `ulong`.
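
For example, the closest match I can find is `Sse2.Multiply`, which takes `Vector128<uint>` inputs and returns `Vector128<ulong>`: that's a 32x32 => 64-bit multiply, not the 64x64 => 128-bit one I'm after:

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// pmuludq: multiplies the low (even-indexed) 32-bit element of each 64-bit lane,
// producing a full 64-bit product per lane
static Vector128<ulong> Multiply32x32(Vector128<uint> a, Vector128<uint> b)
    => Sse2.Multiply(a, b);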

How can I implement multiplication of ulongs using SSE, AVX etc?

Cocowalla
  • What about [`Vector<T>`](https://learn.microsoft.com/en-us/dotnet/api/system.numerics.vector-1?view=netstandard-2.1) from System.Numerics.Vectors? Though it seems `ulong` multiplication doesn't give any advantage there. – Jeroen van Langen May 07 '19 at 09:38
  • @J.vanLangen that's kind of my question :) None of the intrinsic multiply functions seem to work with `ulong`, e.g. `Sse2.Multiply` – Cocowalla May 07 '19 at 09:40

1 Answer


SIMD vectors aren't single wide integers. The max element width is 64-bit. They're for processing multiple elements in parallel.

x86 doesn't have any instructions for 64x64 => 128-bit SIMD-element multiply, not even with AVX512DQ. (That does provide SIMD 64x64 => 64-bit multiply though, for 2, 4, or 8 elements in parallel.)

AVX512IFMA (in Cascade Lake) has 52-bit high- and low-half multiply-accumulate (it's not a coincidence that 52 bits is the significand width of double; SIMD integer multiply instructions use the same multiply hardware as FP).


So if you wanted 64x64 => 128-bit SIMD multiply, you'd have to synthesize it out of 4x 32x32 => 64-bit vpmuludq instructions and some adds, including an add-with-carry which you'd again have to synthesize from multiple instructions.

This would probably be slower than scalar mul r64 for an array of multiplies even with AVX512 available. It only takes 4 scalar mul instructions to produce 512 bits of multiply results, and modern x86 CPUs fully pipeline mul so they can produce 1 pair of results per clock. (Of course, store throughput is only 1 per clock until IceLake / Sunny Cove, so getting both halves of the 64-bit result stored is a problem! But moving data to XMM registers for 128-bit stores costs more uops and also hits a 64-bit-per-clock bottleneck.)

If you only needed 64x64 => 64-bit multiply, you could drop the high32*high32 multiply. I wrote a C++ version of that in *Fastest way to multiply an array of int64_t?* and it is barely faster than scalar on Haswell with AVX2, but significantly faster on Skylake. Either way, it would be nowhere near worth it without AVX2.


And BTW, you don't need BMI2 to do scalar 64x64 => 128-bit multiplies.

That's baseline for x86-64, with either one-operand mul (unsigned) or imul (signed). If C# exposes an intrinsic for BMI2 mulx, it surely must expose one for plain unsigned mul and signed imul which are at least as efficient in most cases (and smaller code size).

Peter Cordes
  • I appreciate the detailed response - as you have guessed, I'm new to intrinsics :) I do need both the high and low bits of the 64x64 128-bit result. If `mul` and `imul` are standard x86, then presumably dotnet uses those already if you multiply 2 64-bit integers (looking at a decompilation, it uses a `mul` opcode, which I guess is a 1:1 match with the x86 opcode), but in any case, .NET doesn't have a 128-bit data type (yet), so that would only get me the low bits of the result. – Cocowalla May 07 '19 at 14:03
  • Strangely enough, [there's no intrinsic for `mulx` in .NET Core](https://learn.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.x86?view=netcore-3.0). So it seems the only way is to write native code calling [`_mul128`](https://learn.microsoft.com/en-us/cpp/intrinsics/mul128?view=vs-2019) or [`_umul128`](https://learn.microsoft.com/en-us/cpp/intrinsics/umul128?view=vs-2019). Or maybe edit and compile [.NET Core](https://github.com/dotnet/corefx/tree/master/src/Common/src/CoreLib/System/Runtime/Intrinsics/X86) on one's own – phuclv May 08 '19 at 02:33
  • there are various issues related to 128-bit `mul` support: [Improve System.Decimal performance for x64 platform](https://github.com/dotnet/coreclr/issues/10642), [Enable long multiply](https://github.com/dotnet/coreclr/pull/7065), https://github.com/dotnet/coreclr/pull/21362#discussion_r239273064, https://github.com/dotnet/corefx/issues/32075#issuecomment-420467575 – phuclv May 08 '19 at 02:44
  • @phuclv: Then what's `System.Runtime.Intrinsics.X86.Bmi2.X64.MultiplyNoFlags` that the OP describes in the question? BMI2 multiply NoFlags pretty uniquely describes `mulx`. (I don't know anything about C# libraries / infrastructure or what provides what, so maybe you mean that it's not actually part of .NET Core? There's a reason my answer doesn't contain any C# code or mention of any actual C# intrinsics, just what the instruction-sets and hardware are capable of. :P) Or did you mean that there are no intrinsics for plain `mul` and `imul`, *only* `mulx`? – Peter Cordes May 08 '19 at 02:45
  • @PeterCordes my typo. I mean there's no intrinsic for `mul` or `imul` in .NET core. Only `mulx` – phuclv May 08 '19 at 02:48
  • @phuclv: Well that's dumb. Not all C compilers have one, but the major ones that don't have intrinsics instead support 128-bit integer types so can emit it from `a * (__int128)b`. But there was some time when 64-bit platforms existed before a `__int128` type appeared in GCC, so I guess .NET core is at that stage of its evolution? – Peter Cordes May 08 '19 at 02:52
  • @PeterCordes I think the trouble comes from MSVC. For some unknown reason it has had `_mul128` for years, but not `_div128` [until the end of last year](https://developercommunity.visualstudio.com/content/problem/315434/unable-to-perform-12864-bit-divide-on-x64.html). .NET Core cannot evolve faster than MSVC itself – phuclv May 08 '19 at 04:04
  • The intellisense docs for the .NET BMI `MultiplyNoFlags` function state `_mulx_u64` – Cocowalla May 08 '19 at 08:16
  • @PeterCordes MSVC doesn't have GCC's __uint128_t type, but it does have [`_umul128`](https://learn.microsoft.com/en-us/cpp/intrinsics/umul128?view=vs-2019) - AFAIK this has been around for a long time, so presumably behind the scenes it can use intrinsics other than BMI2 – Cocowalla May 08 '19 at 08:19
  • @Cocowalla: That's correct, MSVC's C++ intrinsics for 64x64 => 128-bit multiply, `_mul128` and `_umul128` , expose the baseline x86-64 instructions `imul` and `mul`. (If you were compiling with BMI2 enabled, MSVC could choose to use `mulx` for `_umul128`, if enabling BMI2 is even a thing). I think you meant to say it can use *instructions* other than BMI2. They *are* intrinsics, and aren't implemented in terms of other intrinsics. So it's silly for Microsoft to have a `mulx` intrinsic but not `mul` and `imul` intrinsics. Especially since `imul` is the only choice for signed multiply. – Peter Cordes May 08 '19 at 08:28
  • @PeterCordes yes, that's what I meant, I'm just mangling the terminology somewhat :) I just checked the ASM generated by the JIT for `int/long` and `uint/ulong` multiplication, and it does indeed compile to 'base' x86 `mul` and `imul` instructions. For `ulong` though, that will of course only get the low bits of the result. – Cocowalla May 08 '19 at 08:31
  • @PeterCordes BTW, if you want to try this for yourself, I used [SharpLab](https://sharplab.io) to check the ASM generated by the JIT – Cocowalla May 08 '19 at 08:32