I'm writing some performance-sensitive code where multiplication of unsigned 64-bit integers (ulong) is a bottleneck.
.NET Core 3.0 brings access to hardware intrinsics with the System.Runtime.Intrinsics namespace, which is fantastic.
I'm currently using a portable implementation that returns a tuple of the high and low 64-bit halves of the 128-bit result:
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    internal static unsafe (ulong Hi, ulong Lo) Multiply64(ulong x, ulong y)
    {
        ulong hi;
        ulong lo;

        lo = x * y;

        ulong x0 = (uint)x;
        ulong x1 = x >> 32;

        ulong y0 = (uint)y;
        ulong y1 = y >> 32;

        ulong p11 = x1 * y1;
        ulong p01 = x0 * y1;
        ulong p10 = x1 * y0;
        ulong p00 = x0 * y0;

        // 64-bit product + two 32-bit values
        ulong middle = p10 + (p00 >> 32) + (uint)p01;

        // 64-bit product + two 32-bit values
        hi = p11 + (middle >> 32) + (p01 >> 32);

        return (hi, lo);
    }
I want to make this faster using intrinsics. I'm clear on how to use BMI2 when available (this is ~50% faster than the portable version):
    ulong lo;
    ulong hi = System.Runtime.Intrinsics.X86.Bmi2.X64.MultiplyNoFlags(x, y, &lo);
    return (hi, lo);
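For context, here is a minimal sketch of how I combine the two paths, assuming the project is compiled with AllowUnsafeBlocks enabled. The class name Mul128 and the method name Multiply64Portable are placeholders made up for this example; the only intrinsic members used are Bmi2.X64.IsSupported and Bmi2.X64.MultiplyNoFlags.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics.X86;

internal static class Mul128 // hypothetical name, for illustration only
{
    // Uses BMI2 (MULX) when the CPU supports it, otherwise the portable path.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    internal static unsafe (ulong Hi, ulong Lo) Multiply64(ulong x, ulong y)
    {
        if (Bmi2.X64.IsSupported)
        {
            ulong lo;
            ulong hi = Bmi2.X64.MultiplyNoFlags(x, y, &lo);
            return (hi, lo);
        }

        return Multiply64Portable(x, y);
    }

    // Schoolbook multiplication on 32-bit halves (same logic as the portable version above).
    internal static (ulong Hi, ulong Lo) Multiply64Portable(ulong x, ulong y)
    {
        ulong lo = x * y; // low 64 bits; wraps around by design

        ulong x0 = (uint)x;
        ulong x1 = x >> 32;
        ulong y0 = (uint)y;
        ulong y1 = y >> 32;

        ulong p11 = x1 * y1;
        ulong p01 = x0 * y1;
        ulong p10 = x1 * y0;
        ulong p00 = x0 * y0;

        // 64-bit product + two 32-bit values cannot overflow 64 bits
        ulong middle = p10 + (p00 >> 32) + (uint)p01;
        ulong hi = p11 + (middle >> 32) + (p01 >> 32);

        return (hi, lo);
    }
}

internal static class Program
{
    private static void Main()
    {
        // (2^64 - 1)^2 = 2^128 - 2^65 + 1, so Hi = 2^64 - 2 and Lo = 1
        var (hi, lo) = Mul128.Multiply64(ulong.MaxValue, ulong.MaxValue);
        Console.WriteLine($"{hi:X16} {lo:X16}"); // FFFFFFFFFFFFFFFE 0000000000000001
    }
}
```

Both paths should agree on every input, so the portable method doubles as a reference when checking any other intrinsic-based variant.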
I'm totally unclear on how to use the other intrinsics that are available; they all seem to rely on the Vector128<T> type, and none of them seem to deal with the ulong type.
How can I implement multiplication of ulongs using SSE, AVX, etc.?