Why isn't Avx.Multiply significantly faster than the * operator?

Question

I've created the following test method to understand how SSE and AVX work and what their benefits are. Now I'm actually very surprised to see that System.Runtime.Intrinsics.X86.Avx.Multiply is less than 5% faster compared to the traditional approach with the * operator.

I don't understand why this is. Would you please enlighten me?

I've put my benchmark results in the last line of the code examples.

(long TicksSse2, long TicksAlu) TestFloat()
{
    Vector256<float> x = Vector256.Create((float)255, (float)128, (float)64, (float)32, (float)16, (float)8, (float)4, (float)2);
    Vector256<float> y = Vector256.Create((float).5);

    Stopwatch timerSse = new Stopwatch();
    Stopwatch timerAlu = new Stopwatch();

    for (int cnt = 0; cnt < 100_000_000; cnt++)
    {
        timerSse.Start();
        var xx = Avx.Multiply(x, y);
        timerSse.Stop();

        timerAlu.Start();
        float a = (float)255 * (float).5;
        float b = (float)128 * (float).5;
        float c = (float)64 * (float).5;
        float d = (float)32 * (float).5;
        float e = (float)16 * (float).5;
        float f = (float)8 * (float).5;
        float g = (float)4 * (float).5;
        float h = (float)2 * (float).5;
        timerAlu.Stop();
    }

    return (timerSse.ElapsedMilliseconds, timerAlu.ElapsedMilliseconds);
    // timerSse = 1688ms; timerAlu = 1748ms. 
}

The effect is even more drastic in the following test method for bulk byte multiplication. Here the version using the SSE instructions is actually slower:

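// Assumption: `helper` is not defined in the post. A mask that keeps the low byte
// of each 16-bit lane matches how it is used below.
static readonly Vector128<ushort> helper = Vector128.Create((ushort)0x00FF);

Vector128<byte> MultiplyBytes(Vector128<byte> x, Vector128<byte> y)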
{
    Vector128<ushort> xAsUShort = x.AsUInt16();
    Vector128<ushort> yAsUShort = y.AsUInt16();

    Vector128<ushort> dstEven = Sse2.MultiplyLow(xAsUShort, yAsUShort);
    Vector128<ushort> dstOdd = Sse2.MultiplyLow(Sse2.ShiftRightLogical(xAsUShort, 8), Sse2.ShiftRightLogical(yAsUShort, 8));

    return Sse2.Or(Sse2.ShiftLeftLogical(dstOdd, 8), Sse2.And(dstEven, helper)).AsByte();
}

(long TicksSse2, long TicksAlu) TestBytes()
{
    Vector128<byte> x = Vector128.Create((byte)1, (byte)2, (byte)3, (byte)4, (byte)5, (byte)6, (byte)7, (byte)8, (byte)9, (byte)10, (byte)11, (byte)12, (byte)13, (byte)14, (byte)15, (byte)16);
    Vector128<byte> y = Vector128.Create((byte)2);

    Stopwatch timerSse = new Stopwatch();
    Stopwatch timerAlu = new Stopwatch();

    for (int cnt = 0; cnt < 100_000_000; cnt++)
    {
        timerSse.Start();
        var xx = MultiplyBytes(x, y);
        timerSse.Stop();

        timerAlu.Start();
        byte a = (byte)1 * (byte)2;
        byte b = (byte)2 * (byte)2;
        byte c = (byte)3 * (byte)2;
        byte d = (byte)4 * (byte)2;
        byte e = (byte)5 * (byte)2;
        byte f = (byte)6 * (byte)2;
        byte g = (byte)7 * (byte)2;
        byte h = (byte)8 * (byte)2;
        byte i = (byte)9 * (byte)2;
        byte j = (byte)10 * (byte)2;
        byte k = (byte)11 * (byte)2;
        byte l = (byte)12 * (byte)2;
        byte m = (byte)13 * (byte)2;
        byte n = (byte)14 * (byte)2;
        byte o = (byte)15 * (byte)2;
        byte p = (byte)16 * (byte)2;
        timerAlu.Stop();
    }

    return (timerSse.ElapsedMilliseconds, timerAlu.ElapsedMilliseconds);
    // timerSse = 3439ms; timerAlu = 1800ms
}

Why shouldn't the compiler remove all the multiplications of bytes a to p at compile time, since none of them are used? Because the ALU timer numbers are so high, this brings me to another question: are you running Debug build or Release build? — Thomas Weller, Feb 15 '23 at 11:07
What you wrote measures single operations, 100M times. A Stopwatch's resolution isn't fast enough to measure single CPU commands, so the difference is most likely due to rounding errors. To get correct results use BenchmarkDotNet and measure each case separately. To get quick&rough results, use two separate loops and measure the entire loop each time — Panagiotis Kanavos, Feb 15 '23 at 11:08
I expect that the byte multiplications are optimized away at compile time, so e.g. `byte n = (byte)14 * (byte)2;` becomes `byte n = (byte)28;` in the code that gets executed. And then it's also possible (esp. in Release mode) that the compiler sees that `n` etc. are not used, so it may remove the assignments altogether. — Peter B, Feb 15 '23 at 11:08
@PeterB or the operations are so fast the stopwatch returns a single tick each time — Panagiotis Kanavos, Feb 15 '23 at 11:08
You understand that the second calculation in release mode is actually a no-op because it is optimized out by the compiler (the results are not used)? And even if the results were used, the compiler would just calculate the constant (so again, a no-op). — Guru Stron, Feb 15 '23 at 11:09
Please use proper tools for benchmarking like `BenchmarkDotNet` and proper approaches. — Guru Stron, Feb 15 '23 at 11:12
If you measure each loop separately you'll get a 25% difference, which is still too low and probably meaningless. The values are hard-coded and never used, so the compiler eliminates the operations. If you check the generated IL in e.g. Sharplab.io you'll see there are no multiplications in Release mode — Panagiotis Kanavos, Feb 15 '23 at 11:26
@ThomasWeller My results are in debug mode, but the numbers don't change significantly in release mode. — André Reichelt, Feb 15 '23 at 11:32
@AndréReichelt the code is wrong and measures empty loops and single-tick operations. To get meaningful results allocate arrays with random numbers and actually store the results somewhere, to prevent code elimination — Panagiotis Kanavos, Feb 15 '23 at 11:34
Print `Stopwatch.Frequency` to see the actual resolution of `Stopwatch`. A frequency of 10M ticks per second on a 2 GHz machine means about 200 CPU cycles pass in a single tick. — Panagiotis Kanavos, Feb 15 '23 at 11:40
benchmarking is hard, especially in JIT languages like C# or Java. You need to warm up, manage GC and do lots of other things to get meaningful result because otherwise the hotspot won't be optimized. I don't know if there's a similar question for C# but many things in Java in [How do I write a correct micro-benchmark in Java?](https://stackoverflow.com/q/504103/995714) apply here. And always measure in release mode, benchmarking debug code is silly — phuclv, Feb 15 '23 at 13:20
Thank you guys, I've posted my results with `BenchmarkDotNet` down below. — André Reichelt, Feb 15 '23 at 13:59

Panagiotis Kanavos · Answer 1 · 2023-02-15T11:39:39.533

The benchmark code isn't meaningful.

It tries to measure the duration of a single operation, 100M times, using a timer that simply doesn't have the resolution to measure single CPU operations. Any differences are due to rounding errors.

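On my machine Stopwatch.Frequency returns 10_000_000. That's 10 MHz, so a single tick lasts 100 ns. On a 2.7 GHz CPU, roughly 270 clock cycles pass in that time, far more than one multiplication needs.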
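
To check the timer resolution on your own machine, a minimal sketch along these lines may help. The class and method names (`StopwatchResolution`, `NanosecondsPerTick`, `CyclesPerTick`) are made up for illustration, and the 2.7 GHz clock rate is only the figure quoted above, not something read from the hardware.

```csharp
using System;
using System.Diagnostics;

static class StopwatchResolution
{
    // Nanoseconds covered by one Stopwatch tick at the given frequency (ticks per second).
    public static double NanosecondsPerTick(long frequency) => 1_000_000_000.0 / frequency;

    // Approximate CPU cycles that pass during one tick; cpuHz is an assumed clock rate.
    public static double CyclesPerTick(long frequency, double cpuHz) => cpuHz / frequency;

    public static void Main()
    {
        long frequency = Stopwatch.Frequency;
        Console.WriteLine($"IsHighResolution: {Stopwatch.IsHighResolution}");
        Console.WriteLine($"Frequency: {frequency} ticks per second");
        Console.WriteLine($"One tick: {NanosecondsPerTick(frequency)} ns");
        // 2.7e9 is the clock rate assumed in the text, not a measured value.
        Console.WriteLine($"Cycles per tick at 2.7 GHz: {CyclesPerTick(frequency, 2.7e9)}");
    }
}
```

With a frequency of 10_000_000 the helpers give 100 ns and 270 cycles per tick, which is why starting and stopping the timer around a single multiplication mostly measures the timer itself.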

A very crude test would be to repeat each operation 100M times in a loop and measure the entire loop:

// iterations is a method parameter here, e.g. 100_000_000.

// Time the whole vectorized loop with a single Start/Stop pair.
timerSse.Start();
for (int cnt = 0; cnt < iterations; cnt++)
{
    var xx = Avx.Multiply(x, y);
}
timerSse.Stop();

// Time the whole scalar loop the same way.
timerAlu.Start();
for (int cnt = 0; cnt < iterations; cnt++)
{
    float a = (float)255 * (float).5;
    float b = (float)128 * (float).5;
    float c = (float)64 * (float).5;
    float d = (float)32 * (float).5;
    float e = (float)16 * (float).5;
    float f = (float)8 * (float).5;
    float g = (float)4 * (float).5;
    float h = (float)2 * (float).5;
}
timerAlu.Stop();

In that case the results show a significant difference:

TicksSse2 = 357384, TicksAlu = 474061

The vectorized loop takes about 75% of the time of the scalar loop (357384 / 474061 ≈ 0.75). That's still not meaningful, because the scalar loop isn't actually multiplying anything.

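The compiler sees that the values are constant and the results are never used, so it eliminates the multiplications. The IL generated in Release mode, as shown by Sharplab.io, confirms this. The first loop still contains the Avx.Multiply call (its result is immediately popped), while the second loop only increments its counter: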

    // loop start (head: IL_005e)
        IL_0050: ldloc.0
        IL_0051: ldloc.1
        IL_0052: call valuetype [System.Runtime.Intrinsics]System.Runtime.Intrinsics.Vector256`1<float32> [System.Runtime.Intrinsics]System.Runtime.Intrinsics.X86.Avx::Multiply(valuetype [System.Runtime.Intrinsics]System.Runtime.Intrinsics.Vector256`1<float32>, valuetype [System.Runtime.Intrinsics]System.Runtime.Intrinsics.Vector256`1<float32>)
        IL_0057: pop
        IL_0058: ldloc.s 4
        IL_005a: ldc.i4.1
        IL_005b: add
        IL_005c: stloc.s 4

        IL_005e: ldloc.s 4
        IL_0060: ldarg.1
        IL_0061: blt.s IL_0050
    // end loop

    IL_0063: ldloc.2
    IL_0064: callvirt instance void [System.Runtime]System.Diagnostics.Stopwatch::Stop()
    IL_0069: ldloc.3
    IL_006a: callvirt instance void [System.Runtime]System.Diagnostics.Stopwatch::Start()
    IL_006f: ldc.i4.0
    IL_0070: stloc.s 5
    // sequence point: hidden
    IL_0072: br.s IL_007a
    // loop start (head: IL_007a)
        IL_0074: ldloc.s 5
        IL_0076: ldc.i4.1
        IL_0077: add
        IL_0078: stloc.s 5

        IL_007a: ldloc.s 5
        IL_007c: ldarg.1
        IL_007d: blt.s IL_0074
    // end loop
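
A rough sketch of a fairer comparison, following the advice in the comments: fill arrays with random numbers and store every result, so nothing can be folded into a constant or eliminated. The names `MultiplyBenchmarkSketch`, `MultiplyScalar` and `MultiplyAvx`, the array length and the repetition count are all assumptions chosen for illustration. A single warm-up call is only a crude substitute for the warm-up and statistics that BenchmarkDotNet provides, so treat the numbers as indicative at best.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class MultiplyBenchmarkSketch
{
    // Scalar baseline: one multiplication per element; the result is stored so it cannot be dropped.
    public static void MultiplyScalar(float[] x, float[] y, float[] result)
    {
        for (int i = 0; i < result.Length; i++)
            result[i] = x[i] * y[i];
    }

    // AVX version: eight floats per Avx.Multiply call.
    // Assumes all arrays have the same length and that it is a multiple of 8.
    public static void MultiplyAvx(float[] x, float[] y, float[] result)
    {
        // Reinterpret the float arrays as spans of 256-bit vectors (no copying).
        Span<Vector256<float>> vx = MemoryMarshal.Cast<float, Vector256<float>>(x.AsSpan());
        Span<Vector256<float>> vy = MemoryMarshal.Cast<float, Vector256<float>>(y.AsSpan());
        Span<Vector256<float>> vr = MemoryMarshal.Cast<float, Vector256<float>>(result.AsSpan());
        for (int i = 0; i < vr.Length; i++)
            vr[i] = Avx.Multiply(vx[i], vy[i]);
    }

    public static void Main()
    {
        if (!Avx.IsSupported)
        {
            Console.WriteLine("AVX is not supported on this CPU.");
            return;
        }

        const int length = 4096;         // multiple of 8; keeps the working set small
        const int repetitions = 100_000; // arbitrary, adjust to taste

        var random = new Random(42);
        var x = new float[length];
        var y = new float[length];
        var result = new float[length];
        for (int i = 0; i < length; i++)
        {
            x[i] = (float)random.NextDouble();
            y[i] = (float)random.NextDouble();
        }

        // Call each method once before timing so the first JIT compilation is not measured.
        MultiplyScalar(x, y, result);
        MultiplyAvx(x, y, result);

        var timerAlu = Stopwatch.StartNew();
        for (int r = 0; r < repetitions; r++)
            MultiplyScalar(x, y, result);
        timerAlu.Stop();

        var timerAvx = Stopwatch.StartNew();
        for (int r = 0; r < repetitions; r++)
            MultiplyAvx(x, y, result);
        timerAvx.Stop();

        // Printing one element keeps the results observable.
        Console.WriteLine($"Scalar: {timerAlu.ElapsedMilliseconds} ms, AVX: {timerAvx.ElapsedMilliseconds} ms, result[0] = {result[0]}");
    }
}
```

Run it in a Release build. Each timer covers one whole loop rather than a single operation, so the Stopwatch resolution no longer dominates the measurement.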
