
I tried to implement my own `Vector4` struct in C# using the new hardware intrinsics support in .NET Core 3. This is the struct I wrote:

using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics.X86;

[StructLayout(LayoutKind.Sequential)]
public struct Vector4
{
    public float X;
    public float Y;
    public float Z;
    public float W;

    public Vector4(float x, float y, float z, float w)
    {
        X = x;
        Y = y;
        Z = z;
        W = w;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static float Dot(Vector4 a, Vector4 b)
    {
        return a.X * b.X + a.Y * b.Y + a.Z * b.Z + a.W * b.W;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static unsafe float DotSse(Vector4 a, Vector4 b)
    {
        var _a_128 = Sse.LoadVector128((float*)&a);
        var _b_128 = Sse.LoadVector128((float*)&b);
        // 0xF1: multiply all four lanes (high nibble F), store the sum in lane 0 (low nibble 1)
        var _result_128 = Sse41.DotProduct(_a_128, _b_128, 0xF1);
        float result;
        Sse.StoreScalar(&result, _result_128);
        return result;
    }
}

However, after running some benchmarks with BenchmarkDotNet, it turned out that my SSE version of the dot product is much slower than the scalar one:

    [Benchmark]
    [ArgumentsSource(nameof(Vectors))]
    public void Dot(Vector4 a, Vector4 b)
    {
        Vector4.Dot(a, b);
    }

    [Benchmark]
    [ArgumentsSource(nameof(Vectors))]
    public void DotSse(Vector4 a, Vector4 b)
    {
        Vector4.DotSse(a, b);
    }

Results on my notebook with an Intel i7-9750H:

| Method |           a |           b |      Mean |     Error |    StdDev |
|------- |------------ |------------ |----------:|----------:|----------:|
|    Dot | SSE.Vector4 | SSE.Vector4 | 0.0466 ns | 0.0258 ns | 0.0201 ns |
| DotSse | SSE.Vector4 | SSE.Vector4 | 0.6555 ns | 0.0286 ns | 0.0254 ns |

Now I am wondering: did I do something wrong in my implementation of `DotSse`?

Sam
    Did you run tests with release or debug build? – Lukasz Szczygielek Nov 22 '20 at 22:58
  • @Hostel With release build – Sam Nov 23 '20 at 00:10
    sse4.1 `dpps` is not the fastest instruction ever. It's actually slower on Ice Lake than on your Skylake-derived CPU, so it's not a good future-proof building block. Anyway, it's 4 uops, 3 for the FMA ports, on your CPU (https://uops.info/). Your pure scalar version could be compiled to 1 mul and 3 FMAs so all 4 uops have to run on port 0 or port 1, so you'd expect `dpps` to be somewhat faster. Especially since it avoids having to load each scalar element separately, in case there's a front-end bottleneck. – Peter Cordes Nov 23 '20 at 02:11
    Oh wait a minute, `0.046` ns? Nope, nothing can be that fast; that tells you that your loop optimized away, probably because you do nothing with the result of the computation. 1 clock cycle on a 4GHz CPU is 0.25ns. [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987). If you increase the repeat count for whatever you're doing, I expect you'd get a different "mean" – Peter Cordes Nov 23 '20 at 02:13
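Following up on the last comment: a 0.046 ns mean is well under one clock cycle, which strongly suggests the JIT eliminated the call because the benchmark discards its result. A minimal sketch of a fix, assuming the `Vector4` struct from the question is in scope (the class and argument-source contents here are hypothetical): return the computed `float`, which BenchmarkDotNet automatically consumes, so the call cannot be dead-code-eliminated.

```csharp
using System.Collections.Generic;
using BenchmarkDotNet.Attributes;

public class DotBenchmarks
{
    // Hypothetical argument source matching [ArgumentsSource(nameof(Vectors))];
    // the question does not show the original one.
    public IEnumerable<object[]> Vectors()
    {
        yield return new object[] { new Vector4(1, 2, 3, 4), new Vector4(5, 6, 7, 8) };
    }

    // Returning the result (instead of void) makes BenchmarkDotNet consume it,
    // so the JIT cannot optimize the dot product away.
    [Benchmark]
    [ArgumentsSource(nameof(Vectors))]
    public float Dot(Vector4 a, Vector4 b) => Vector4.Dot(a, b);

    [Benchmark]
    [ArgumentsSource(nameof(Vectors))]
    public float DotSse(Vector4 a, Vector4 b) => Vector4.DotSse(a, b);
}
```

With the results kept live, both benchmarks should report at least a fraction of a nanosecond per call, making the comparison between the scalar and `dpps` versions meaningful.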

0 Answers