Implementing C# hardware intrinsics wrapper issue

Question

I am trying to use the power of hardware intrinsic and just for test create one function based on Avx2 instructions and compare it with my current Vector implementation with no intrinsic at all.

When I did a benchmark of 2 functions doing the same, I was impressed that intrinsics function actually 2 times slower. I investigate that and found out that calculations itself ~3.8 times faster, but when I starting to create my wrapper structure and return the result, it actually where the most time is spend.

Here is my implementations for intrinsic method:

public static Vector4FHW Subtract(Vector4FHW left, Vector4FHW right)
{
    if (Avx2.IsSupported)
    {
        var left1 = Vector128.Create(left.X, left.Y, left.Z, left.W);
        var right1 = Vector128.Create(right.X, right.Y, right.Z, right.W);
        var result = Avx2.Subtract(left1, right1);
 
        var x = result.GetElement(0);
        var y = result.GetElement(1);
        var z = result.GetElement(2);
        var w = result.GetElement(3);
 
        return new Vector4FHW(x, y, z, w);
    }

    return default;
}

And here is my naive implementation of old Vector:

public static void Subtract(ref Vector3F left, ref Vector3F right, out Vector3F result)
{
   result = new Vector3F(left.X - right.X, left.Y - right.Y, left.Z - right.Z);
}

I made benchmark with BenchmarkDotNet where I call Subtract 1 000 000 times and here is my results:

With HW support I have ~3170 us, without - 970 us

And my main question: what I am doing wrong that creating C# struct with values takes soooo long comparing to my old implementation and/or I can do some additional optimizations here?

UPDATE

My Vector4FHW and Vector3F actually the same structures. They looks like this:

[StructLayout(LayoutKind.Sequential)]
public struct Vector4FHW
{
    public float X;

    public float Y;

    public float Z;

    public float W;

    public Vector4FHW(float x, float y, float z, float w)
    {
        X = x;
        Y = y;
        Z = z;
            W = w;
    }
    //...
}

And here is my tests. They also pretty simple:

[Benchmark]
public void SubtractBenchMarkAccelerated()
{
    for (int i = 0; i < 1000000; i++)
    {
        Vector4FHW.Subtract(new Vector4FHW(1, 20, 60, 15),new Vector4FHW(20, 48, 79, 19));
    }
}

[Benchmark]
public void SubtractBenchMark()
{
    for (int i = 0; i < 1000000; i++)
    {
        Vector4F.Subtract(new Vector4F(1, 20, 60, 15), new Vector4F(20, 48, 79, 19));
    }
}

Have you checked this ? https://stackoverflow.com/questions/3395873/performance-of-pass-by-value-vs-pass-by-reference-in-c-sharp-net — Mauricio Gracia Gutierrez, Feb 07 '21 at 19:25
It's easy to fix 1) What is `Vector4FHW`, can you show its code? 2) Can you show the testing code? Are you operating on the large array of `Vector4FHW` structures? — aepot, Feb 07 '21 at 19:28
3) What is `Vector3D`, can you show its code? What the format of the original data and the desired output format? Looks like you are performing redundant data type conversions. The pill for code is as less conversions as possible, as less memory allocations as possible. — aepot, Feb 07 '21 at 19:36
Final questions: .NET 5? Is `Vector3D` = `System.Windows.Media.Media3D.Vector3D`? — aepot, Feb 07 '21 at 20:04
@aepot Yes, its Net 5, but all Vector structures are mine (except Vector128). I`ve edited my question. Hope that helps — Denis, Feb 07 '21 at 20:05
How many `Vector4F` in real code do you want to compute at one loop? Is `Vector4F` is source of your original data? Is `float` is native foor your code? Or double if better? It's important for performance. What if you make zero conversions for the data, what would be format? `Vector3D`? What the real puprose of the code? XYZW - is that final structure format? Or maybe you have just an array of doubles/floats? — aepot, Feb 07 '21 at 20:08
Actually I dont measure how many I want to compute at once, but my Collision math is based on Vector3F/D. Currently I am using doubles for collisions in 2d/triangulation because it have better precision and better results. VectorF structs also using in Vector math calculations (I am developing game engine) — Denis, Feb 07 '21 at 20:14
@Denis let's operate with `System.Windows.Media.Media3D.Vector3D` then? — aepot, Feb 07 '21 at 20:15
Vector structs contains a lot of methods for different calculations like transformations/Subtraction/Dot/Cross and others. That why I prefer using my own versions of these data structures — Denis, Feb 07 '21 at 20:22
@Denis AVX2 have a lot of methods too. OK, 4 doubles per package is the best for you? — aepot, Feb 07 '21 at 20:24
@aepot most of the time I am using 3 doubles, but as far as we can use inly 2 or 4, 4 is ok — Denis, Feb 07 '21 at 20:25
@Denis you can use 3 if you want, any. I can to it for `Vector3D` with doubles as well. Is it fine? I just want to show you how can it be fast. — aepot, Feb 07 '21 at 20:27
@aepot I think 3 doubles is perfectly ok. I am looking forward to see your solution — Denis, Feb 07 '21 at 20:37

aepot · Accepted Answer · 2021-02-11T20:12:14.267

You can make a single operation with 3+3 doubles this way

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Vector3D Substract(Vector3D left, Vector3D right)
{
    Vector256<double> v0 = Vector256.Create(left.X, left.Y, left.Z, 0);
    Vector256<double> v1 = Vector256.Create(right.X, right.Y, right.Z, 0);
    Vector256<double> result = Avx.Subtract(v0, v1);
    return new Vector3D(result.GetElement(0), result.GetElement(1), result.GetElement(2));
}

MethodImplOptions.AggressiveInlining tells compiler to embed method's code to the caller's body (if possible). No method call will be in output assembly but only calculations.

It might be faster but your test has 2 issues.

Don't check Avx2.IsSupported per operation, do it once per application life.
Don't create data in the loop, memory allocations makes the test slow and dirty.

Clean test can look like this

[Benchmark]
public void SubtractBenchMarkAccelerated()
{
    Vector3D vector1 = new Vector3D(1.5, 2.5, 3.5);
    Vector3D vector2 = new Vector3D(0.1, 0.2, 0.3);
    for (int i = 0; i < 1000000; i++)
    {
        Subtract(vector1, vector2);
    }
}

But the problem of the single operation if using only 75% of Vector256 capacity. Can it be faster by 25%? Yes, with more data.

That was only start of the story. Let's imagine that you want to calculate 4 groups of vectors at once. 4 against 4. Where the performance magic starts.

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Vector3D[] SubstractArray(Vector3D[] left, Vector3D[] right)
{
    var v0 = MemoryMarshal.Cast<Vector3D, Vector256<double>>(left);
    var v1 = MemoryMarshal.Cast<Vector3D, Vector256<double>>(right);
    Vector3D[] result = new Vector3D[left.Length];
    var r = MemoryMarshal.Cast<Vector3D, Vector256<double>>(result);

    for (int i = 0; i < v0.Length; i++) // v0.Length = 3 here, not 4
    {
        r[i] = Avx.Subtract(v0[i], v1[i]);
    }

    return result;
}

MemoryMarshal.Cast doesn't copy anything, it just make Span<T> that pointing to the same memory as source array, thus it's lightning-fast. I tested.

The test can look like this.

[Benchmark]
public void SubtractBenchMarkAccelerated4()
{
    Vector3D[] array1 = new Vector3D[4];
    array1[0] = new Vector3D(1.5, 2.5, 3.5);
    array1[1] = new Vector3D(1.5, 2.5, 3.5);
    array1[2] = new Vector3D(1.5, 2.5, 3.5);
    array1[3] = new Vector3D(1.5, 2.5, 3.5);
    Vector3D[] array2 = new Vector3D[4];
    array2[0] = new Vector3D(0.1, 0.2, 0.3);
    array2[1] = new Vector3D(0.1, 0.2, 0.3);
    array2[2] = new Vector3D(0.1, 0.2, 0.3);
    array2[3] = new Vector3D(0.1, 0.2, 0.3);
    for (int i = 0; i < 1000000; i++)
    {
        SubstractArray(array1, array2);
    }
}

Calculate 4000000 of vectors in the ~same time as 1000000, why not? You can calculate any amount of vectors this way. Just be sure that doubles count % 4 == 0.

Can it be faster than above example? Yes but with unsafe code only.

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static unsafe Vector3D[] SubstractArrayUnsafe(Vector3D[] left, Vector3D[] right)
{
    var v0 = MemoryMarshal.Cast<Vector3D, Vector256<double>>(left);
    var v1 = MemoryMarshal.Cast<Vector3D, Vector256<double>>(right);
    Vector3D[] result = new Vector3D[left.Length];
    var r = MemoryMarshal.Cast<Vector3D, Vector256<double>>(result);

    fixed (Vector256<double>* vPtr0 = v0, vPtr1 = v1, rPtr = r)
    {
        Vector256<double>* endPtr0 = vPtr0 + v0.Length;
        Vector256<double>* vPos0 = vPtr0;
        Vector256<double>* vPos1 = vPtr1;
        Vector256<double>* rPos = rPtr;
        while (vPos0 < endPtr0)
        {
            *rPos = Avx.Subtract(*vPos0, *vPos1);
            vPos0++;
            vPos1++;
            rPos++;
        }
    }
    return result;
}

You can substract this way not only Vector3D[] but your Vector4D[] or simply double[] arrays.

Also visit these useful pages: x86/x64 SIMD Instruction List (SSE to AVX512) and this one.

UPDATE

Optimizing single operation for packages of the same size

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector4DHW Substract(ref Vector4DHW left, ref Vector4DHW right)
{
    var left1 = Unsafe.As<Vector4DHW, Vector256<double>>(ref left);
    var right1 = Unsafe.As<Vector4DHW, Vector256<double>>(ref right);
    var result = Avx.Subtract(left1, right1);
    return Unsafe.As<Vector256<double>, Vector4DHW>(ref result);
}

Let's Benchmark

class Program
{
    static void Main()
    {
        var summary = BenchmarkRunner.Run<MyBenchmark>();
        Console.ReadKey();
    }
}

[StructLayout(LayoutKind.Sequential)]
public struct Vector4DHW
{
    public double X;

    public double Y;

    public double Z;

    public double W;

    public Vector4DHW(double x, double y, double z, double w)
    {
        X = x;
        Y = y;
        Z = z;
        W = w;
    }
}

public class MyBenchmark
{
    private Vector4DHW vector1 = new Vector4DHW(1.5, 2.5, 3.5, 4.5);
    private Vector4DHW vector2 = new Vector4DHW(0.1, 0.2, 0.3, 0.4);

    [Benchmark]
    public void Loop()
    {
        for (int i = 0; i < 1000000; i++)
        {
            var j = i;
        }
    }

    [Benchmark]
    public void Substract()
    {
        for (int i = 0; i < 1000000; i++)
        {
            var result = Substract(ref vector1, ref vector2);
        }
    }

    [Benchmark]
    public void SubstractAvx()
    {
        for (int i = 0; i < 1000000; i++)
        {
            var result = SubstractAvx(ref vector1, ref vector2);
        }
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector4DHW Substract(ref Vector4DHW left, ref Vector4DHW right)
    {
        return new Vector4DHW(left.X - right.X, left.Y - right.Y, left.Z - right.Z, left.W - right.W);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector4DHW SubstractAvx(ref Vector4DHW left, ref Vector4DHW right)
    {
        var left1 = Unsafe.As<Vector4DHW, Vector256<double>>(ref left);
        var right1 = Unsafe.As<Vector4DHW, Vector256<double>>(ref right);
        var result = Avx.Subtract(left1, right1);
        return Unsafe.As<Vector256<double>, Vector4DHW>(ref result);
    }
}

Go!

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-4700HQ CPU 2.40GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.102
  [Host]     : .NET Core 5.0.2 (CoreCLR 5.0.220.61120, CoreFX 5.0.220.61120), X64 RyuJIT
  DefaultJob : .NET Core 5.0.2 (CoreCLR 5.0.220.61120, CoreFX 5.0.220.61120), X64 RyuJIT

|       Method |       Mean |   Error |  StdDev |
|------------- |-----------:|--------:|--------:|
|         Loop |   317.6 us | 1.36 us | 1.21 us |
|    Substract | 1,427.0 us | 4.14 us | 3.46 us |
| SubstractAvx |   478.0 us | 1.58 us | 1.40 us |

As conclusion I can tell that memory optimisations are really important when you're trying to save few more microseconds in performance. And even Stack allocations are important regardless of its lightning-fast speed. Finally the for loop overhead consumes a lot of time from 478 microseconds. That's why I measured the Loop overhead separately.

Let's calculate the performance gain from AVX.

1,427.0 - 317.6 = 1109.4
478.0 - 317.6 = 160.4
1109.4 / 160.4 = 6.92

AVX is almost 7 times faster.

UPDATE2

Also test this

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public unsafe static Vector4DHW Substract(Vector4DHW left, Vector4DHW right)
{
    var result = Avx.Subtract(*(Vector256<double>*)&left, *(Vector256<double>*)&right);
    return *(Vector4DHW*)&result;
}

Shouldn't the result of calling Subtract *in the benchmark* be stored so that the jitter doesn't realize it can discard the results and the calculation as a whole? — pinkfloydx33, Feb 07 '21 at 22:20
@pinkfloydx33 good observation but with agressive inlining it has no effect. There's another issue, it's better to move data instantiation for benchmark totally to `[GlobalSetup]` and leave only calculations inside the `[Benchmark]`. — aepot, Feb 07 '21 at 22:23
@aepot Good catch! I replicated your tests and see that updated code is really faster. I think this kind of casting from Vector256 to my Vector struct I was really missing. Thanks for your help. I will mark this as correct answer — Denis, Feb 08 '21 at 13:44
@aepot But I would like also to know how to correctly deal with "IsSupported" flag. Should I store it to variable and check it each time when I call some AVX/AVX2 or SSE instructions? — Denis, Feb 08 '21 at 14:18
@Denis [check this](https://codereview.stackexchange.com/a/254289/226545) (`Apply` method). In short, make two derived from one interface classes and use as underlying engine one of them depending on `IsSupported` value. But be caredul with it (at least test the performance) because it can disable `AgressiveInlining` benefits. — aepot, Feb 08 '21 at 15:00
@Denis btw it's not applicable to `static` methods but you may create some `MyMath` class. — aepot, Feb 08 '21 at 15:13
@Denis confiremd, just benchmarked => 10x times slower. :( But you still can use this method for big functions. E.g. you run some complex function once but intinsified `AgressiveInlining` methods are located in the engine class and called from that function. So, you may select engine with that way but not small methods. As i did with `SobelOperator` by the link above. — aepot, Feb 08 '21 at 15:23
@aepot Just tested with Avx.IsSupported and without and actually there is no performance difference. Maybe 2-3 us on my PC — Denis, Feb 08 '21 at 16:26
@Denis it every time calles `CPUID` instruction. Which can be called once. Ok if you want to do it, you may cache it in the local `bool` field. `private static readonly bool IsAvx2Supported = Avx2.IsSupported`. But for performance i prefer implementation avoiding any `if` guarding statements inside the each function. — aepot, Feb 08 '21 at 16:29
@aepot Is there a way to gain performance impact from using Vector128 vs simple Vector2D? As far as I can see, both calculations have the same time = ~500us for million passes. Is it maximum that avx2 could give or there is some tricks for receiving more performance? — Denis, Feb 09 '21 at 14:18
@Denis you may test it, i don't know. There's a lot of tricks to gain performance such as complex instructions, integer calculations, compute more data per loop iteration, etc. etc. But all these optimisations are highly depends on the objective. For example, i linked above Sobel Operator solution. My first attemt to intrinsify it was 10 times slower than the final version. Also you may put calculations precision to get more performance in some cases. E.g. calc floats instead of doubles in terms of Vector capacity. And much more things. — aepot, Feb 09 '21 at 15:43
@aepot Didnt see any performance improvements for my CPU for the last scenario unfortunately( — Denis, Feb 28 '21 at 08:16

score 0 · Answer 2 · answered Feb 08 '21 at 09:52

@aepot OK, I took into account your remarks and played with it a little and here is my results:

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector4DHW Subtract(Vector4DHW left, Vector4DHW right)
    {
        var left1 = Vector256.Create(left.X, left.Y, left.Z, 0);
        var right1 = Vector256.Create(right.X, right.Y, right.Z, 0);
        var result = Avx2.Subtract(left1, right1);

         return new Vector4DHW(result.GetElement(0), result.GetElement(1), 
         result.GetElement(2), result.GetElement(3));
    }

If I am using it like this, I receive ~2470 us

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static unsafe Vector4DHW Subtract(Vector4DHW left, Vector4DHW right)
    {
        var left1 = Vector256.Create(left.X, left.Y, left.Z, left.W);
        var right1 = Vector256.Create(right.X, right.Y, right.Z, left.W);
        var result = Avx2.Subtract(left1, right1);

        double* value = stackalloc double[4];
        Avx2.Store(value, result);
        return new Vector4DHW(value[0], value[1], value[2], value[3]);
    }

This variant gives me ~2089 us

Even if I make my method void and do no conversions at all like this:

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static void Subtract(Vector4DHW left, Vector4DHW right)
    {
        var left1 = Vector256.Create(left.X, left.Y, left.Z, left.W);
        var right1 = Vector256.Create(right.X, right.Y, right.Z, left.W);
        var result = Avx2.Subtract(left1, right1);
    }

It will gives me ~990 us. Its the only case when intrinsic functions faster, but I need to give user possibility to see results of the calculations

Calculations of pure Vector4D is near 1990 - 2000 us

So, I dont see any benefit of using intrinsics for such kind of calculations. Maybe in some cases it will be faster, but I think each situation should be considered separately

Implementing C# hardware intrinsics wrapper issue

2 Answers2