You can perform a single subtraction on 3+3 doubles this way:
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Vector3D Subtract(Vector3D left, Vector3D right)
{
    // Pack X, Y, Z into a 256-bit vector; the fourth lane is unused padding.
    Vector256<double> v0 = Vector256.Create(left.X, left.Y, left.Z, 0);
    Vector256<double> v1 = Vector256.Create(right.X, right.Y, right.Z, 0);
    Vector256<double> result = Avx.Subtract(v0, v1);
    return new Vector3D(result.GetElement(0), result.GetElement(1), result.GetElement(2));
}
MethodImplOptions.AggressiveInlining tells the JIT compiler to inline the method's code into the caller's body (if possible). The compiled output then contains only the calculations, with no method call.
It might be faster, but your test has two issues.
- Don't check Avx2.IsSupported per operation; do it once per application lifetime.
- Don't create data inside the loop; memory allocations make the test slow and noisy.
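The first point can be sketched roughly as follows. This is only an illustration under some assumptions: `Vec3` and `Vec3Math` are hypothetical names standing in for your own types, and the check uses `Avx.IsSupported` because the method above calls `Avx.Subtract`. The flag is read once into a static field, and a scalar fallback covers CPUs without AVX.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical stand-in for the question's Vector3D type.
public readonly struct Vec3
{
    public readonly double X, Y, Z;
    public Vec3(double x, double y, double z) { X = x; Y = y; Z = z; }
}

public static class Vec3Math
{
    // Read the CPU capability once, when the class is first used.
    private static readonly bool UseAvx = Avx.IsSupported;

    public static Vec3 Subtract(Vec3 left, Vec3 right)
    {
        if (!UseAvx)
        {
            // Scalar fallback for CPUs without AVX.
            return new Vec3(left.X - right.X, left.Y - right.Y, left.Z - right.Z);
        }

        // Fourth lane is padding and is ignored when reading the result.
        Vector256<double> v0 = Vector256.Create(left.X, left.Y, left.Z, 0);
        Vector256<double> v1 = Vector256.Create(right.X, right.Y, right.Z, 0);
        Vector256<double> r = Avx.Subtract(v0, v1);
        return new Vec3(r.GetElement(0), r.GetElement(1), r.GetElement(2));
    }
}
```

Both paths return the same values, so callers don't need to know which one ran.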
A clean test can look like this:
[Benchmark]
public void SubtractBenchMarkAccelerated()
{
    Vector3D vector1 = new Vector3D(1.5, 2.5, 3.5);
    Vector3D vector2 = new Vector3D(0.1, 0.2, 0.3);
    for (int i = 0; i < 1000000; i++)
    {
        Subtract(vector1, vector2);
    }
}
But the problem with the single operation is that it uses only 75% of the Vector256 capacity. Can it be 25% faster? Yes, with more data.
That was only the start of the story. Let's imagine that you want to process 4 pairs of vectors at once, 4 against 4. This is where the performance magic starts.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Vector3D[] SubstractArray(Vector3D[] left, Vector3D[] right)
{
    var v0 = MemoryMarshal.Cast<Vector3D, Vector256<double>>(left);
    var v1 = MemoryMarshal.Cast<Vector3D, Vector256<double>>(right);
    Vector3D[] result = new Vector3D[left.Length];
    var r = MemoryMarshal.Cast<Vector3D, Vector256<double>>(result);
    for (int i = 0; i < v0.Length; i++) // v0.Length = 3 here, not 4
    {
        r[i] = Avx.Subtract(v0[i], v1[i]);
    }
    return result;
}
MemoryMarshal.Cast doesn't copy anything; it just creates a Span<T> that points to the same memory as the source array, so it's lightning-fast. I tested it.
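To see that the span really is a view over the same memory, here is a small sketch (the `CastDemo` class and its method are hypothetical names used only for illustration). It writes through the `Vector256<double>` view and then returns the original array.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public static class CastDemo
{
    // Returns the array after writing through a Vector256<double> view of it.
    public static double[] OverwriteFirstBlock()
    {
        double[] data = { 1, 2, 3, 4, 5, 6, 7, 8 };

        // Reinterpret the 8 doubles as 2 blocks of 4 doubles; no copy is made.
        Span<Vector256<double>> view =
            MemoryMarshal.Cast<double, Vector256<double>>(data.AsSpan());

        // Writing through the view changes the original array.
        view[0] = Vector256.Create(0.0);
        return data;
    }
}
```

After the call, the first four elements of the returned array are 0 and the last four are still 5, 6, 7, 8, which would not be the case if the cast had copied the data.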
A benchmark for SubstractArray can look like this:
[Benchmark]
public void SubtractBenchMarkAccelerated4()
{
    Vector3D[] array1 = new Vector3D[4];
    array1[0] = new Vector3D(1.5, 2.5, 3.5);
    array1[1] = new Vector3D(1.5, 2.5, 3.5);
    array1[2] = new Vector3D(1.5, 2.5, 3.5);
    array1[3] = new Vector3D(1.5, 2.5, 3.5);
    Vector3D[] array2 = new Vector3D[4];
    array2[0] = new Vector3D(0.1, 0.2, 0.3);
    array2[1] = new Vector3D(0.1, 0.2, 0.3);
    array2[2] = new Vector3D(0.1, 0.2, 0.3);
    array2[3] = new Vector3D(0.1, 0.2, 0.3);
    for (int i = 0; i < 1000000; i++)
    {
        SubstractArray(array1, array2);
    }
}
Calculate 4,000,000 vectors in roughly the same time as 1,000,000, why not? You can process any number of vectors this way; just make sure that the total number of doubles % 4 == 0.
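If the number of doubles is not a multiple of 4, one common way around the restriction is to process the full blocks of 4 with AVX and finish the remainder with a scalar loop. The following is a minimal sketch of that idea on plain `double` spans; the `SimdTail` class and `SubtractDoubles` method are hypothetical names, not part of the code above.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class SimdTail
{
    public static void SubtractDoubles(
        ReadOnlySpan<double> left, ReadOnlySpan<double> right, Span<double> result)
    {
        if (left.Length != right.Length || left.Length != result.Length)
            throw new ArgumentException("All spans must have the same length.");

        int i = 0;
        if (Avx.IsSupported)
        {
            // Largest prefix whose length is a multiple of 4 doubles.
            int simdLength = left.Length - left.Length % 4;
            var v0 = MemoryMarshal.Cast<double, Vector256<double>>(left.Slice(0, simdLength));
            var v1 = MemoryMarshal.Cast<double, Vector256<double>>(right.Slice(0, simdLength));
            var r = MemoryMarshal.Cast<double, Vector256<double>>(result.Slice(0, simdLength));
            for (int k = 0; k < v0.Length; k++)
            {
                r[k] = Avx.Subtract(v0[k], v1[k]);
            }
            i = simdLength;
        }

        // Scalar loop: the leftover 0..3 elements, or everything when AVX is unavailable.
        for (; i < left.Length; i++)
        {
            result[i] = left[i] - right[i];
        }
    }
}
```

The caller supplies the result buffer here, so the method itself allocates nothing.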
Can it be faster than SubstractArray? Yes, but only with unsafe code.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static unsafe Vector3D[] SubstractArrayUnsafe(Vector3D[] left, Vector3D[] right)
{
    var v0 = MemoryMarshal.Cast<Vector3D, Vector256<double>>(left);
    var v1 = MemoryMarshal.Cast<Vector3D, Vector256<double>>(right);
    Vector3D[] result = new Vector3D[left.Length];
    var r = MemoryMarshal.Cast<Vector3D, Vector256<double>>(result);
    fixed (Vector256<double>* vPtr0 = v0, vPtr1 = v1, rPtr = r)
    {
        Vector256<double>* endPtr0 = vPtr0 + v0.Length;
        Vector256<double>* vPos0 = vPtr0;
        Vector256<double>* vPos1 = vPtr1;
        Vector256<double>* rPos = rPtr;
        while (vPos0 < endPtr0)
        {
            *rPos = Avx.Subtract(*vPos0, *vPos1);
            vPos0++;
            vPos1++;
            rPos++;
        }
    }
    return result;
}
You can subtract not only Vector3D[] this way but also your Vector4D[] or plain double[] arrays.
Also visit these useful pages: x86/x64 SIMD Instruction List (SSE to AVX512) and this one.
UPDATE
Optimizing a single operation for structs of the same size as the vector
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector4DHW Substract(ref Vector4DHW left, ref Vector4DHW right)
{
    var left1 = Unsafe.As<Vector4DHW, Vector256<double>>(ref left);
    var right1 = Unsafe.As<Vector4DHW, Vector256<double>>(ref right);
    var result = Avx.Subtract(left1, right1);
    return Unsafe.As<Vector256<double>, Vector4DHW>(ref result);
}
Let's benchmark it.
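using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
class Program
{
    static void Main()
    {
        var summary = BenchmarkRunner.Run<MyBenchmark>();
        Console.ReadKey();
    }
}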
[StructLayout(LayoutKind.Sequential)]
public struct Vector4DHW
{
    public double X;
    public double Y;
    public double Z;
    public double W;
    public Vector4DHW(double x, double y, double z, double w)
    {
        X = x;
        Y = y;
        Z = z;
        W = w;
    }
}
public class MyBenchmark
{
    private Vector4DHW vector1 = new Vector4DHW(1.5, 2.5, 3.5, 4.5);
    private Vector4DHW vector2 = new Vector4DHW(0.1, 0.2, 0.3, 0.4);
    [Benchmark]
    public void Loop()
    {
        for (int i = 0; i < 1000000; i++)
        {
            var j = i;
        }
    }
    [Benchmark]
    public void Substract()
    {
        for (int i = 0; i < 1000000; i++)
        {
            var result = Substract(ref vector1, ref vector2);
        }
    }
    [Benchmark]
    public void SubstractAvx()
    {
        for (int i = 0; i < 1000000; i++)
        {
            var result = SubstractAvx(ref vector1, ref vector2);
        }
    }
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector4DHW Substract(ref Vector4DHW left, ref Vector4DHW right)
    {
        return new Vector4DHW(left.X - right.X, left.Y - right.Y, left.Z - right.Z, left.W - right.W);
    }
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector4DHW SubstractAvx(ref Vector4DHW left, ref Vector4DHW right)
    {
        var left1 = Unsafe.As<Vector4DHW, Vector256<double>>(ref left);
        var right1 = Unsafe.As<Vector4DHW, Vector256<double>>(ref right);
        var result = Avx.Subtract(left1, right1);
        return Unsafe.As<Vector256<double>, Vector4DHW>(ref result);
    }
}
Go!
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-4700HQ CPU 2.40GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.102
[Host] : .NET Core 5.0.2 (CoreCLR 5.0.220.61120, CoreFX 5.0.220.61120), X64 RyuJIT
DefaultJob : .NET Core 5.0.2 (CoreCLR 5.0.220.61120, CoreFX 5.0.220.61120), X64 RyuJIT
| Method | Mean | Error | StdDev |
|------------- |-----------:|--------:|--------:|
| Loop | 317.6 us | 1.36 us | 1.21 us |
| Substract | 1,427.0 us | 4.14 us | 3.46 us |
| SubstractAvx | 478.0 us | 1.58 us | 1.40 us |
In conclusion, memory optimisations really matter when you're trying to shave off a few more microseconds. Even stack allocations count, however fast they are. Finally, the for loop overhead accounts for a large share of the 478 microseconds, which is why I measured the Loop overhead separately.
Let's calculate the performance gain from AVX.
1,427.0 - 317.6 = 1109.4
478.0 - 317.6 = 160.4
1109.4 / 160.4 = 6.92
AVX is almost 7 times faster.
UPDATE2
Also test this
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public unsafe static Vector4DHW Substract(Vector4DHW left, Vector4DHW right)
{
    var result = Avx.Subtract(*(Vector256<double>*)&left, *(Vector256<double>*)&right);
    return *(Vector4DHW*)&result;
}