I am trying to boost performance in a .NET Core library by using System.Numerics to perform SIMD operations on float[] arrays. The current API is a bit awkward, and I'm having a hard time seeing how it can be beneficial. I understand that to see a performance boost with SIMD, the overhead must be amortized over a large amount of computation, but given how the API is currently shaped, I can't figure out how to accomplish this.
Vector<float> requires exactly Vector<float>.Count elements (8 floats on my machine). If I want to perform SIMD operations on a group of values smaller than that, I am forced to copy the values to a new array and pad the remainder with zeroes. If the group of values is larger, I need to copy the values, pad with zeroes so the length is a multiple of Vector<float>.Count, and then loop over the chunks. The length requirement makes sense, but accommodating it this way seems like a good way to nullify any performance gain.
I have written a test wrapper struct that takes care of the padding and alignment:
public readonly struct VectorWrapper<T>
    where T : unmanaged
{
    #region Data Members

    public readonly int Length;
    private readonly T[] data_;

    #endregion

    #region Constructor

    public VectorWrapper( T[] data )
    {
        Length = data.Length;

        var stepSize = Vector<T>.Count;
        var bufferedLength = data.Length - ( data.Length % stepSize ) + stepSize;

        data_ = new T[ bufferedLength ];
        data.CopyTo( data_, 0 );
    }

    #endregion

    #region Public Methods

    public T[] ToArray()
    {
        var returnData = new T[ Length ];
        data_.AsSpan( 0, Length ).CopyTo( returnData );
        return returnData;
    }

    #endregion

    #region Operators

    public static VectorWrapper<T> operator +( VectorWrapper<T> l, VectorWrapper<T> r )
    {
        var resultLength = l.Length;
        var result = new VectorWrapper<T>( new T[ l.Length ] );

        var lSpan = l.data_.AsSpan();
        var rSpan = r.data_.AsSpan();
        var stepSize = Vector<T>.Count;

        for( var i = 0; i < resultLength; i += stepSize )
        {
            var lVec = new Vector<T>( lSpan.Slice( i ) );
            var rVec = new Vector<T>( rSpan.Slice( i ) );
            Vector.Add( lVec, rVec ).CopyTo( result.data_, i );
        }

        return result;
    }

    #endregion
}
This wrapper does the trick: the calculations appear to be correct, and Vector<T> no longer complains about the element count. However, it is nearly twice as slow as a plain for loop.
Here's the benchmark:
public class VectorWrapperBenchmarks
{
    #region Data Members

    private static float[] arrayA;
    private static float[] arrayB;

    private static VectorWrapper<float> vecA;
    private static VectorWrapper<float> vecB;

    #endregion

    #region Constructor

    public VectorWrapperBenchmarks()
    {
        arrayA = new float[ 1024 ];
        arrayB = new float[ 1024 ];
        for( var i = 0; i < 1024; i++ )
            arrayA[ i ] = arrayB[ i ] = i;

        vecA = new VectorWrapper<float>( arrayA );
        vecB = new VectorWrapper<float>( arrayB );
    }

    #endregion

    [Benchmark]
    public void ForLoopSum()
    {
        var aA = arrayA;
        var aB = arrayB;
        var result = new float[ 1024 ];

        for( var i = 0; i < 1024; i++ )
            result[ i ] = aA[ i ] + aB[ i ];
    }

    [Benchmark]
    public void VectorSum()
    {
        var vA = vecA;
        var vB = vecB;
        var result = vA + vB;
    }
}
And the results:
| Method | Mean | Error | StdDev |
|----------- |-----------:|---------:|---------:|
| ForLoopSum | 757.6 ns | 15.67 ns | 17.41 ns |
| VectorSum | 1,335.7 ns | 17.25 ns | 16.13 ns |
My processor (i7-6700k) does support SIMD hardware acceleration, and this is running in Release mode, 64-bit with optimizations enabled on .NET Core 2.2 (Windows 10).
I realize that the Array.CopyTo() calls are likely a large part of what is killing performance, but there seems to be no easy way to get both padding/alignment and data sets that don't already conform to Vector<T>'s size requirement.
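One direction I've been experimenting with is dropping the wrapper entirely and writing into a caller-supplied destination, so no per-operation copies or allocations occur. A sketch (AddInto is a hypothetical name; it relies on the same Span-taking Vector<T> constructor my wrapper already uses):

```csharp
using System;
using System.Numerics;

static class SpanMath
{
    // Hypothetical allocation-free add: the caller owns all three buffers,
    // so no padding copy or result allocation happens per call.
    public static void AddInto( Span<float> a, Span<float> b, Span<float> dest )
    {
        var step = Vector<float>.Count;
        var i = 0;

        // Vectorized loop over full chunks.
        for( ; i <= a.Length - step; i += step )
        {
            var va = new Vector<float>( a.Slice( i, step ) );
            var vb = new Vector<float>( b.Slice( i, step ) );
            ( va + vb ).CopyTo( dest.Slice( i, step ) );
        }

        // Scalar tail for whatever is left.
        for( ; i < a.Length; i++ )
            dest[ i ] = a[ i ] + b[ i ];
    }
}
```

But I don't know whether this kind of manual buffer management is the intended usage of the API.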
I'm rather new to SIMD, and I understand that the C# implementation is still in an early phase. However, I don't see a clear way to actually benefit from it, especially considering it is supposed to be most beneficial on larger data sets.
Is there a better way to go about this?