
I'm writing a SIMD library and trying to squeeze every bit of performance.
I'm already casting the array in place to a Span<Vector<int>> instead of creating new objects.
Target arrays are large (more than 1000 elements).
Is there a more efficient way of summing an array?
Ideas are welcome.

    public static int Sum(int[] array)
    {
        Vector<int> vSum = Vector<int>.Zero;
        int sum;
        int i;

        Span<Vector<int>> vsArray = MemoryMarshal.Cast<int, Vector<int>>(array);

        for (i = 0; i < vsArray.Length; i++)
        {
            vSum += vsArray[i];
        }

        sum = Vector.Dot(vSum, Vector<int>.One);

        i *= Vector<int>.Count;

        for (; i < array.Length; i++)
        {
            sum += array[i];
        }

        return sum;
    }
  • Unless C# optimizes away the multiply in a dot-product with `1`, that's not efficient. It's outside the loop so it's pretty minor, but still needs a vector constant for no reason. [Fastest way to do horizontal SSE vector sum (or other reduction)](https://stackoverflow.com/a/35270026) shows efficient horizontal sums with C intrinsics. – Peter Cordes Sep 17 '20 at 19:54
  • Other than that, this is fairly reasonable. You might want to unroll the loop by hand if C# doesn't do that for you (with multiple vector accumulators) to hide latency and allow a throughput of 2 vector loads+adds per clock if data is hot in cache. But I don't have a C# dev setup to see what asm you actually get out of this. – Peter Cordes Sep 17 '20 at 19:54
  • Have a look at https://www.youtube.com/watch?v=mFZIj2y3Le0 . Maybe it helps. – Efthymios Kalyviotis Sep 17 '20 at 19:58
  • Also, here https://habr.com/en/post/467689/ – Efthymios Kalyviotis Sep 17 '20 at 20:04
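The multi-accumulator unrolling suggested in the comments could look something like the sketch below. The `SumUnrolled` name and loop structure are illustrative, not from the question; the point is that two independent accumulators let the vector adds overlap instead of serializing on a single dependency chain.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;

static class UnrolledSum
{
    // Sketch: same algorithm as the question's Sum(), but with two
    // Vector<int> accumulators so consecutive adds don't depend on
    // each other and can execute in parallel.
    public static int SumUnrolled(int[] array)
    {
        Span<Vector<int>> v = MemoryMarshal.Cast<int, Vector<int>>(array);
        Vector<int> acc0 = Vector<int>.Zero;
        Vector<int> acc1 = Vector<int>.Zero;
        int i = 0;

        // Two vectors per iteration, one into each accumulator.
        for (; i + 1 < v.Length; i += 2)
        {
            acc0 += v[i];
            acc1 += v[i + 1];
        }
        // Odd vector count: fold the leftover vector into acc0.
        if (i < v.Length)
            acc0 += v[i];

        // Horizontal reduction of the combined accumulator.
        int sum = Vector.Dot(acc0 + acc1, Vector<int>.One);

        // Scalar tail for the last array.Length % Vector<int>.Count elements.
        for (i = v.Length * Vector<int>.Count; i < array.Length; i++)
            sum += array[i];
        return sum;
    }
}
```

Whether this actually wins depends on what the JIT already does with the simple loop, so it's worth benchmarking both.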

1 Answer


Your code is good. On my PC it's only possible to improve it by about 4%; here's how:

// Test result: only 4% win on my PC.
[MethodImpl( MethodImplOptions.AggressiveInlining )]
static int sumUnsafeAvx2( int[] array )
{
    unsafe
    {
        fixed( int* sourcePointer = array )
        {
            int* pointerEnd = sourcePointer + array.Length;
            int* pointerEndAligned = sourcePointer + ( array.Length - array.Length % 16 );
            Vector256<int> sumLow = Vector256<int>.Zero;
            Vector256<int> sumHigh = sumLow;
            int* pointer;
            for( pointer = sourcePointer; pointer < pointerEndAligned; pointer += 16 )
            {
                var a = Avx.LoadVector256( pointer );
                var b = Avx.LoadVector256( pointer + 8 );
                sumLow = Avx2.Add( sumLow, a );
                sumHigh = Avx2.Add( sumHigh, b );
            }
            sumLow = Avx2.Add( sumLow, sumHigh );
            Vector128<int> res4 = Sse2.Add( sumLow.GetLower(), sumLow.GetUpper() );
            res4 = Sse2.Add( res4, Sse2.Shuffle( res4, 0x4E ) );
            res4 = Sse2.Add( res4, Sse2.Shuffle( res4, 1 ) );
            int scalar = res4.ToScalar();
            for( ; pointer < pointerEnd; pointer++ )
                scalar += *pointer;
            return scalar;
        }
    }
}

Here's a complete test.

To be clear, I don’t recommend doing what I wrote above, not for a 4% improvement. Unsafe code is, well, unsafe. Your version works without AVX2 and benefits from AVX512 if available; my optimized version will crash without AVX2 and won’t use AVX512 even if the hardware supports it.
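The "will crash without AVX2" caveat can be handled with a runtime check on `Avx2.IsSupported`. Below is a self-contained sketch of that dispatch pattern (the `CheckedSum` name is mine, and for brevity it uses a pointer-free, single-accumulator AVX2 body via `MemoryMarshal.Cast` rather than the unsafe version above):

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class CheckedSum
{
    // Sketch: guard the intrinsic path with Avx2.IsSupported so the
    // method can't crash on CPUs without AVX2; otherwise fall back
    // to a portable scalar loop.
    public static int Sum(int[] array)
    {
        if (!Avx2.IsSupported)
        {
            int s = 0;
            foreach (int x in array) s += x; // portable fallback
            return s;
        }

        // Reinterpret the array as 256-bit vectors, no pointers needed.
        ReadOnlySpan<Vector256<int>> v =
            MemoryMarshal.Cast<int, Vector256<int>>(array);
        Vector256<int> acc = Vector256<int>.Zero;
        for (int i = 0; i < v.Length; i++)
            acc = Avx2.Add(acc, v[i]);

        // Horizontal sum: 256 -> 128 -> scalar, as in the answer above.
        Vector128<int> r = Sse2.Add(acc.GetLower(), acc.GetUpper());
        r = Sse2.Add(r, Sse2.Shuffle(r, 0x4E)); // swap 64-bit halves
        r = Sse2.Add(r, Sse2.Shuffle(r, 0xB1)); // swap adjacent lanes
        int sum = r.ToScalar();

        // Scalar tail for the last array.Length % 8 elements.
        for (int i = v.Length * Vector256<int>.Count; i < array.Length; i++)
            sum += array[i];
        return sum;
    }
}
```

This still won't use AVX512, but it degrades gracefully instead of crashing.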

Soonts
  • Your code also potentially spends more time in scalar cleanup, if the element count `% 16` is high. 128-bit vector cleanup could leave at most 0..3 elements, in parallel with reducing 2x 256-bit vectors to 128. (Also, your scalar cleanup loop could be written to sum into a separate 0-initialized variable, allowing some ILP benefit for the adds as well as the loads, if the compiler doesn't do that transformation for you.) – Peter Cordes Sep 18 '20 at 05:08
  • Please note that the way you benchmark the code isn't that precise. If you want to benchmark the code in the most precise way possible, you can use BenchmarkDotNet: https://github.com/dotnet/BenchmarkDotNet – Jan Tamis Kossen Jan 05 '21 at 20:01
  • @JanTamisKossen “isn't that precise” and why’s that? – Soonts Jan 06 '21 at 01:39
  • @Soonts BenchmarkDotNet can measure time in nanoseconds while the stopwatch class can't do that. In addition BenchmarkDotNet can show results from multiple tests. – Jan Tamis Kossen Jan 07 '21 at 10:22
  • @JanTamisKossen Nanoseconds don’t matter for these tests, the time being measured is way larger. Stopwatch uses QueryPerformanceCounter under the hood, on modern PCs the resolution of that timer is typically 10MHz. – Soonts Jan 07 '21 at 10:35
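For reference, the Stopwatch resolution being discussed can be computed directly from `Stopwatch.Frequency` (ticks per second). A small illustrative helper, with names of my own choosing:

```csharp
using System;
using System.Diagnostics;

static class TimerDemo
{
    // Nanoseconds per Stopwatch tick; at the 10 MHz frequency
    // mentioned above this would be 100 ns.
    public static double NanosecondsPerTick()
        => 1e9 / Stopwatch.Frequency;

    // Minimal Stopwatch-based measurement of a single action.
    public static double MeasureMilliseconds(Action action)
    {
        var sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }
}
```

For runs lasting milliseconds or more, that tick resolution is far below the measurement noise, which is the point Soonts makes; BenchmarkDotNet's advantage is mainly its statistical treatment of many repeated runs.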