Vector weak SIMD Performance

Question

I am optimizing an algorithm and I am considering using Vector over double for a multiply and accumulate operation. The implementation the closest is obviously a Vector.dot(v1, v2);... BUT, why is my code so slow?

namespace ConsoleApp1 {
    class Program {
        public static double SIMDMultAccumulate(double[] inp1, double[] inp2) {

            var simdLength = Vector<double>.Count;
            var returnDouble = 0d;

            // Find the max and min for each of Vector<ushort>.Count sub-arrays 
            var i = 0;
            for (; i <= inp1.Length - simdLength; i += simdLength) {
                var va = new Vector<double>(inp1, i);
                var vb = new Vector<double>(inp2, i);
                returnDouble += Vector.Dot(va, vb);
            }

            // Process any remaining elements
            for (; i < inp1.Length; ++i) {
                var va = new Vector<double>(inp1, i);
                var vb = new Vector<double>(inp2, i);
                returnDouble += Vector.Dot(va, vb);
            }

            return returnDouble;
        }


        public static double NonSIMDMultAccumulate(double[] inp1, double[] inp2) {
            var returnDouble = 0d;

            for (int i = 0; i < inp1.Length; i++) {
                returnDouble += inp1[i] * inp2[i];
            }

            return returnDouble;
        }

        static void Main(string[] args) {
            Console.WriteLine("Is hardware accelerated: " + Vector.IsHardwareAccelerated);

            const int size = 24;
            var inp1 = new double[size];
            var inp2 = new double[size];

            var random = new Random();
            for (var i = 0; i < inp1.Length; i++) {
                inp1[i] = random.NextDouble();
                inp2[i] = random.NextDouble();
            }

            var sumSafe = 0d;
            var sumFast = 0d;

            var sw = Stopwatch.StartNew();
            for (var i = 0; i < 10; i++) {
                sumSafe =  NonSIMDMultAccumulate(inp1, inp2);
            }
            Console.WriteLine("{0} Ticks", sw.Elapsed.Ticks);

            sw.Restart();
            for (var i = 0; i < 10; i++) {
                sumFast = SIMDMultAccumulate(inp1, inp2);
            }
            Console.WriteLine("{0} Ticks", sw.Elapsed.Ticks);

//            Assert.AreEqual(sumSafe, sumFast, 0.00000001);
        }
    }

}

The SIMD version needs around 70% more ticks compared to the nonSIMD version. I am running a Haswell architecture and imho. FMA3 should be implemented! (Release build, x64 prefered).

Any ideas? Thanks guys!

Benchmark fail, times are completely dominated by jitting overhead. Hard to measure, this is extremely fast code. Put a for(;;) loop around it to run it 10 times to get rid of the jitting overhead and get a feel for the variability. Pick a larger number than 24 to see any real improvement. No FMA in the current code generator btw. — Hans Passant, Jul 07 '18 at 17:57
I see your point, changed parameters around and got better results, but for my case it is probably not the right approach! — Styp, Jul 07 '18 at 18:23
Use a true benchmark framework please, https://github.com/dotnet/BenchmarkDotNet Like Hans mentioned, only if you get rid of overhead the testing becomes useful. — Lex Li, Jul 08 '18 at 04:30
I will try to do the BenchmarkDotNet thingy as soon as I have a some sprae time... — Styp, Jul 08 '18 at 12:35
@harold deleted his answer, but it makes the important point that you should accumulate a vector of results, instead of narrowing to scalar inside the inner loop. And use multiple accumulators to hide FP latency. — Peter Cordes, Jul 08 '18 at 23:48
Is the implementation even correct? I don't really know C#, but the way you construct `new Vector(inp1, i);` the same in the main and the remain-loop does not look right. Also, does C# optimize away the construction of `Vector`? — chtz, Jul 10 '18 at 07:17

score 3 · Answer 1 · edited Apr 09 '19 at 20:34

Using BechmarkDotNet, I get almost double performance with SIMD Vector assuming that the input arrays have length (ITEMS = 10000) that is a multiple of Vector.Count:

    [Benchmark(Baseline = true)]
    public double DotDouble()
    {
        double returnVal = 0.0;
        for(int i = 0; i < ITEMS; i++)
        {
            returnVal += doubleArray[i] * doubleArray2[i];
        }
        return returnVal;
    }

    [Benchmark]
    public double DotDoubleVectorNaive()
    {
        double returnVal = 0.0;
        for(int i = 0; i < ITEMS; i += doubleSlots)
        {
           returnVal += Vector.Dot(new Vector<double>(doubleArray, i), new Vector<double>(doubleArray2, i));
        }
        return returnVal;  
    }

    [Benchmark]
    public double DotDoubleVectorBetter()
    {
        Vector<double> sumVect = Vector<double>.Zero;
        for (int i = 0; i < ITEMS; i += doubleSlots)
        {
            sumVect += new Vector<double>(doubleArray, i) * new Vector<double>(doubleArray2, i);
        }
        return Vector.Dot(sumVect, Vector<double>.One);
    }

    BenchmarkDotNet=v0.10.14, OS=Windows 10.0.17134
    Intel Core i7-4500U CPU 1.80GHz (Haswell), 1 CPU, 4 logical and 2 physical cores
    Frequency=1753758 Hz, Resolution=570.2041 ns, Timer=TSC
    .NET Core SDK=2.1.300
      [Host]     : .NET Core 2.1.0 (CoreCLR 4.6.26515.07, CoreFX 4.6.26515.06), 64bit RyuJIT
      DefaultJob : .NET Core 2.1.0 (CoreCLR 4.6.26515.07, CoreFX 4.6.26515.06), 64bit RyuJIT


                Method |      Mean |     Error |    StdDev | Scaled |
---------------------- |----------:|----------:|----------:|-------:|
             DotDouble | 10.341 us | 0.0902 us | 0.0844 us |   1.00 |
  DotDoubleVectorNaive |  5.907 us | 0.0206 us | 0.0183 us |   0.57 |
 DotDoubleVectorBetter |  4.825 us | 0.0197 us | 0.0184 us |   0.47 |

For completeness, RiuJIT will compile the Vector.Dot product on Haswell to:

vmulpd  ymm0,ymm0,ymm1            
vhaddpd ymm0,ymm0,ymm0    
vextractf128 xmm2,ymm0,1                
vaddpd  xmm0,xmm0,xmm2              
vaddsd  xmm6,xmm6,xmm0

Edited to add case with Dot product outside loop as per comment and ASm for dot product..

You're still using `Vector.Dot()` inside the inner loop. That's nasty; use a vector accumulator and do the horizontal sum at the end. — Peter Cordes, Jul 12 '18 at 00:01
Agrred. The point was to show the OP that his benchmarking was pro´lly off by a lot. WIll edit and add the best case. — C. Gonzalez, Jul 12 '18 at 21:02
Or maybe the OP is on an AMD CPU, where `dppd` sucks even more than on Intel, assuming that's what `.Dot()` compiles / JITs to. (One per 3 clocks on Ryzen, vs. 1 per clock on your Haswell: http://agner.org/optimize/). Although unless the loop is unrolled with multiple accumulators, FP-add latency should still be the bottleneck. — Peter Cordes, Jul 12 '18 at 21:18

Vector weak SIMD Performance

1 Answers1

Linked