3

I am optimizing an algorithm and I am considering using Vector over double for a multiply and accumulate operation. The implementation the closest is obviously a Vector.dot(v1, v2);... BUT, why is my code so slow?

namespace ConsoleApp1 {
    class Program {
        public static double SIMDMultAccumulate(double[] inp1, double[] inp2) {

            var simdLength = Vector<double>.Count;
            var returnDouble = 0d;

            // Find the max and min for each of Vector<ushort>.Count sub-arrays 
            var i = 0;
            for (; i <= inp1.Length - simdLength; i += simdLength) {
                var va = new Vector<double>(inp1, i);
                var vb = new Vector<double>(inp2, i);
                returnDouble += Vector.Dot(va, vb);
            }

            // Process any remaining elements
            for (; i < inp1.Length; ++i) {
                var va = new Vector<double>(inp1, i);
                var vb = new Vector<double>(inp2, i);
                returnDouble += Vector.Dot(va, vb);
            }

            return returnDouble;
        }


        public static double NonSIMDMultAccumulate(double[] inp1, double[] inp2) {
            var returnDouble = 0d;

            for (int i = 0; i < inp1.Length; i++) {
                returnDouble += inp1[i] * inp2[i];
            }

            return returnDouble;
        }

        static void Main(string[] args) {
            Console.WriteLine("Is hardware accelerated: " + Vector.IsHardwareAccelerated);

            const int size = 24;
            var inp1 = new double[size];
            var inp2 = new double[size];

            var random = new Random();
            for (var i = 0; i < inp1.Length; i++) {
                inp1[i] = random.NextDouble();
                inp2[i] = random.NextDouble();
            }

            var sumSafe = 0d;
            var sumFast = 0d;

            var sw = Stopwatch.StartNew();
            for (var i = 0; i < 10; i++) {
                sumSafe =  NonSIMDMultAccumulate(inp1, inp2);
            }
            Console.WriteLine("{0} Ticks", sw.Elapsed.Ticks);

            sw.Restart();
            for (var i = 0; i < 10; i++) {
                sumFast = SIMDMultAccumulate(inp1, inp2);
            }
            Console.WriteLine("{0} Ticks", sw.Elapsed.Ticks);

//            Assert.AreEqual(sumSafe, sumFast, 0.00000001);
        }
    }

}

The SIMD version needs around 70% more ticks compared to the nonSIMD version. I am running a Haswell architecture and imho. FMA3 should be implemented! (Release build, x64 prefered).

Any ideas? Thanks guys!

Styp
  • 261
  • 3
  • 12
  • 1
    Benchmark fail, times are completely dominated by jitting overhead. Hard to measure, this is extremely fast code. Put a for(;;) loop around it to run it 10 times to get rid of the jitting overhead and get a feel for the variability. Pick a larger number than 24 to see any real improvement. No FMA in the current code generator btw. – Hans Passant Jul 07 '18 at 17:57
  • I see your point, changed parameters around and got better results, but for my case it is probably not the right approach! – Styp Jul 07 '18 at 18:23
  • 3
    Use a true benchmark framework please, https://github.com/dotnet/BenchmarkDotNet Like Hans mentioned, only if you get rid of overhead the testing becomes useful. – Lex Li Jul 08 '18 at 04:30
  • 1
    I will try to do the BenchmarkDotNet thingy as soon as I have a some sprae time... – Styp Jul 08 '18 at 12:35
  • 1
    @harold deleted his answer, but it makes the important point that you should accumulate a vector of results, instead of narrowing to scalar inside the inner loop. And use multiple accumulators to hide FP latency. – Peter Cordes Jul 08 '18 at 23:48
  • Is the implementation even correct? I don't really know C#, but the way you construct `new Vector(inp1, i);` the same in the main and the remain-loop does not look right. Also, does C# optimize away the construction of `Vector`? – chtz Jul 10 '18 at 07:17

1 Answers1

3

Using BechmarkDotNet, I get almost double performance with SIMD Vector assuming that the input arrays have length (ITEMS = 10000) that is a multiple of Vector.Count:

    [Benchmark(Baseline = true)]
    public double DotDouble()
    {
        double returnVal = 0.0;
        for(int i = 0; i < ITEMS; i++)
        {
            returnVal += doubleArray[i] * doubleArray2[i];
        }
        return returnVal;
    }

    [Benchmark]
    public double DotDoubleVectorNaive()
    {
        double returnVal = 0.0;
        for(int i = 0; i < ITEMS; i += doubleSlots)
        {
           returnVal += Vector.Dot(new Vector<double>(doubleArray, i), new Vector<double>(doubleArray2, i));
        }
        return returnVal;  
    }

    [Benchmark]
    public double DotDoubleVectorBetter()
    {
        Vector<double> sumVect = Vector<double>.Zero;
        for (int i = 0; i < ITEMS; i += doubleSlots)
        {
            sumVect += new Vector<double>(doubleArray, i) * new Vector<double>(doubleArray2, i);
        }
        return Vector.Dot(sumVect, Vector<double>.One);
    }

    BenchmarkDotNet=v0.10.14, OS=Windows 10.0.17134
    Intel Core i7-4500U CPU 1.80GHz (Haswell), 1 CPU, 4 logical and 2 physical cores
    Frequency=1753758 Hz, Resolution=570.2041 ns, Timer=TSC
    .NET Core SDK=2.1.300
      [Host]     : .NET Core 2.1.0 (CoreCLR 4.6.26515.07, CoreFX 4.6.26515.06), 64bit RyuJIT
      DefaultJob : .NET Core 2.1.0 (CoreCLR 4.6.26515.07, CoreFX 4.6.26515.06), 64bit RyuJIT


                Method |      Mean |     Error |    StdDev | Scaled |
---------------------- |----------:|----------:|----------:|-------:|
             DotDouble | 10.341 us | 0.0902 us | 0.0844 us |   1.00 |
  DotDoubleVectorNaive |  5.907 us | 0.0206 us | 0.0183 us |   0.57 |
 DotDoubleVectorBetter |  4.825 us | 0.0197 us | 0.0184 us |   0.47 |

For completeness, RiuJIT will compile the Vector.Dot product on Haswell to:

vmulpd  ymm0,ymm0,ymm1            
vhaddpd ymm0,ymm0,ymm0    
vextractf128 xmm2,ymm0,1                
vaddpd  xmm0,xmm0,xmm2              
vaddsd  xmm6,xmm6,xmm0

Edited to add case with Dot product outside loop as per comment and ASm for dot product..

febstar
  • 17
  • 1
  • 7
C. Gonzalez
  • 689
  • 7
  • 8
  • You're still using `Vector.Dot()` inside the inner loop. That's nasty; use a vector accumulator and do the horizontal sum at the end. – Peter Cordes Jul 12 '18 at 00:01
  • 2
    Agrred. The point was to show the OP that his benchmarking was pro´lly off by a lot. WIll edit and add the best case. – C. Gonzalez Jul 12 '18 at 21:02
  • Or maybe the OP is on an AMD CPU, where `dppd` sucks even more than on Intel, assuming that's what `.Dot()` compiles / JITs to. (One per 3 clocks on Ryzen, vs. 1 per clock on your Haswell: http://agner.org/optimize/). Although unless the loop is unrolled with multiple accumulators, FP-add latency should still be the bottleneck. – Peter Cordes Jul 12 '18 at 21:18
  • Thanks, helped me a lot! – Styp Jul 19 '18 at 11:02