
I've been writing a deep learning library on my own. For matrix operations, getting the best performance is key for me. I've been researching programming languages and their performance on numeric operations, and after a while I found that C# SIMD has very similar performance to C++ SIMD, so I decided to write the library in C#.

First, I tested C# SIMD (I tested a lot of things, but I won't write about all of them here). I noticed that it worked a lot better with smaller arrays; the efficiency is not good with bigger arrays, which seems ridiculous to me. I'd normally expect things to get more efficient, not less, as they get bigger.

My question is: "Why does vectorization perform so much worse with bigger arrays in C#?"

Here are my benchmarks (done by myself) using BenchmarkDotNet.

Program.Size = 10

| Method |      Mean |     Error |    StdDev |
|------- |----------:|----------:|----------:|
|     P1 |  28.02 ns | 0.5225 ns | 0.4888 ns |
|     P2 | 154.15 ns | 1.1220 ns | 0.9946 ns |
|     P3 | 100.88 ns | 0.8863 ns | 0.8291 ns |

Program.Size = 10000

| Method |     Mean |    Error |   StdDev |   Median |
|------- |---------:|---------:|---------:|---------:|
|     P1 | 142.0 ms | 3.065 ms | 8.989 ms | 139.5 ms |
|     P2 | 170.3 ms | 3.365 ms | 5.981 ms | 170.1 ms |
|     P3 | 103.3 ms | 2.400 ms | 2.245 ms | 102.8 ms |

So as you can see, I increased Size by 1000x, which means the arrays are 1,000,000 times larger. P2 took 154 ns in the first test; in the second test it took 170 ms, which is the roughly 1,000,000x increase we would expect. P3 also took almost exactly 1,000,000 times longer (100 ns → 100 ms). However, what I want to highlight is that P1, the vectorized loop, took significantly more than 1,000,000 times longer (28 ns → 142 ms, about 5,000,000x). I wonder why.

Note that P3 is independent of this topic; P1 is the vectorized version of P2. So we can say the efficiency (speedup) of vectorization is P2/P1 in terms of the time they took. My code is below:

Matrix class:

public sealed class Matrix1
{
    public float[] Array;
    public int D1, D2;
    const int size = 110000000;
    private static ArrayPool<float> sizeAwarePool = ArrayPool<float>.Create(size, 100);

    public Matrix1(int d1, int d2)
    {
        D1 = d1;
        D2 = d2;
        if(D1*D2 > size)
        { throw new Exception("Size!"); }
        Array = sizeAwarePool.Rent(D1 * D2);
    }

    bool Deleted = false;
    public void Dispose()
    {
        sizeAwarePool.Return(Array);
        Deleted = true;
    }

    ~Matrix1()
    {
        if(!Deleted)
        {
            throw new Exception("Error!");
        }
    }

    public float this[int x, int y]
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get
        {
            return Array[x * D2 + y];
        }
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        set
        {
            Array[x * D2 + y] = value;
        }
    }
}

Program Class:

public class Program
{
    const int Size = 10000;

    [Benchmark]
    public void P1()
    {
        Matrix1 a = Program.a, b = Program.b, c = Program.c;
        int sz = Vector<float>.Count;
        for (int i = 0; i < Size * Size; i += sz)
        {
            var v1 = new Vector<float>(a.Array, i);
            var v2 = new Vector<float>(b.Array, i);
            var v3 = v1 + v2;
            v3.CopyTo(c.Array, i);
        }

    }

    [Benchmark]
    public void P2()
    {
        Matrix1 a = Program.a, b = Program.b, c = Program.c;
        for (int i = 0; i < Size; i++)
            for (int j = 0; j < Size; j++)
                c[i, j] = a[i, j] + b[i, j];
    }
    [Benchmark]
    public void P3()
    {
        Matrix1 a = Program.a;
        for (int i = 0; i < Size; i++)
            for (int j = 0; j < Size; j++)
                a[i, j] = i + j - j; 
                //could have written a.Array[i*size + j] = i + j
                //but it would have made no difference in terms of performance.
                //so leave it that way
    }


    public static Matrix1 a = new Matrix1(Size, Size);
    public static Matrix1 b = new Matrix1(Size, Size);
    public static Matrix1 c = new Matrix1(Size, Size);

    static void Main(string[] args)
    {
        for (int i = 0; i < Size; i++)
            for (int j = 0; j < Size; j++)
                a[i, j] = i;
        for (int i = 0; i < Size; i++)
            for (int j = 0; j < Size; j++)
                b[i, j] = j;
        for (int i = 0; i < Size; i++)  
            for (int j = 0; j < Size; j++)
                c[i, j] = 0;

        var summary = BenchmarkRunner.Run<Program>();
        a.Dispose();
        b.Dispose();
        c.Dispose();
    }
}     

I assure you that x[i, j] doesn't affect the performance; it's the same as using x.Array[i * Size + j].
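For reference, here is a minimal sketch of the flat-index variant I mean (the benchmark name is mine; it does exactly the same work as P2, just indexing the backing array directly):

[Benchmark]
public void P2_FlatIndex()
{
    Matrix1 a = Program.a, b = Program.b, c = Program.c;
    // Same scalar addition as P2, but skipping the this[int, int] indexer.
    for (int i = 0; i < Size; i++)
        for (int j = 0; j < Size; j++)
            c.Array[i * Size + j] = a.Array[i * Size + j] + b.Array[i * Size + j];
}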

  • P1 is using "new" which is calling a constructor for a class which adds significant time. – jdweng Aug 16 '19 at 09:29
  • Haven't looked at your code yet, but larger sizes are probably bottlenecked on memory bandwidth with SIMD. i.e. there's not much room for speedup before you hit a memory bandwidth bottleneck. But with data hot in L1d cache that can keep up with the SIMD ALUs, you can get close to the full 8x or whatever speedup over scalar, e.g. for AVX 8x 32-bit elements. – Peter Cordes Aug 16 '19 at 09:30
  • I do wonder if it's for the same reason as [here](https://stackoverflow.com/q/57458460/11683). – GSerg Aug 16 '19 at 09:33
  • @GSerg: nope, it's using `BenchmarkRunner.Run()` which will prevent optimization between different invocations of `P1()`. It can't hoist or sink the actual work out of the benchmark repeat-loop and only do it once. – Peter Cordes Aug 16 '19 at 09:50
  • @jdweng If you look into the assembly code in depth, new Vector doesn't create an object. The Vector class is totally different. You should search for C# SIMD. – Faruk NANE Aug 16 '19 at 10:14
  • @petercordes So what you mean is that when I don't use SIMD, it writes to memory one element at a time, and when I use SIMD, it writes 8 at a time. The memory is being used more heavily, so in the large-array case it might be bottlenecked. Am I right? And in the GPU case (just to make the problem clear: if I used a GPU to do the calculation), the GPU memory is much faster, so we might not see a memory bottleneck like this. – Faruk NANE Aug 16 '19 at 10:20
  • @petercordes I was able to decrease the time of P1 from 140 ms to 120 ms when I use jagged arrays with vectors. Interesting... maybe it's because jagged arrays are faster than a 1-dimensional array? But it is somehow even more interesting now: there are many arrays on one side and a single array on the other, and one array should give better performance. I will test jagged arrays with smaller sizes to see if they are really faster in every case. – Faruk NANE Aug 16 '19 at 10:30
  • A vector is a class object and "new" calls the class constructor (and default constructor). – jdweng Aug 16 '19 at 10:31
  • @jdweng: The constructor is basically an intrinsic for a SIMD load instruction like `vmovups` or `vmovdqu`. And the destructor is trivial. (At least that's my understanding.) i.e. this is pretty much equivalent to C `_mm256_loadu_ps(array + i);` – Peter Cordes Aug 16 '19 at 10:42
  • A jagged array is normally slower than one flat array, but in the large-array case the jagged array seems to be faster. Maybe it is because a jagged array uses a different part of memory for each row array, so it is less memory-bottlenecked than one big array. – Faruk NANE Aug 16 '19 at 11:02
  • On net core 3.0 preview 8 I get different results (Win10): although there is a loss of performance for the case of size=10000, the vectorized P1 still shows a +/- 3.6x improvement over P2. – C. Gonzalez Aug 20 '19 at 22:30
  • I am not an expert on this topic, but I can share a bit of what I've learnt so far. Before anything else, it depends on your system: maybe your CPU is slow compared to your memory, or your memory is slow compared to your CPU, or it's your CPU's cache, etc. Small note: I also used the latest .NET Core 3.0 preview. – Faruk NANE Aug 21 '19 at 00:58

1 Answer


This might not be the whole story: the OP reports in comments that they sped up P1 from 140 to 120 ms with jagged arrays.

So maybe something extra is holding it back in the large case. I'd use performance counters to investigate and check for ld_blocks_partial.address_alias (4k aliasing -> false dependency of loads on stores). And/or look at the memory addresses you get from C# allocators and maybe see if they're close to but not quite all the same alignment relative to a 4k boundary.
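If you want to check that from C#, here is a minimal sketch (the helper name is mine; it needs unsafe code enabled). It pins each backing array just long enough to read its address and prints the offset within a 4 KiB page, which is what matters for 4k aliasing:

static unsafe void PrintAlignment(string name, float[] arr)
{
    // Pin the array so the GC can't move it while we read its address,
    // then report where it sits within a 4 KiB page.
    fixed (float* p = arr)
    {
        Console.WriteLine($"{name}: address mod 4096 = {(ulong)p % 4096}");
    }
}

// e.g. PrintAlignment("a", Program.a.Array); and the same for b and c.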

I don't think needing 3 hot cache lines in the same set would be a problem; L1d is 8-way associative on any CPU that would give >4x speedups with AVX (i.e. with 256-bit load/store and ALUs). But if all your arrays have the same alignment relative to a 4k boundary, they will all alias the same set in a 32kiB L1d cache when you access the same index.

Oh, here's a theory: Jagged arrays stagger the page walks, instead of all 3 streams (2 src 1 dst) reaching a new page at the same time and all having a TLB miss that requires a walk. Try making sure your code uses 2M hugepages instead of just 4k to reduce TLB misses. (e.g. on Linux you'd use a madvise(buf, size, MADV_HUGEPAGE) system call.)

Check performance counter events for dtlb_load_misses.miss_causes_a_walk and/or dtlb_load_misses.stlb_hit. There is TLB prefetch so having them staggered can allow TLB prefetch to work on one or two in parallel instead of getting hit with all 3 page walks at once.


Large sizes bottleneck on memory bandwidth, not just ALU

SIMD doesn't increase the available memory bandwidth, just how quickly you can get data in and out of cache (and, in practice, how much of the available bandwidth you can actually use). Doing the same work in fewer instructions can also help out-of-order exec see farther ahead and detect TLB misses sooner, though.

The speedup with large arrays is limited because the scalar version is already close-ish to bottlenecked on main memory bandwidth. Your c[i] = a[i] + b[i] access pattern is the STREAM sum access pattern: maximum memory access for only one ALU operation per element. (1D vs. 2D indexing is irrelevant; you're still just reading/writing contiguous memory and doing pure vertical float addition, explicitly SIMD in the P1 case.)
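As a rough sanity check with your own numbers: each 10000x10000 float matrix is 4 * 10^8 bytes ≈ 0.4 GB, so one P1 call streams about 1.2 GB (two sources read plus one destination written, and more once you count the read-for-ownership traffic on the destination lines before they're overwritten). 1.2 GB / 142 ms ≈ 8.5 GB/s, which is already in the ballpark of what a single core can sustain from DRAM, so there isn't much headroom left for SIMD to exploit.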

With small matrices (10x10 = 100 float = 400 bytes * (2 sources + 1 dst) = 1.2kB), your data can stay hot in L1d cache so cache misses won't bottleneck your SIMD loop.

With your src + dst hot in L1d cache, you can get close to the full 8x speedup over scalar with AVX (8x 32-bit elements per vector), assuming a Haswell or later CPU with peak load+store throughput of 2x 32-byte vector loads + 1x 32-byte vector store per clock cycle.

In practice you got 154.15 / 28.02 = ~5.5 for the small-matrix case.

Actual cache limitations apparently preclude reaching that theoretical peak: e.g. Intel's optimization manual lists ~81 bytes / clock cycle typical sustained load + store bandwidth for Skylake's L1d cache. But with GP-integer loads + stores, Skylake can sustain 2 loads + 1 store per cycle for 32-bit operand-size, with the right loop. So there's some kind of microarchitectural limit other than load/store uop throughput that slows down vector load/store somewhat.
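To put that in perspective: ~81 bytes / cycle at a ~4 GHz clock is on the order of 300 GB/s of L1d bandwidth per core, versus the roughly 10 to 20 GB/s a single core typically sustains from DRAM. That gap is why the 10x10 case can scale with SIMD width while the 10000x10000 case can't. (Rough round numbers, not measurements from your machine.)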


You didn't say what hardware you have, but I'm guessing it's Intel Haswell or later. "Only" 5.5x speedup might be due to benchmark overhead for only doing 12 or 13 loop iterations per call.

(100 elements / 8 elem/vec = 12.5. So 12 if you leave the last 4 elements not done, or 13 if you overread by 4 because your loop condition isn't i < Size * Size - sz + 1)
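If you want to rule that leftover/overread handling out, here is an untested sketch of a tail-safe version of P1 (the method name is mine; everything else uses the names from the question). It runs the vector loop only while a full Vector<float> still fits, then finishes the last few elements with a scalar loop:

public void P1_WithTail()
{
    Matrix1 a = Program.a, b = Program.b, c = Program.c;
    int sz = Vector<float>.Count;
    int n = Size * Size;
    int i = 0;
    // Full vectors only: stop while a whole Vector<float> still fits.
    for (; i <= n - sz; i += sz)
    {
        var v1 = new Vector<float>(a.Array, i);
        var v2 = new Vector<float>(b.Array, i);
        (v1 + v2).CopyTo(c.Array, i);
    }
    // Scalar tail for the remaining elements (at most sz - 1 of them).
    for (; i < n; i++)
        c.Array[i] = a.Array[i] + b.Array[i];
}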

Zen's 2x 16-byte memory ops per clock (up to one of which can be a store) would slow down both scalar and AVX equally. But you'd still get at best 4x speedup going from 1 element per vector with movss / addss xmm, mem / movss to the same uops doing 4 elements at once. Using 256-bit instructions on Zen 1 just means 2 uops per instruction, with the same 2 memory uops per clock throughput limit. Better front-end throughput from using 2-uop instructions, but that's not the bottleneck here. (Assuming the compiler can make a loop in 5 uops or less it can issue at 1 iter per clock, and couldn't even run that fast because of the back-end bottleneck on load/store ports.)

Those results would also make sense on a Zen 2, I think: 256-bit SIMD execution units and I think also load/store ports mean that you can expect up to 8x speedups when doing 8x the amount of work per instruction.

Peter Cordes
  • Thank you for your detailed answer. It's nice to see a qualified person here. I have a Kaby Lake CPU, an i7-7700HQ. When I set Size to 8 (64 elements / 8 = 8 loop iterations per call), the performance gain was ~5.85. When Size = 10, it is ~5.5. When Size = 32, it is ~5.35. I hope to read your answer in detail today; I'm not capable of fully understanding it right now, but I understood the memory bottleneck issue ^^ – Faruk NANE Aug 16 '19 at 10:59
  • @FarukNANE: I just added a section with some guessword about jagged arrays. – Peter Cordes Aug 16 '19 at 11:00
  • To be honest, I really appreciate your answer, but I am a new sophomore student. I really don't know some of the terms you mentioned, such as 4k aliasing, ALU, TLB, or what the cache really does. I think my brain is gonna explode ^^. I learned (a bit of) assembly language yesterday ^^. I've been researching optimization for 1-2 weeks. That's me, sorry :D. But I think I understood the concept here. – Faruk NANE Aug 16 '19 at 11:14
  • @FarukNANE: ok, then yeah, scalar already comes close to main memory bandwidth is a detailed enough picture for you. See also [How can cache be that fast?](//electronics.stackexchange.com/q/329789) for some IvyBridge read/write/copy bandwidth numbers to get an idea. Anyway, hopefully at least some other readers of your question (now and in future) will get more out of my answer if they have more background in computer architecture. – Peter Cordes Aug 16 '19 at 11:20
  • @FarukNANE: I try to make the key points understandable to as many people as possible in my answers, as well as having my own fun getting super technical :) I think that worked out in this case since you did understand the memory bandwidth point. Also, thanks for checking the 32x32 case. I thought that might see a bigger speedup since `32 * 32 * 4 * 3 = ~12kiB` which will still fit in the smallest / fastest data cache (L1d) that's closest to the CPU core. – Peter Cordes Aug 16 '19 at 11:24
  • Yeah, that worked out ^^. Here is one more question: "is it possible to do SIMD on every core?" We have an array; we could create 4 threads that each take a part of the array and do the operation with SIMD. It would be like a 20x performance gain :D I will try this when I add parallelism to my library, but not now sadly. – Faruk NANE Aug 16 '19 at 11:36
  • @FarukNANE: yes, SIMD is orthogonal to thread-level parallelism. (And to instruction-level parallelism in superscalar CPUs.) But remember that thread startup overhead is *large*, so it's only worth it if it's amortized over lots of work in each thread (large arrays). Even communication overhead with an already-running worker thread is larger than the cost of a small 100 element array add. Each core has private L1 and L2 caches, but compete for L3 and main memory bandwidth. So multithreading lets you benefit from more cache if 1/4 of your array fits in L1d but the whole thing doesn't. – Peter Cordes Aug 16 '19 at 11:42
  • @FarukNANE: Also, on a quad-core Intel desktop/laptop like yours, a single core can come close to saturating DRAM so again there's probably not much room for speedups over a single thread for the large-array case. On a big Xeon, you would actually get big speedups from multithreading with a big array. [Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?](//stackoverflow.com/q/39260020) - it takes multiple cores to saturate DRAM because there are other limits that you hit first for single-threaded memory bandwidth. – Peter Cordes Aug 16 '19 at 11:44
  • Seems like there are a lot of important things to take care of. Yeah, as you mentioned, on a single core I get bottlenecked with this CPU and memory, so quad core would not make a difference at all. Thank you a lot. That's it for today ^^ I learned a lot. I understand that optimizing is really hard to do, and designing systems/libraries to be fast requires knowledge of low-level optimization as well. Thank you again. – Faruk NANE Aug 16 '19 at 12:01
  • Hi again, I am sorry to keep you busy, but something is bothering me. I can't ask a question on Stack Overflow, I don't know why. Anyway, I have C++ and C# code, both compiled with optimization flags, and in both I used AVX vectors. They give the same performance; is C++ supposed to be faster, or do they give the same speed because the compilers emit the vector code in assembly the same way? – Faruk NANE Aug 17 '19 at 11:13
  • The second question: in C++ I wrote a normal sum with a for loop and a sum with AVX vectors, and both gave the same performance, taking the same amount of time. Why? Does the C++ compiler already do the optimization by vectorizing the loop, so that they give the same performance? Am I right? – Faruk NANE Aug 17 '19 at 11:16
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/198080/discussion-between-faruk-nane-and-peter-cordes). – Faruk NANE Aug 17 '19 at 11:19
  • @FarukNANE: For a specific loop over an array, you can get C# to generate efficient machine code these days, like C++. When C# is slower, it's because it's doing more work that requires more complicated machine code. (e.g. because it's a managed language: garbage collection overhead and so on). Or because the optimizer isn't as good as GCC or Clang's, or auto-vectorization worse than ICC. With a decent C++ compiler, you don't *have* to manually use `Vector` or `__m256`, you'd get SIMD machine code even from plain scalar source. (That would be your 2nd question). – Peter Cordes Aug 17 '19 at 13:47
  • @FarukNANE: I think Stack Overflow puts a rate-limit on questions from new users, especially if you've had posts that get downvotes or close votes. I didn't mind answering about C++ because some of that was related to this question, but generally you should wait until you can ask a new question, and spend some of time googling to find the answer to your own question. – Peter Cordes Aug 17 '19 at 13:49
  • I have a really strange issue you might wanna look at. It's so simple: I do a dot product with two arrays whose sizes are (7168*7168), and it takes 36 ms. However, when I change the size to (7169*7169), it takes 29-30 ms. I noticed this when benchmarking against Intel MKL. The code is here: https://github.com/faruknane/Performance-Work/issues/4 please take a look. – Faruk NANE Aug 30 '19 at 18:37
  • @Faruk 7168 has a large power of two as a factor. But it's read only so 4k aliasing shouldn't matter – Peter Cordes Aug 30 '19 at 22:15
  • It happens for size = 7168*2 + 1 as well; it is faster than size = 7168*2. I tried to analyze the assembly code with Intel VTune; it's the same, yet somehow one is faster than the other. At first it didn't seem to affect the Intel MKL library: for size = 7169 our performance is the same, while for size = 7168 my library is slow. But now I tested again and it affects Intel MKL too: their code is 2-3 ms slower, my code is 6 ms slower. – Faruk NANE Aug 31 '19 at 10:16
  • @faruk sounds like a cache issue of some sort. – Peter Cordes Aug 31 '19 at 20:37