I've been working on a deep learning library that I'm writing on my own. For matrix operations, getting the best performance is key for me. I've been researching programming languages and their performance on numeric operations, and after a while I found that C# SIMD has very similar performance to C++ SIMD. So, I decided to write the library in C#.
First, I tested C# SIMD (I tested a lot of things, but I won't list them all here). I noticed that it works much better with smaller arrays; the efficiency is not good with bigger arrays. That seems odd to me: normally things get more efficient, not less, as they get bigger.
My question is: "Why does vectorization perform worse with bigger arrays in C#?"
I am going to share the benchmarks, which I ran myself using BenchmarkDotNet.
Program.Size = 10
| Method | Mean | Error | StdDev |
|------- |----------:|----------:|----------:|
| P1 | 28.02 ns | 0.5225 ns | 0.4888 ns |
| P2 | 154.15 ns | 1.1220 ns | 0.9946 ns |
| P3 | 100.88 ns | 0.8863 ns | 0.8291 ns |
Program.Size = 10000
| Method | Mean | Error | StdDev | Median |
|------- |---------:|---------:|---------:|---------:|
| P1 | 142.0 ms | 3.065 ms | 8.989 ms | 139.5 ms |
| P2 | 170.3 ms | 3.365 ms | 5.981 ms | 170.1 ms |
| P3 | 103.3 ms | 2.400 ms | 2.245 ms | 102.8 ms |
So, as you can see, I increased Size 1000 times, which means the arrays grew 1,000,000 times in size. P2 took 154 ns in the first run and 170 ms in the second, which is the roughly 1,000,000-fold increase we would expect. P3 also scaled almost exactly proportionally (about 100 ns to about 103 ms). However, what I want to point out is that P1, the vectorized loop, scaled significantly worse than that: 28 ns to 142 ms, about five times worse than proportional. I wonder why.
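For reference, here are the ratios spelled out (the per-matrix data sizes are my own back-of-the-envelope numbers: 10 * 10 * 4 B = 400 B per matrix at Size = 10 versus 10000 * 10000 * 4 B ≈ 400 MB per matrix at Size = 10000):
| Method | Size = 10 | Size = 10000 | Ratio |
|------- |----------:|-------------:|------------:|
| P1 | 28.02 ns | 142.0 ms | ~5,100,000x |
| P2 | 154.15 ns | 170.3 ms | ~1,100,000x |
| P3 | 100.88 ns | 103.3 ms | ~1,020,000x |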
Note that P3 is independent of this topic. P1 is the vectorized version of P2, so we can say the efficiency of the vectorization is the ratio of the times they took, P2/P1. My code is below:
Matrix class:
using System;
using System.Buffers;
using System.Runtime.CompilerServices;

public sealed class Matrix1 : IDisposable
{
    public float[] Array;
    public int D1, D2;

    // One shared pool, large enough for the biggest matrix the library allows.
    const int size = 110000000;
    private static ArrayPool<float> sizeAwarePool = ArrayPool<float>.Create(size, 100);

    public Matrix1(int d1, int d2)
    {
        D1 = d1;
        D2 = d2;
        if (D1 * D2 > size)
        { throw new Exception("Size!"); }
        Array = sizeAwarePool.Rent(D1 * D2);
    }

    bool Deleted = false;

    public void Dispose()
    {
        sizeAwarePool.Return(Array);
        Deleted = true;
    }

    ~Matrix1()
    {
        // Guard against forgetting to return the rented array to the pool.
        if (!Deleted)
        {
            throw new Exception("Error!");
        }
    }

    public float this[int x, int y]
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get
        {
            return Array[x * D2 + y];
        }
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        set
        {
            Array[x * D2 + y] = value;
        }
    }
}
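(A side note on the pooling, in case it matters: as far as I know, ArrayPool<T>.Rent only guarantees an array at least as long as requested, and the contents are not cleared when the buffer is returned. A small standalone sketch, independent of the class above, illustrating that:)

using System;
using System.Buffers;

class ArrayPoolDemo
{
    static void Main()
    {
        // A pool configured like the one above, just smaller.
        var pool = ArrayPool<float>.Create(maxArrayLength: 1 << 20, maxArraysPerBucket: 100);

        float[] buffer = pool.Rent(100);   // ask for 100 floats...
        Console.WriteLine(buffer.Length);  // ...typically prints 128, since buckets come in power-of-two sizes

        pool.Return(buffer);               // contents are not cleared unless clearArray: true is passed
    }
}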
Program class:
using System;
using System.Numerics;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Program
{
    const int Size = 10000;

    [Benchmark]
    public void P1()
    {
        Matrix1 a = Program.a, b = Program.b, c = Program.c;
        int sz = Vector<float>.Count;
        // Vectorized element-wise addition over the flat backing arrays.
        for (int i = 0; i < Size * Size; i += sz)
        {
            var v1 = new Vector<float>(a.Array, i);
            var v2 = new Vector<float>(b.Array, i);
            var v3 = v1 + v2;
            v3.CopyTo(c.Array, i);
        }
    }

    [Benchmark]
    public void P2()
    {
        Matrix1 a = Program.a, b = Program.b, c = Program.c;
        // Scalar element-wise addition through the indexer.
        for (int i = 0; i < Size; i++)
            for (int j = 0; j < Size; j++)
                c[i, j] = a[i, j] + b[i, j];
    }

    [Benchmark]
    public void P3()
    {
        Matrix1 a = Program.a;
        for (int i = 0; i < Size; i++)
            for (int j = 0; j < Size; j++)
                a[i, j] = i + j - j;
        // could have written a.Array[i * Size + j] = i + j,
        // but it would have made no difference in terms of performance,
        // so I left it that way
    }

    public static Matrix1 a = new Matrix1(Size, Size);
    public static Matrix1 b = new Matrix1(Size, Size);
    public static Matrix1 c = new Matrix1(Size, Size);

    static void Main(string[] args)
    {
        for (int i = 0; i < Size; i++)
            for (int j = 0; j < Size; j++)
                a[i, j] = i;
        for (int i = 0; i < Size; i++)
            for (int j = 0; j < Size; j++)
                b[i, j] = j;
        for (int i = 0; i < Size; i++)
            for (int j = 0; j < Size; j++)
                c[i, j] = 0;

        var summary = BenchmarkRunner.Run<Program>();

        a.Dispose();
        b.Dispose();
        c.Dispose();
    }
}
I assure you that using x[i, j] doesn't affect the performance; it is the same as using x.Array[i * Size + j].
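(If anyone wants to verify that claim, here is a minimal extra benchmark that could be dropped next to P2 inside the Program class; the P2_Flat name is just mine for illustration:)

[Benchmark]
public void P2_Flat()
{
    Matrix1 a = Program.a, b = Program.b, c = Program.c;
    // Same work as P2, but indexing the backing array directly
    // instead of going through the this[int, int] indexer.
    for (int i = 0; i < Size; i++)
        for (int j = 0; j < Size; j++)
            c.Array[i * Size + j] = a.Array[i * Size + j] + b.Array[i * Size + j];
}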