
I want to write a C# program capable of running basic operations on data read from main memory, so that I can get as close as possible to the main memory read bandwidth.

I guess we can be sure the cache is not used when using very large arrays. So far, using multiple threads and long[], I've never been able to cross the 2 GB/s limit, while I know modern RAM bandwidth is more like 10 GB/s at least. (I have a modern computer and run in 64-bit, in release mode without debugging, of course.)

Can you provide a C# program capable of getting close to the maximum bandwidth? If not, could you explain why a C# program can't do it?

For example:

  • Prepare: create one (or several) large arrays and fill them with random numbers
  • Main step: sum all the elements in the array (or apply any other low-CPU operation); a rough sketch of this baseline follows below
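
Here is a rough sketch of the kind of naive attempt I have in mind (a multi-threaded sum over a large long[]; names and sizes are just illustrative):

using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class NaiveBandwidth
{
    static void Main()
    {
        // Prepare: a large array, far bigger than any cache, filled with random numbers.
        long[] data = new long[100000000]; // ~800 MB
        var rng = new Random(42);
        for (int i = 0; i < data.Length; i++) data[i] = rng.Next();

        // Main step: sum all elements, split across the available cores.
        var sw = Stopwatch.StartNew();
        long total = 0;
        int threads = Environment.ProcessorCount;
        Parallel.For(0, threads, t =>
        {
            int chunk = data.Length / threads;
            int start = t * chunk;
            int end = (t == threads - 1) ? data.Length : start + chunk;
            long sum = 0;
            for (int i = start; i < end; i++) sum += data[i];
            Interlocked.Add(ref total, sum);
        });
        sw.Stop();

        long bytes = (long)data.Length * 8; // 8 bytes per long
        Console.WriteLine("{0} {1:0.000} GB/s", total, bytes / sw.Elapsed.TotalSeconds / (1024.0 * 1024 * 1024));
    }
}
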
Benoit Sanchez
  • Do you mean you want to get the memory details, or are we talking about benchmarking here? – Software Dev May 05 '18 at 14:52
  • `Can you provide a C# program capable of` Stackoverflow is not a code writing service, nor a tutorial site. Please read [ask] and take the [tour] – Ňɏssa Pøngjǣrdenlarp May 05 '18 at 14:56
  • @Zack: I want to benchmark a program that reads and uses the content of the RAM. This is not about benchmarking some specific RAM device, if that was your question. It's more about C# benchmarking than RAM benchmarking. – Benoit Sanchez May 05 '18 at 15:03
  • I am asking for clarification; I wouldn't have if your question had been clear enough and contained more detail – Software Dev May 05 '18 at 15:25

2 Answers


Assuming you mean single-threaded bandwidth, that's fairly easy, for example like this:

uint[] data = new uint[10000000 * 32];
for (int j = 0; j < 15; j++)
{
    uint sum = 0;
    var sw = Stopwatch.StartNew();
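    // Stride of 64 uints = 256 bytes = 4 cache lines per iteration;
    // the four reads below touch one uint in each of those lines.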
    for (uint i = 0; i < data.Length; i += 64)
    {
        sum += data[i] + data[i + 16] + data[i + 32] + data[i + 48];
    }
    sw.Stop();
    long dataSize = data.Length * 4;
    Console.WriteLine("{0} {1:0.000} GB/s", sum, dataSize / sw.Elapsed.TotalSeconds / (1024 * 1024 * 1024));
}

On my machine I get around 19.8-20.1 GB/s from this, and I know the single-threaded bandwidth is supposed to be around 20 GB/s so that seems fine. The multithreaded bandwidth on my machine is actually higher, around 30 GB/s, but that would take a more complex test that coordinates at least two threads.

Some tricks are necessary in this benchmark. Most importantly, I rely on a cache line size of 64 bytes to be able to skip doing anything with most of the data. Since the code still touches every cache line (minus, perhaps, one or two at the start and end due to the array not necessarily being 64-aligned), the entire array will be transferred from memory. Just in case it mattered (it did change the results a little, so I kept it), I unrolled the loop by 4 and made the index variable unsigned to avoid pointless movsx instructions. Saving operations is important, especially with scalar code like this, to avoid making the CPU the bottleneck instead of memory bandwidth.

However, this does not really benchmark the total memory bandwidth the system has available; on my system that is not reachable from a single core. There are certain microarchitectural details that can limit the memory bandwidth available to a single core to less than the total memory bandwidth the whole processor has. You can read about various details in this answer by BeeOnRope.

harold
  • Many thanks. That's exactly what I wanted to know. It works at 13 GB/s on my computer. I need to give it more thought. I wonder if a multi-threaded version could really perform operations on all elements of the array while staying close to the bandwidth max (since there should be less of a CPU bottleneck). I'll work this out. – Benoit Sanchez May 05 '18 at 16:00
  • @BenoitSanchez the `System.Numerics.Vectors` API is not very powerful, but perhaps it can help you write some code that can do more operations while still maxing out memory bandwidth (a rough sketch follows below) – harold May 05 '18 at 16:05
  • My multi-threaded bandwidth is 20 GB/s. Not only could I reach it, but the multi-threaded version is actually capable of reading all elements in the array(s) without much loss (around 17 GB/s), even with the raw "for (uint i = 0; i < data.Length; i++)" version. Thanks a lot. I was not aware that you must carefully keep the CPU work very "light". Thanks for the Vectors idea. I'll have a look. – Benoit Sanchez May 05 '18 at 16:42
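
For reference, a minimal sketch of what the System.Numerics idea from these comments might look like (an illustration, not code from either answer; Vector<T> uses the hardware's SIMD width where available and falls back to scalar code otherwise):

using System;
using System.Diagnostics;
using System.Numerics; // the System.Numerics.Vectors package on .NET Framework

class VectorSum
{
    static void Main()
    {
        uint[] data = new uint[10000000 * 32]; // same 1.28 GB array as above
        int width = Vector<uint>.Count;        // number of SIMD lanes on this machine

        for (int j = 0; j < 15; j++)
        {
            var acc = Vector<uint>.Zero;
            var sw = Stopwatch.StartNew();
            // Read and add every element, one full vector per step.
            for (int i = 0; i <= data.Length - width; i += width)
            {
                acc += new Vector<uint>(data, i);
            }
            sw.Stop();
            uint sum = 0;
            for (int k = 0; k < width; k++) sum += acc[k]; // horizontal sum
            long dataSize = (long)data.Length * 4;
            Console.WriteLine("{0} {1:0.000} GB/s", sum, dataSize / sw.Elapsed.TotalSeconds / (1024 * 1024 * 1024));
        }
    }
}

Unlike the strided benchmark above, this reads and adds every element, so it tests whether real work can keep up with the memory stream.
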

This is the multi-threaded version that follows @harold's (very good) answer.

The for loop reading one element out of 16 reaches the multi-threaded bandwidth. But the basic for loop reading all elements is not far behind, because the CPU bottleneck is less of a problem in the multi-threaded version.

int N = 64;
uint[][] data = new uint[N][];
for (int k = 0; k < N; k++)
{
    data[k] = new uint[1000000 * 32];
}
for (int j = 0; j < 15; j++)
{
    long total = 0;
    var sw = Stopwatch.StartNew();
    Parallel.For(0, N, delegate (int k)
    {
        uint sum = 0;
        uint[] d = data[k];
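        // The commented-out loop is the strided read (one element per cache
        // line) from the single-threaded benchmark; the plain loop below
        // reads every element and still comes close to the bandwidth limit.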
        //for (uint i = 0; i < d.Length; i += 64)
        //{
        //    sum += d[i] + d[i + 16] + d[i + 32] + d[i + 48];
        //}
        for (uint i = 0; i < d.Length; i++)
        {
            sum += d[i];
        }
        Interlocked.Add(ref total, sum);
    });
    sw.Stop();
    long dataSize = (long)data[0].Length * N * 4;
    Console.WriteLine("{0} {1:0.000} GB/s", total, dataSize / sw.Elapsed.TotalSeconds / (1024 * 1024 * 1024));
}

For information, measurements on my laptop:

  • single-threaded bandwidth: 13 GB/s
  • multi-threaded bandwidth: 20 GB/s
  • multi-threaded reading all elements: 17 GB/s
Benoit Sanchez