13

I am trying to measure the DDR3 memory data transfer rate through a test. According to the CPU spec, the maximum theoretical bandwidth is 51.2 GB/s. This should be the combined bandwidth of four channels, i.e. 12.8 GB/s per channel. However, this is a theoretical limit, and in this post I am curious how to get closer to it in practice. In the test scenario described below I achieve a ~14 GB/s data transfer rate, which I believe may be a close approximation when killing most of the throughput boost of the CPU's L1, L2, and L3 caches.

Update 20/3 2014: The assumption about defeating the L1-L3 caches is wrong. The hardware prefetcher of the memory controller will analyze the data access pattern and, since it is sequential, will have an easy task prefetching data into the CPU caches.

Specific questions follow at the bottom, but mainly I am interested in a) a verification of the assumptions leading up to this result, and b) whether there is a better way of measuring memory bandwidth in .NET.

I have constructed a test in C# on .NET as a starter. Although .NET is not ideal from a memory allocation perspective, I think it is doable for this test (please let me know if you disagree, and why). The test is to allocate an int64 array and fill it with integers. This array should have its data aligned in memory. Then I simply loop over this array using as many threads as I have cores on the machine, read the int64 value from the array, and assign it to a public field in the test class. Since the result field is public, I should avoid the compiler optimising the loop away. Furthermore, and this may be a weak assumption, I think the result stays in a register and is not written to memory until it is overwritten again. Between reads of elements in the array I use a variable Step offset of 10, 100, and 1000 so that multiple values cannot be fetched from the same cache block (64 bytes).

Reading an Int64 from the array should mean a single 8-byte read (plus a bounds check against the array length, which should stay cached). Since data is fetched from memory in 64-byte cache lines, each read from the array should correspond to a 64-byte read from RAM on every loop iteration, given that the read data is not already located in any CPU cache.

Here is how I initialize the data array:

_longArray = new long[Config.NbrOfCores][];
for (int threadId = 0; threadId < Config.NbrOfCores; threadId++)
{
    _longArray[threadId] = new long[Config.NmbrOfRequests];
    for (int i = 0; i < Config.NmbrOfRequests; i++)
        _longArray[threadId][i] = i;
}

And here is the actual test:

GC.Collect();
timer.Start();
Parallel.For(0, Config.NbrOfCores, threadId =>
{
    var intArrayPerThread = _longArray[threadId];
    for (int redo = 0; redo < Config.NbrOfRedos; redo++)
        for (long i = 0; i < Config.NmbrOfRequests; i += Config.Step) 
            _result = intArrayPerThread[i];                        
});
timer.Stop();

Since the summary calculation is quite important for the result, I give this code too (it can be skipped if you trust me...):

var timetakenInSec = timer.ElapsedMilliseconds / (double)1000;
long totalNbrOfRequest = Config.NmbrOfRequests / Config.Step * Config.NbrOfCores*Config.NbrOfRedos; 
var throughput_ReqPerSec = totalNbrOfRequest / timetakenInSec;
var throughput_BytesPerSec = throughput_ReqPerSec * byteSizePerRequest;
var timeTakenPerRequestInNanos = Math.Round(1e6 * timer.ElapsedMilliseconds / totalNbrOfRequest, 1);
var resultMReqPerSec = Math.Round(throughput_ReqPerSec/1e6, 1);
var resultGBPerSec = Math.Round(throughput_BytesPerSec/1073741824, 1);
var resultTimeTakenInSec = Math.Round(timetakenInSec, 1);

Omitting the actual output-rendering code, I get the following results:

Step   10: Throughput:   570,3 MReq/s and         34 GB/s (64B),   Timetaken/request:      1,8 ns/req, Total TimeTaken: 12624 msec, Total Requests:   7 200 000 000
Step  100: Throughput:   462,0 MReq/s and       27,5 GB/s (64B),   Timetaken/request:      2,2 ns/req, Total TimeTaken: 15586 msec, Total Requests:   7 200 000 000
Step 1000: Throughput:   236,6 MReq/s and       14,1 GB/s (64B),   Timetaken/request:      4,2 ns/req, Total TimeTaken: 30430 msec, Total Requests:   7 200 000 000

Using 12 threads instead of 6 (since the CPU is hyper-threaded) I get pretty much the same throughput (as expected, I think): 32.9 / 30.2 / 15.5 GB/s.

As can be seen, throughput drops as the step increases, which I think is normal. Partly I think it is because the 12 MB L3 cache forces more cache misses, and partly it may be that the memory controller's prefetch mechanism is not working as well when the reads are so far apart (with a step of 1000, consecutive reads are 8000 bytes apart, so each one also lands on a new 4 KB page). I further believe that the step 1000 result is the closest to the actual practical memory speed, since it should defeat most of the CPU caches and "hopefully" the prefetch mechanism. Furthermore, I am assuming that most of the overhead in this loop is the memory fetch operation and not something else.

The hardware for this test is an Intel Core i7-3930K (specs: CPU brief, more detailed, and really detailed spec) with 32 GB of DDR3-1600 memory in total.

Open questions

  1. Am I correct in the assumptions made above?

  2. Is there a way to increase the use of the memory bandwidth? For instance, by doing it in C/C++ instead and spreading the memory allocations out more on the heap, enabling all four memory channels to be used.

  3. Is there a better way to measure the memory data transfer rate?

Much obliged for input on this. I know it is a complex area under the hood...

All code here is available for download at https://github.com/Toby999/ThroughputTest. Feel free to contact me at a forwarding email: tobytemporary[at]gmail.com.

Toby999
  • Good question, if it had some code with what you tried, what you expected, and what you actually got. – Prashant Kumar Dec 12 '13 at 22:58
  • @Prashant: I think the expected/actually-got are already present (51.2GB/s vs. ~10GB/s). – Oliver Charlesworth Dec 12 '13 at 23:03
  • @Oli Charlesworth Ah, right. So just the code then. – Prashant Kumar Dec 12 '13 at 23:05
  • Ok. Sorry about this. I've updated the content now with code. – Toby999 Dec 13 '13 at 14:19
  • 2
    You'll have a difficult time realizing your full memory bandwidth with .NET. Usually this is reserved for those using SIMD, which .NET doesn't give any access to. – Cory Nelson Dec 13 '13 at 15:30
  • 1
    I just implemented an SSE implementation in C++ as a part of this test project. But memory bandwidth utilisation is still interesting/important to know more about regardless of platform. Maybe converting the same test to C++ would bring better info and more possibilities. That's the number 2 question. :) – Toby999 Dec 13 '13 at 15:36
  • Shouldn't you be dividing by 1048576 to get MB/s? Although as you've divided by 1e9, I guess the divisor should be 1073741824 and the variable named `resultGBPerSec`. – Andrew Morton Dec 17 '13 at 18:50
  • Yes, you are correct. I should use the binary representation for RAM data transfer rate. Though there is no MB measurement in the code given. Only MRequests so I made that clearer in the name too. Thanks. – Toby999 Dec 18 '13 at 14:58
  • You're not getting hit with false sharing, by any chance? – Chris O Dec 18 '13 at 15:44
  • Thanks. That may be a very to the point reason. I have to do some more reading as a follow up, but this paper: http://bit.ly/JIqTVz seems to be indicating exactly this: "[...]the coherence transactions that result when different processors update different words of the same cache block in an interleaved fashion [...] the measurements also show that poor spatial locality among accesses to shared data has an even larger impact". – Toby999 Dec 18 '13 at 18:38
  • 51 GB/s is the bandwidth of a graphics card; I'd be surprised if main memory can actually burst that fast. But well, maybe it can. I'd also like to know what SIMD can improve in memory, since in my understanding SIMD is about CPU instructions and registers, nothing to do with how data transfers from RAM? Lastly, isn't 51 GB/s a marketing figure that can only happen when memory is accessed by 4 threads, each using its own NUMA node? – v.oddou Dec 19 '13 at 05:34
  • Yes, I think you are correct that 51 GB/s can only be reached on very special occasions when each memory channel is used to its maximum in a NUMA configuration, which might be tricky to achieve in .NET. Regarding SIMD execution, it is my understanding that Intel processors still use the L1-L3 CPU caches similarly to normal processing. This is a good thing. Though I have read somewhere that it is possible to bypass the CPU caches for writes, but I am not sure if that is possible for reads as well. If so, I think it could be useful for avoiding cache coherence problems in certain scenarios. A TODO... – Toby999 Dec 20 '13 at 12:30
  • When you write to the same field from many threads you are pinging the cache line between cores. This should be very expensive. Try summing all array elements to a local variable. Summing is a cheap operation. Shove the final sum into GC.KeepAlive. I don't see why you shouldn't be able to max out 51 GB/s with .NET and with 8 threads. That's 6 GB/s per core. You have about 3G instructions per sec. You need an avg. of 2 bytes per cycle, which is easy. Unroll the loop a bit. Move the Config accesses to local variables. Don't trust the JIT to optimize anything. – usr Dec 27 '13 at 14:30
  • Can you host a self-contained code snippet somewhere? I'll try improving it. – usr Dec 27 '13 at 14:32
  • @Toby999 honestly I don't think this can be measured in user mode. The simple fact that you won't get a full second on the CPU before you have filled your tick in most cases (win 8 is different as it's a 'tickless' OS, but even in such an OS you're not likely to get a full second of CPU time contiguously). When you get the CPU back the cache will have been invalidated and you'll also have page faults to deal with which will considerably slow this down. – Mgetz Dec 27 '13 at 14:33
  • Usr: Sorry. I have been really busy with something else at work, so I had to put the lid on this for a while. I've updated the post now, though, with a GitHub project where you can download the whole .NET project if you want to. I hope to have the time soon to test some of the proposals below. – Toby999 Feb 20 '14 at 18:09
  • Mgetz: Thanks for your input. I'll take it into my hopefully upcoming further analysis of this. – Toby999 Feb 20 '14 at 18:10

3 Answers

5

The decrease in throughput as you increase the step is likely caused by the memory prefetching no longer working well when you don't stride linearly through memory.

Things you can do to improve the speed:

  • The test speed will be artificially bound by the loop itself taking up CPU cycles. As Roy's answer shows, more speed can be achieved by unrolling the loop.
  • You should get rid of boundary checking (with "unchecked")
  • Instead of using Parallel.For, use Thread.Start and pin each thread you start on a separate core (using the code from here: Set thread processor affinity in Microsoft .Net)
  • Make sure all threads start at the same time, so you don't measure any stragglers (you can do this by spinning on a memory address that you Interlock.Exchange to a new value when all threads are running and spinning)
  • On a NUMA machine (for example a 2 Socket Modern Xeon), you may have to take extra steps to allocate memory on the NUMA node that a thread will live on. To do this, you need to PInvoke VirtualAllocExNuma
  • Speaking of memory allocations, using Large Pages should provide yet another boost

While .NET isn't the easiest framework to use for this type of testing, it IS possible to coax it into doing what you want.

Thomas Kejser
  • Thanks for this input Thomas. And especially for supporting my hypothesis that it is possible on .NET. :) Sorry I have not had the time to comment on nor try your proposals out yet, but I hope to be able to do so soon. – Toby999 Feb 20 '14 at 18:12
2

Reported RAM results (128 MB) for my bus8thread64.exe benchmark on an i7-3820, with a max memory bandwidth of 51.2 GB/s, vary from 15.6 GB/s with 1 thread and 28.1 GB/s with 2 threads to 38.7 GB/s at 8 threads. The code is:

   void inc1word(IDEF data1[], IDEF ands[], int n)
    {
       int i, j;

       for(j=0; j<passes1; j++)
       {
           for (i=0; i<wordsToTest; i=i+64)
           {
               ands[n] = ands[n] & data1[i   ] & data1[i+1 ] & data1[i+2 ] & data1[i+3 ]
                                 & data1[i+4 ] & data1[i+5 ] & data1[i+6 ] & data1[i+7 ]
                                 & data1[i+8 ] & data1[i+9 ] & data1[i+10] & data1[i+11]
                                 & data1[i+12] & data1[i+13] & data1[i+14] & data1[i+15]
                                 & data1[i+16] & data1[i+17] & data1[i+18] & data1[i+19]
                                 & data1[i+20] & data1[i+21] & data1[i+22] & data1[i+23]
                                 & data1[i+24] & data1[i+25] & data1[i+26] & data1[i+27]
                                 & data1[i+28] & data1[i+29] & data1[i+30] & data1[i+31]
                                 & data1[i+32] & data1[i+33] & data1[i+34] & data1[i+35]
                                 & data1[i+36] & data1[i+37] & data1[i+38] & data1[i+39]
                                 & data1[i+40] & data1[i+41] & data1[i+42] & data1[i+43]
                                 & data1[i+44] & data1[i+45] & data1[i+46] & data1[i+47]
                                 & data1[i+48] & data1[i+49] & data1[i+50] & data1[i+51]
                                 & data1[i+52] & data1[i+53] & data1[i+54] & data1[i+55]
                                 & data1[i+56] & data1[i+57] & data1[i+58] & data1[i+59]
                                 & data1[i+60] & data1[i+61] & data1[i+62] & data1[i+63];
           }
        }
    }

This also measures burst reading speeds, where the max DTR, based on this, is 46.9 GB/s. The benchmark and source code are in:

http://www.roylongbottom.org.uk/quadcore.zip

Results with interesting speeds using L3 caches are in:

http://www.roylongbottom.org.uk/busspd2k%20results.htm#anchor8Thread

Roy Longbottom
  • Forgot to say that each thread has a separate array, allocated as (X = 1 to 8): arrayX = (IDEF *)_aligned_malloc(memoryBytes[sizes-1], 16); IDEF is int or __int64 for the 32- and 64-bit versions. – Roy Longbottom Dec 28 '13 at 19:14
  • Thanks for the input. I'll give your benchmark a spin soon, and perhaps it is good enough for what I need. I apologize that it has taken me so long to get back on this track. Hopefully I'll soon be able to reflect on your work. – Toby999 Feb 20 '14 at 18:14
1

C/C++ would give a more accurate metric of memory performance, as .NET can sometimes do weird things with memory handling and won't give you an accurate picture, since it doesn't use compiler intrinsics or SIMD instructions.

There's no guarantee that the CLR is going to give you anything capable of truly benchmarking your RAM. I'm sure there's probably software already written to do this. Ah, yes, PassMark makes something: http://www.bandwidthtest.net/memory_bandwidth.htm

That's probably your best bet as making benchmarking software is pretty much all they do. Also, nice processor btw, I have the same one in one of my machines ;)

UPDATE (2/20/2014): I remember seeing some code in the XNA Framework that did some heavy-duty optimizations in C# that may give you exactly what you want. Have you tried using "unsafe" code and pointers?

Caleb Everett
  • Thanks Caleb for your input. I'll include it in my hopefully upcoming further investigation on this. And yeah, the processor is nice, but now I have come to realize I need a Haswell based architecture instead in order to be able to try out some AVX2 (SIMD) intrinsics methods. :( – Toby999 Feb 20 '14 at 18:18
  • I have a Haswell CPU in my home computer. Core i7 4770K. I could run the benchmarks for you if you want. – Caleb Everett Feb 21 '14 at 12:30
  • Hmm. Thanks. That would be great. It could give me input on whether it would be worth upgrading. Though it is not really this benchmark but more the full scale of the current investigation I'm doing. But perhaps I can tell you more about it via mail if you are interested. I can be reached at tobytemporary[at]gmail.com (and I'll respond with my real address). – Toby999 Feb 21 '14 at 14:23
  • Regarding unsafe code and pointers: nope, not yet. I could try that, I guess, since I most likely will also test writing this in C++ instead. Though my previous experience is that the C++ compiler alone makes a huge difference compared to the C#/JIT compiler. – Toby999 Feb 21 '14 at 14:44