14

I have a recent 12 core Intel CPU (Haswell architecture) and it has 4 memory channels. How many DRAM memory accesses can the machine do in parallel?

For example, suppose I have a program that uses 12 threads, each sitting in a tight loop that reads a single byte from a random memory address over a range too large to fit in cache. I expect all 12 threads will spend almost all of their time waiting for the memory fetch.

Do the threads have to take it in turns to use the DRAM bus?

NOTE: Assume I'm using 1 GB virtual memory pages, so there are no TLB misses.

Andrew Bainbridge
  • This is more about computer engineering than computer science. – Yuval Filmus Jul 21 '17 at 11:14
  • Your program should also be a Parallel program so that it can take full advantage of the CPU capacity. This means writing programs in parallel with something like MPI or OpenMP. – Juniar Jul 21 '17 at 17:35
  • Given that this question is solely about how a specific processor model works, rather than how processors can be made to work, there's no science involved. Therefore I am migrating this question to an engineering site. – Gilles 'SO- stop being evil' Jul 28 '17 at 21:51
  • @Gilles - Just to head off any potential backlash... I didn't put the question on Stackoverflow in the first place because it does fulfil the "tell me what specific problem you are trying to solve" requirement. Meta stackexchange told me to ask computer architecture questions on the CS site (https://meta.stackexchange.com/questions/193961/where-i-should-ask-computer-architecture-questions). I'm quite happy for it to stay here, as long as other people are. I can see it wasn't an ideal fit on CS stackexchange site either. – Andrew Bainbridge Jul 29 '17 at 07:22
  • Oops. I meant "does NOT fulfil". – Andrew Bainbridge Jul 29 '17 at 08:20

1 Answer

18

The Intel datasheets almost answer this question.

My first clue was a question on the Intel forums: https://communities.intel.com/thread/110798

Jaehyuk.Lee, 01-Feb-2017 09:27 asked almost the same question as me:

The Second question is about simultaneous requests on IMC and its support on brand-new CPU models such as skylake and kaby-lake http://www.intel.com/Assets/PDF/datasheet/323341.pdf Following the above link, "The memory controller can operate on up to 32 simultaneous requests(reads and writes)" I would like to know how many simultaneous requests are supported in skylake and kabylake CPUs. I've already checked the 6th and 7th generation of the Intel CPU datasheet, but I cannot find any information.

The link is dead. But his "32" figure sounds plausible.

An Intel staff member responded, quoting from 6th Generation Intel® Processor Families for S-Platforms, Vol 1:

The memory controller has an advanced command scheduler where all pending requests are examined simultaneously to determine the most efficient request to be issued next. The most efficient request is picked from all pending requests and issued to system memory Just-in-Time to make optimal use of Command Overlapping. Thus, instead of having all memory access requests go individually through an arbitration mechanism forcing requests to be executed one at a time, they can be started without interfering with the current request allowing for concurrent issuing of requests. This allows for optimized bandwidth and reduced latency while maintaining appropriate command spacing to meet system memory protocol.

Annoyingly, the datasheet for my Xeon E5-2670 v3 does not contain an equivalent section.

Another part of the answer is that the E5-2670 has 4 DDR channels. Memory is interleaved across the channels at a 256-byte granularity to optimize bandwidth. In other words, if you read a 1024-byte block starting at address 0, the first 256 bytes are fetched from the DIMM on channel 0, bytes 256 to 511 from the DIMM on channel 1, and so on.
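As a rough illustration of that interleave (assuming a plain round-robin of 256-byte blocks across the 4 channels, and ignoring any address hashing the real controller may do), the channel servicing a physical address would be something like:

#include <stdint.h>

/* Sketch only: assumes a simple 256-byte round-robin across 4 channels.
   The real controller may hash higher address bits as well. */
static unsigned channel_for_address(uint64_t phys_addr)
{
    return (phys_addr >> 8) & 3;   /* bits 9:8 pick one of the 4 channels */
}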

Putting the two together, I suspect that the memory controller can do 4 reads in parallel and is smart enough that if 4 or more threads are waiting on reads that map into 4 different DIMMs, it will do them in parallel. And it has enough hardware to keep up to about 32 reads/writes in its scheduling table.

I can think of another possible way to achieve parallelism. Each DDR channel has its own data and address buses. When the memory controller requests a read, it uses the address lines + some control lines to request the read and then waits for the response. For a random read, typically there are two waits - the RAS to CAS delay and CAS latency - about 15 cycles each. Instead of leaving the address lines idle, you could imagine the memory controller starting another read from a different DIMM (*) during these wait periods. I have no idea if this is done.

* Actually, according to this Anandtech article there is more parallelism in the DRAM hardware than just having multiple DIMMs per channel. Each DIMM might have multiple ranks, and each rank has many banks. I think you could switch to another rank or bank within a DIMM to perform another access in parallel.

EDIT

I measured that my machine can do at least 6 random accesses in parallel, despite only having 4 memory channels. So a single memory channel can perform 2 or more random accesses in parallel, perhaps using the scheme I described in the paragraph above.

To get this information, I used tinymembench to measure the latency of DRAM access on my machine. The result was 60 ns. I then wrote a small C program to perform 32-bit reads from a 1 GB table of random numbers and use the result to increment a checksum. Pseudo code:

uint32_t checksum = 0;
for (int i = 0; i < 256 * 1024 * 1024; i++) {
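    // Each address is independent of earlier loads, so the CPU can have
    // several cache-miss loads in flight at once.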
    unsigned offset = rand32() & (TABLE_SIZE - 1);
    checksum += table_of_random_numbers[offset];
}

Each iteration of the loop took 10 ns on average. This is because the out-of-order and speculative execution features of my CPU were able to overlap about 6 iterations of this loop at a time, i.e. 10 ns = 60 ns / 6.

If instead I replaced the code with:

unsigned offset = rand32() & (TABLE_SIZE - 1);
for (int i = 0; i < 256 * 1024 * 1024; i++) {
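    // The next address depends on the data returned by the previous load,
    // so only one cache-miss load can be outstanding at a time.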
    offset = table_of_random_numbers[offset];
    offset &= (TABLE_SIZE - 1);
}

Then each iteration takes 60 ns, because the loop cannot be parallelized: the address of each access depends on the result of the previous read.

I also checked the assembly the compiler generated to make sure it hadn't done any parallelizing itself.
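For completeness, a self-contained version of the two tests looks something like this (a reconstruction rather than my exact program; the table size, iteration count and timing code are illustrative):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define TABLE_SIZE (256u * 1024 * 1024)   /* entries: 1 GB of uint32_t */
#define ITERATIONS (256u * 1024 * 1024)

static uint32_t g_seed = 12345;
static uint32_t rand32(void) { g_seed = 214013 * g_seed + 2531011; return g_seed; }

static double now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int main(void)
{
    uint32_t *table = malloc((size_t)TABLE_SIZE * sizeof *table);
    if (!table) return 1;
    for (uint32_t i = 0; i < TABLE_SIZE; i++)
        table[i] = rand32();

    /* Independent loads: addresses come from the PRNG, so several
       cache-miss loads can be in flight at once. */
    double t0 = now_ns();
    uint32_t checksum = 0;
    for (uint32_t i = 0; i < ITERATIONS; i++)
        checksum += table[rand32() & (TABLE_SIZE - 1)];
    double t1 = now_ns();

    /* Dependent loads: each address comes from the previous read, so
       only one miss can be outstanding and the full latency is exposed. */
    uint32_t offset = 0;
    for (uint32_t i = 0; i < ITERATIONS; i++)
        offset = table[offset] & (TABLE_SIZE - 1);
    double t2 = now_ns();

    printf("independent: %.1f ns/access (checksum %u)\n",
           (t1 - t0) / ITERATIONS, checksum);
    printf("dependent:   %.1f ns/access (final offset %u)\n",
           (t2 - t1) / ITERATIONS, offset);
    return 0;
}

Compiled with optimization (e.g. gcc -O2), the first loop should show roughly the 10 ns per-access figure and the second roughly the full 60 ns DRAM latency on a machine like mine.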

EDIT 2

I decided to test what happens when I run multiple tests in parallel, each as a separate process. I used the program snippet above that includes the checksum (i.e. the one that appears to see a latency of 10 ns per access). By running 6 instances in parallel, I get an average apparent latency of 13.9 ns, meaning that about 26 accesses must be occurring in parallel: (60 ns / 13.9 ns) * 6 = 25.9.

6 instances was optimal. Any more caused the overall throughput to drop.
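A runner along these lines reproduces that setup (a sketch; launching N copies from a shell loop works just as well, and "./membench" is a placeholder for the benchmark binary):

#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Start N copies of the single-threaded benchmark as separate processes
   and wait for them all; each child prints its own ns/access figure. */
int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 6;
    for (int i = 0; i < n; i++) {
        if (fork() == 0) {
            execl("./membench", "membench", (char *)NULL);
            _exit(127);               /* exec failed */
        }
    }
    for (int i = 0; i < n; i++)
        wait(NULL);
    return 0;
}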

EDIT 3 - Responding to Peter Cordes' RNG question

I tried two different random number generators.

uint32_t g_seed = 12345;
uint32_t fastrand() {
    g_seed = 214013 * g_seed + 2531011;
    return g_seed;
}

and

// *Really* minimal PCG32 code / (c) 2014 M.E. O'Neill / pcg-random.org
// Licensed under Apache License 2.0 (NO WARRANTY, etc. see website)
typedef struct { uint64_t state;  uint64_t inc; } pcg32_random_t;

uint32_t pcg32_random_r(pcg32_random_t* rng)
{
    uint64_t oldstate = rng->state;
    // Advance internal state
    rng->state = oldstate * 6364136223846793005ULL + (rng->inc|1);
    // Calculate output function (XSH RR), uses old state for max ILP
    uint32_t xorshifted = ((oldstate >> 18u) ^ oldstate) >> 27u;
    uint32_t rot = oldstate >> 59u;
    return (xorshifted >> rot) | (xorshifted << ((-rot) & 31));
}

They both performed about the same. I can't remember the exact numbers. The peak single-threaded performance I saw was with the simpler RNG, which gave me an amortised latency of 8.5 ns, implying 7 reads in parallel. The assembly for the timed loop was:

// Pseudo random number is in edx
// table is in rdi
// loop counter is in rcx
// checksum is in rax
.L8:
        imull   $214013, %edx, %edx
        addl    $2531011, %edx
        movl    %edx, %esi
        movl    %edx, g_seed(%rip)
        andl    $1073741823, %esi
        movzbl  (%rdi,%rsi), %esi
        addq    %rsi, %rax
        subq    $1, %rcx
        jne     .L8
        ret

I don't understand "g_seed(%rip)". Is that a memory access? Why would the compiler do that?

EDIT 4 - Removed global variable from random number generator

I removed the global variable from the random number generator as suggested by Peter. The generated code was indeed cleaner. I also switched to Intel syntax for the disassembly (thanks for the tip).

// Pseudo random number is in edx
// table is in rdi
// loop counter is in rcx
// checksum is in rax
.L8:
        imul    edx, edx, 214013
        add     edx, 2531011
        mov     esi, edx
        and     esi, 1073741823
        movzx   esi, BYTE PTR [rdi+rsi]
        add     rax, rsi
        sub     rcx, 1
        jne     .L8
        ret

The performance was unchanged though, in both the single-process and multi-process cases.
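For reference, the de-globalized generator looks roughly like this (a sketch of the change rather than my exact code; the seed now lives in the caller and is passed by reference, so the compiler can keep it in a register across the timed loop):

#include <stdint.h>

#define TABLE_SIZE (256u * 1024 * 1024)   /* same 1 GB table as above */

/* Seed state is owned by the caller and passed by reference; nothing escapes
   to a global, so the compiler can keep it in a register in the hot loop. */
static inline uint32_t fastrand(uint32_t *seed)
{
    *seed = 214013 * *seed + 2531011;
    return *seed;
}

static uint32_t timed_loop(const uint32_t *table, uint32_t iterations)
{
    uint32_t seed = 12345, checksum = 0;
    for (uint32_t i = 0; i < iterations; i++)
        checksum += table[fastrand(&seed) & (TABLE_SIZE - 1)];
    return checksum;
}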

Andrew Bainbridge
  • Memory is heavily pipelined. If the load addresses are not dependent on data from previous loads, even a single core can have many outstanding loads in flight at once. This is one of the major benefits of out-of-order execution. I think the main limitation for this case (requests to different cache-lines) is that a single Sandybridge-family core has 10 line-fill buffers (http://www.realworldtech.com/haswell-cpu/5/). So a single core can be waiting for up to 10 cache lines from its private L2. (Which in turn has to request them from shared L3). – Peter Cordes Jul 29 '17 at 04:50
  • Also note that Intel CPUs with more cores have higher latency between the cores and L3/memory. A quad-core desktop typically has *better* single-threaded memory bandwidth / latency than a 12-core Xeon. A quad-channel Xeon has great aggregate memory bandwidth and can keep more cores fed, but it takes more than twice as many cores to saturate its 4 channels. See also the "latency bound" [section of this answer](https://stackoverflow.com/a/43574756/224132) – Peter Cordes Jul 29 '17 at 04:55
  • Interesting result that a single core only managed about 6 in-flight cache-miss-loads on average. It's nice to have data points for both dependent and independent loads. I wonder if that's from limited concurrency between L2<->L3, or from something else? I wonder how slow your `rand32()` function is, and if it did any memory access (which would hit in cache, but that the CPU would have to speculatively reorder with the cache-miss loads). Also, if it's too slow, ROB-size limitations could prevent OOO execution from seeing that big a window of instructions. – Peter Cordes Jul 29 '17 at 05:04
  • A simple inline `xorshift+` PRNG or a simple increment with a very large stride (modulo something using `&`) might work well, and make sure that more than 10 loads could be in-flight in one core at once. – Peter Cordes Jul 29 '17 at 05:06
  • @Peter - Thanks for the links and comments. I've updated my answer to hopefully answer your questions. – Andrew Bainbridge Jul 29 '17 at 08:17
  • You're using AT&T syntax for some reason (I like to use `gcc -masm=intel` and `objdump -drwC -Mintel`), so yes, `g_seed(%rip)` is a static memory location using RIP-relative addressing (normal for x86-64). Use `static g_seed = ...` (preferably as a function-scoped variable, but file scope works too as long as it's `static` not global) so the compiler can optimize it into a register instead of keeping it around for external functions to see. – Peter Cordes Jul 29 '17 at 08:33
  • globals suck for performance as well as human-readability. Better to pass by reference. And yeah, your `fastrand()` linear congruential generator is great. It's even simpler than `xorshift`, but still plenty random enough. Even better, it doesn't repeat the same address until it's seen every other address, because the period is the same as the range. That's a feature in this case. – Peter Cordes Jul 29 '17 at 08:35
  • Wait a minute, I thought you were testing read-only loads. But `addb $1, (%rdi,%rax)` is writing the table, like `table_of_random_numbers[offset]++`. That might explain getting less than 10 outstanding loads per core, since some LFBs are busy with stores, when dirty cache lines have to be evicted. Also note that `char*` can alias anything, so if the compiler wasn't sure that the table stores weren't aliasing `g_seed`, it would have to store/reload it inside the loop. Yet another reason to make it a local and pass it to the PRNG by reference, instead of using a global (or even `static`). – Peter Cordes Jul 29 '17 at 08:39
  • "it doesn't repeat the same address until it's seen every other address". Well spotted sir. And the assembly I posted is cut and pasted from the wrong part of my notes. I'll correct this in a day or so (out of time at the moment). – Andrew Bainbridge Jul 29 '17 at 10:16
  • BTW, since you're masking the output of your LCG, that's not strictly true. If you made the modulo part of the RNG itself, and chose the multiply and add parameters accordingly, you could build an LCG with a period that matched your buffer size. I wanted that for arbitrary `n` a while ago (which requires throwing out some out-of-range values, and using a prime modulus). Here's [the code](https://github.com/pcordes/allspr/blob/master/lcg.c#L23). I got the theory from Knuth's The Art of Computer Programming. – Peter Cordes Jul 29 '17 at 10:27
  • However, since it needs to `% m` by a prime number, it's slower than just letting it wrap at 2^32. (Much slower with a run-time constant instead of a compile-time constant that allows the multiplication trick, since integer division has at least 22 cycle latency on Haswell). – Peter Cordes Jul 29 '17 at 10:28
  • Yeah as @PeterCordes kind of hinted at above, the focus on the memory controller parallelism probably doesn't give you the full picture. The path to memory is quite long and in fact the actual DRAM access is only a short part of the wall clock time. Getting all the way from the core out to the controller and back to the L1 takes most of the time. The "32 simultaneous requests" is more about giving the memory controller lots of visibility into pending requests so it can schedule them well (trying to group reads to the same page to avoid the page-open delay). – BeeOnRope Jul 31 '17 at 01:15
  • The actual parallelism at the DRAM level is limited by the fact that you only have N wires to/from the controller and DRAM, one controller will generally be sending 1 operation (or can it do 2 with dual channel? I'm not sure) at a time to DRAM. Most of the parallelism comes from the fact that the entire path up to that point can keep multiple requests in flight at the same time, and this would be true even if the memory controller was stupid (i.e., FIFO request handling). Here's a [great post](http://sites.utexas.edu/jdm4372/2011/03/10/memory-latency-components/) on memory latency components. – BeeOnRope Jul 31 '17 at 01:18
  • @BeeOnRope. I take your point. I think the amount of latency in the CPU socket is about the same as that in the DRAM subsystem (maybe 30 ns each). I think the article you linked underestimates the DRAM latency because it doesn't seem to include the Row-column delay, which is ~15 ns on my system. This diagram from the Anandtech article http://images.anandtech.com/doci/3851/Back%20to%20Back%20Burst%20Read%20with%20Page%20Close.png shows exactly what is happening in the DRAM. – Andrew Bainbridge Jul 31 '17 at 10:01
  • Andrew, you are right, I'm not sure why John omitted that (and comments are now closed) - perhaps it was because he is usually doing bandwidth measurements (where latency is often still the primary driver in a concurrency limited system) so he expected the row to already be open? In any case, I think you take the right approach by actually measuring to get a kind of "effective parallelism". – BeeOnRope Jul 31 '17 at 16:28
  • The exact answer is not a single value: you may have a concurrency of up to ~60 to L1, then 10 between L1 and L3, up to 32 in the memory controller, and then 1 or 2 or something that over the DRAM bus (half real, half made up numbers) and this may all work out to an effective concurrency of 6 or whatever under actual testing. – BeeOnRope Jul 31 '17 at 16:30