The Intel datasheets almost answer this question.
My first clue was a question on the Intel forums:
https://communities.intel.com/thread/110798
Jaehyuk.Lee, 01-Feb-2017 09:27 asked almost the same question as me:
The Second question is about simultaneous requests on IMC and its
support on brand-new CPU models such as skylake and kaby-lake
http://www.intel.com/Assets/PDF/datasheet/323341.pdf Following the
above link, "The memory controller can operate on up to 32
simultaneous requests(reads and writes)" I would like to know how many
simultaneous requests are supported in skylake and kabylake CPUs. I've
already checked the 6th and 7th generation of the Intel CPU datasheet,
but I cannot find any information.
The link is dead. But his "32" figure sounds plausible.
An Intel staff member responded, quoting from 6th Generation Intel® Processor Families for S-Platforms, Vol 1:
The memory controller has an advanced command scheduler where all
pending requests are examined simultaneously to determine the most
efficient request to be issued next. The most efficient request is
picked from all pending requests and issued to system memory
Just-in-Time to make optimal use of Command Overlapping. Thus, instead
of having all memory access requests go individually through an
arbitration mechanism forcing requests to be executed one at a time,
they can be started without interfering with the current request
allowing for concurrent issuing of requests. This allows for optimized
bandwidth and reduced latency while maintaining appropriate command
spacing to meet system memory protocol.
Annoyingly the datasheet for my Xeon E5-2670 v3 does not contain an equivalent section.
Another part of the answer is that the E5-2670 has 4 DDR channels. Memory is interleaved at a 256-byte granularity to optimize bandwidth. In other words, if you read a 1024-byte block starting at address 0, the first 256 bytes are fetched from DIMM 0, bytes 256 to 511 come from DIMM 1, and so on.
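To make the interleaving concrete, here is a tiny sketch of how an address could map to a channel under that scheme. The 256-byte stripe and round-robin order come from the description above; the function name and the simple modulo mapping are my own illustration (real controllers use more elaborate hash functions).

#include <stdint.h>
#include <stdio.h>

#define CHANNELS    4    /* the E5-2670 has 4 DDR channels      */
#define STRIPE_SIZE 256  /* bytes sent to one channel at a time */

/* Hypothetical mapping: round-robin 256-byte stripes across channels. */
static unsigned channel_for_address(uint64_t addr)
{
    return (unsigned)((addr / STRIPE_SIZE) % CHANNELS);
}

int main(void)
{
    /* A 1024-byte read starting at address 0 touches all four channels once. */
    for (uint64_t addr = 0; addr < 1024; addr += STRIPE_SIZE)
        printf("bytes %4llu-%4llu -> channel %u\n",
               (unsigned long long)addr,
               (unsigned long long)(addr + STRIPE_SIZE - 1),
               channel_for_address(addr));
    return 0;
}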
Putting the two together, I suspect that the memory controller can do 4 reads in parallel and is smart enough that if 4 or more threads are waiting on reads that map into 4 different DIMMs, it will do them in parallel. And it has enough hardware to keep up to about 32 reads/writes in its scheduling table.
I can think of another possible way to achieve parallelism. Each DDR channel has its own data and address buses. When the memory controller requests a read, it uses the address lines + some control lines to request the read and then waits for the response. For a random read, typically there are two waits - the RAS to CAS delay and CAS latency - about 15 cycles each. Instead of leaving the address lines idle, you could imagine the memory controller starting another read from a different DIMM (*) during these wait periods. I have no idea if this is done.
* Actually, according to this Anandtech article there is more parallelism in the DRAM hardware than just having multiple DIMMs per channel. Each DIMM might have multiple ranks and each rank has many banks. I think you could switch to any other rank and bank within a DIMM to perform another access in parallel.
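A rough back-of-the-envelope model of what that command overlap would buy on a single channel, using the ~15-cycle figures above. The 1-cycle command-issue slot and the rest of the model are purely illustrative assumptions of mine, not datasheet numbers.

#include <stdio.h>

int main(void)
{
    const int tRCD  = 15;  /* RAS-to-CAS delay, cycles (rough figure)           */
    const int tCL   = 15;  /* CAS latency, cycles (rough figure)                */
    const int issue = 1;   /* assumed cycles to put the next command on the bus */
    const int reads = 4;   /* independent reads to different DIMMs/ranks/banks  */

    /* Serial: each read waits out both delays before the next one starts. */
    int serial = reads * (tRCD + tCL);

    /* Overlapped: the controller issues the next read's commands while the
       earlier reads are still waiting, so only the first read pays the full
       latency and the others trail it by one issue slot each. */
    int overlapped = (tRCD + tCL) + (reads - 1) * issue;

    printf("serial: %d cycles, overlapped: %d cycles\n", serial, overlapped);
    return 0;
}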
EDIT
I measured that my machine can do at least 6 random accesses in parallel, despite only having 4 memory channels. So a single memory channel can perform 2 or more random accesses in parallel, perhaps using the scheme I described in the paragraph above.
To get this information, I used tinymembench to measure the latency of DRAM access on my machine. The result was 60 ns. I then wrote a small C program to perform 32-bit reads from a 1 GB table of random numbers and use the result to increment a checksum. Pseudo code:
uint32_t checksum = 0;
for (int i = 0; i < 256 * 1024 * 1024; i++) {
    unsigned offset = rand32() & (TABLE_SIZE - 1);
    checksum += table_of_random_numbers[offset];
}
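For reference, here is a minimal runnable version of that experiment. It is not the exact program I used, just a sketch along the same lines: the table size, iteration count and checksum trick match the pseudo code above, while the allocation, fill loop, timing code and the rand32 implementation (the same LCG shown in EDIT 3 below) are filled in by hand.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define TABLE_ELEMS (256u * 1024 * 1024)   /* 256M uint32_t entries = 1 GB */
#define ITERATIONS  (256u * 1024 * 1024)

static uint32_t g_seed = 12345;
static uint32_t rand32(void)               /* same LCG as in EDIT 3 below */
{
    g_seed = 214013 * g_seed + 2531011;
    return g_seed;
}

int main(void)
{
    uint32_t *table = malloc((size_t)TABLE_ELEMS * sizeof *table);
    if (!table) return 1;
    for (size_t i = 0; i < TABLE_ELEMS; i++)    /* fill the table with random data */
        table[i] = rand32();

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    uint32_t checksum = 0;
    for (uint32_t i = 0; i < ITERATIONS; i++) {
        unsigned offset = rand32() & (TABLE_ELEMS - 1);
        checksum += table[offset];
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("checksum %u, %.1f ns per access\n", checksum, ns / ITERATIONS);
    free(table);
    return 0;
}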
Each iteration of the loop took 10 ns on average. This is because the out-of-order and speculative execution features of my CPU were able to run about 6 iterations of this loop in parallel, i.e. 10 ns = 60 ns / 6.
If instead I replaced the code with:
unsigned offset = rand32() & (TABLE_SIZE - 1);
for (int i = 0; i < 256 * 1024 * 1024; i++) {
    offset = table_of_random_numbers[offset];
    offset &= (TABLE_SIZE - 1);
}
Then each iteration takes 60 ns, because the loop cannot be parallelized: the address of each access depends on the result of the previous read.
I also checked the assembly the compiler generated to make sure it hadn't done any parallelizing itself.
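As a sketch, the dependent-chain variant drops into the timed loop of the runnable version above; only these lines change (again my reconstruction, not the exact source):

/* Each load's address comes from the previous load, so the accesses
   cannot overlap. */
uint32_t offset = rand32() & (TABLE_ELEMS - 1);
for (uint32_t i = 0; i < ITERATIONS; i++) {
    offset = table[offset] & (TABLE_ELEMS - 1);
}
checksum += offset;   /* keep the chain live so the compiler can't delete it */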
EDIT 2
I decided to test what happens when I run multiple tests in parallel, each as a separate process. I used the program snippet above that includes the checksum (i.e. the one that appears to see a latency of 10 ns per access). By running 6 instances in parallel, I got an average apparent latency of 13.9 ns, meaning that about 26 accesses must have been occurring in parallel: (60 ns / 13.9 ns) * 6 = 25.9.
6 instances was optimal. Any more caused the overall throughput to drop.
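To launch the instances I just started several copies of the benchmark at once. A sketch of one way to do that from C, assuming the benchmark above is built as ./membench (the wrapper and the program name are mine):

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 6;   /* number of parallel instances */
    for (int i = 0; i < n; i++) {
        if (fork() == 0) {                    /* child: run one benchmark copy */
            execl("./membench", "membench", (char *)NULL);
            _exit(127);                       /* exec failed */
        }
    }
    while (wait(NULL) > 0)                    /* wait for all children */
        ;
    return 0;
}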
EDIT 3 - Responding to Peter Cordes' RNG question
I tried two different random number generators.
uint32_t g_seed = 12345;
uint32_t fastrand() {
    g_seed = 214013 * g_seed + 2531011;
    return g_seed;
}
and
// *Really* minimal PCG32 code / (c) 2014 M.E. O'Neill / pcg-random.org
// Licensed under Apache License 2.0 (NO WARRANTY, etc. see website)
typedef struct { uint64_t state; uint64_t inc; } pcg32_random_t;

uint32_t pcg32_random_r(pcg32_random_t* rng)
{
    uint64_t oldstate = rng->state;
    // Advance internal state
    rng->state = oldstate * 6364136223846793005ULL + (rng->inc|1);
    // Calculate output function (XSH RR), uses old state for max ILP
    uint32_t xorshifted = ((oldstate >> 18u) ^ oldstate) >> 27u;
    uint32_t rot = oldstate >> 59u;
    return (xorshifted >> rot) | (xorshifted << ((-rot) & 31));
}
They both performed about the same; I can't remember the exact numbers. The peak single-threaded performance I saw was with the simpler RNG, which gave me an amortised latency of 8.5 ns, implying 7 reads in parallel. The assembly for the timed loop was:
// Pseudo random number is in edx
// table is in rdi
// loop counter is in rcx
// checksum is in rax
.L8:
    imull   $214013, %edx, %edx
    addl    $2531011, %edx
    movl    %edx, %esi
    movl    %edx, g_seed(%rip)
    andl    $1073741823, %esi
    movzbl  (%rdi,%rsi), %esi
    addq    %rsi, %rax
    subq    $1, %rcx
    jne     .L8
    ret
I don't understand "g_seed(%rip)". Is that a memory access? Why would the compiler do that?
EDIT 4 - Removed global variable from random number generator
I removed the global variable from the random number generator as suggested by Peter. The generated code was indeed cleaner. I also switched to Intel syntax for the disassembly (thanks for the tip).
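The change amounted to keeping the RNG state in a local variable inside the timed loop, so it can live in a register. A sketch of what the loop looked like afterwards (reconstructed from memory, using the names from the runnable version above):

uint32_t seed = 12345;                        /* RNG state is now a local */
uint32_t checksum = 0;
for (uint32_t i = 0; i < ITERATIONS; i++) {
    seed = 214013 * seed + 2531011;           /* inlined LCG, no store to a global */
    unsigned offset = seed & (TABLE_ELEMS - 1);
    checksum += table[offset];
}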
// Pseudo random number is in edx
// table is in rdi
// loop counter is in rcx
// checksum is in rax
.L8:
    imul    edx, edx, 214013
    add     edx, 2531011
    mov     esi, edx
    and     esi, 1073741823
    movzx   esi, BYTE PTR [rdi+rsi]
    add     rax, rsi
    sub     rcx, 1
    jne     .L8
    ret
The performance was unchanged though, in both the single-process and multi-process cases.