Dear Stackoverflow Community,
I am trying to understand how the performance limits for DRAM access are calculated, but my benchmarks do not come close to the numbers found in the specs. One would not expect to reach a theoretical limit, of course, but there might be an explanation for why it is so far off.
For example, I measure around 11 GB/s for DRAM access on my system, while WikiChip and the JEDEC spec list the peak performance of a dual-channel DDR4-2400 system at 38.4 GB/s.
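If I read the spec right, that peak figure is simply 2 channels × 8 bytes per transfer × 2400 MT/s = 38.4 GB/s.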
Is my measurement flawed or are these just not the right numbers to calculate peak memory performance?
The Measurement
On my system with a Core i7-8550U at 1.8 GHz (Kaby Lake microarchitecture), lshw shows two memory entries:
*-memory
...
*-bank:0
...
slot: ChannelA-DIMM0
width: 64 bits
clock: 2400MHz (0.4ns)
*-bank:1
...
slot: ChannelB-DIMM0
width: 64 bits
clock: 2400MHz (0.4ns)
so these two modules should run in "dual channel" mode (is that automatically the case?).
I set up the system to reduce measurement noise (a rough sketch of the corresponding commands is given after this list) by
- disabling frequency scaling
- disabling Address Space Layout Randomization (ASLR)
- setting the scaling_governor to performance
- using cpuset to isolate the benchmark on its own core
- setting a niceness of -20
- using a headless system with a minimal number of processes running
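This is roughly what those steps look like as commands; it is only a sketch, the core number is arbitrary, and the exact tools and sysfs paths (cpupower, cset from the cpuset package, the intel_pstate no_turbo knob) may differ between distributions:
# disable ASLR (as root)
echo 0 > /proc/sys/kernel/randomize_va_space
# disable turbo / frequency scaling (intel_pstate driver)
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
# set the performance governor on all cores
cpupower frequency-set -g performance
# shield core 3 for the benchmark, then run it there with highest priority
cset shield --cpu 3 --kthread on
cset shield --exec -- nice -n -20 pmbw -f ScanWrite256PtrUnrollLoop -p 1 -P 1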
Then I started out with the ScanWrite256PtrUnrollLoop benchmark of the pmbw (Parallel Memory Bandwidth Benchmark / Measurement) program:
pmbw -f ScanWrite256PtrUnrollLoop -p 1 -P 1
The inner loop can be examined with
gdb -batch -ex "disassemble/rs ScanWrite256PtrUnrollLoop" `which pmbw` | c++filt
It seems that this benchmark creates a "stream" of vmovdqa (Move Aligned Packed Integer Values) 256-bit AVX instructions to saturate the CPU's memory subsystem:
<+44>: vmovdqa %ymm0,(%rax)
vmovdqa %ymm0,0x20(%rax)
vmovdqa %ymm0,0x40(%rax)
vmovdqa %ymm0,0x60(%rax)
vmovdqa %ymm0,0x80(%rax)
vmovdqa %ymm0,0xa0(%rax)
vmovdqa %ymm0,0xc0(%rax)
vmovdqa %ymm0,0xe0(%rax)
vmovdqa %ymm0,0x100(%rax)
vmovdqa %ymm0,0x120(%rax)
vmovdqa %ymm0,0x140(%rax)
vmovdqa %ymm0,0x160(%rax)
vmovdqa %ymm0,0x180(%rax)
vmovdqa %ymm0,0x1a0(%rax)
vmovdqa %ymm0,0x1c0(%rax)
vmovdqa %ymm0,0x1e0(%rax)
add $0x200,%rax
cmp %rsi,%rax
jb 0x37dc <ScanWrite256PtrUnrollLoop(char*, unsigned long, unsigned long)+44>
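Each iteration of this loop therefore stores 16 × 32 B = 512 B = 0x200 B, which matches the pointer increment.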
As a similar benchmark in Julia, I came up with the following:
const C = NTuple{K,VecElement{Float64}} where K  # K-wide SIMD vector of Float64

@inline function Base.fill!(dst::Vector{C{K}}, x::C{K}, ::Val{NT} = Val(8)) where {NT,K}
    NB = div(length(dst), NT)          # number of unrolled blocks
    k = 0
    @inbounds for i in Base.OneTo(NB)
        @simd for j in Base.OneTo(NT)  # NT stores per unrolled block
            dst[k += 1] = x
        end
    end
end
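A minimal driver for measuring the bandwidth looks roughly like this (only a sketch: the 1 GiB buffer size is my arbitrary choice, it just has to be far larger than the last-level cache, and the first call is a warm-up to exclude compilation time):
dst = Vector{C{4}}(undef, 2^25)        # 2^25 * 32 B = 1 GiB, well beyond the LLC
x   = ntuple(_ -> VecElement(1.0), 4)  # one 32-byte value to stream out
fill!(dst, x, Val(16))                 # warm-up / compilation
t = @elapsed fill!(dst, x, Val(16))
println(sizeof(dst) / t / 1e9, " GB/s")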
When investigating the inner loop of this fill! function with
code_native(fill!, (Vector{C{4}}, C{4}, Val{16}), debuginfo=:none)
we can see that it also creates a similar "stream" of vmovups (Move Unaligned Packed Single-Precision Floating-Point Values) instructions:
L32:
vmovups %ymm0, -480(%rcx)
vmovups %ymm0, -448(%rcx)
vmovups %ymm0, -416(%rcx)
vmovups %ymm0, -384(%rcx)
vmovups %ymm0, -352(%rcx)
vmovups %ymm0, -320(%rcx)
vmovups %ymm0, -288(%rcx)
vmovups %ymm0, -256(%rcx)
vmovups %ymm0, -224(%rcx)
vmovups %ymm0, -192(%rcx)
vmovups %ymm0, -160(%rcx)
vmovups %ymm0, -128(%rcx)
vmovups %ymm0, -96(%rcx)
vmovups %ymm0, -64(%rcx)
vmovups %ymm0, -32(%rcx)
vmovups %ymm0, (%rcx)
leaq 1(%rdx), %rsi
addq $512, %rcx
cmpq %rax, %rdx
movq %rsi, %rdx
jne L32
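This loop likewise stores 512 B per iteration (the addq $512, %rcx), and as far as I understand vmovups performs the same as vmovdqa when the data is actually 32-byte aligned, so the two inner loops should be equivalent.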
Now, all these benchmarks show the distinct performance plateaus for the three cache levels and for main memory, but interestingly, they are all bound to around 11 GB/s for the larger test sizes. Using multiple threads and (re)activating frequency scaling (which doubles the CPU's frequency) has an impact on the smaller test sizes, but does not really change these findings for the larger ones.