The load and store bandwidth here is defined as the amount of data that can be transferred from one cache level to another per cycle. For example, WikiChip claims that the L2 cache has a 64 B/cycle bandwidth to the L1 cache (L1$).
This is a good answer for calculating the L3 cache's overall bandwidth, but it assumes that each request involves a 64-byte transfer. I know from WikiChip that the figure is 64 B/cycle or 32 B/cycle, but I want to verify it by measurement.
My first, naive attempt is listed below: flush the cache lines, then load them. Unsurprisingly, it failed, because it measures the time to transfer data from memory into the cache rather than between cache levels.
for (int page = 0; page < length/512; page++)
{
    asm volatile("mfence");
    for (int i = 0; i < 64; i++){
        flush(&shm[page*512+i*8]);
    }
    asm volatile("mfence");
    int temp, l1;
    l1 = page*512 + 8*0;
    for (int i = 0; i < 64; i++){
        temp = shm[l1];
        l1 += 8;
    }
}
To fix this problem, I can use eviction sets so that the data resides only in the L3 cache. However, the load-to-use latency far outweighs the transfer time on the bus. For example, the fastest load-to-use latency for the L3 cache is 42 cycles, while the L3 cache has a 32 B/cycle bandwidth to the L2 cache, so a 64-byte line occupies the bus for only 2 cycles and the bus won't become the bottleneck. This method seems impracticable.
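The latency-versus-bandwidth argument above can be made concrete with Little's law: to sustain a given bandwidth, the bytes in flight must equal latency times bandwidth. The sketch below only does that arithmetic with the figures quoted in this question; the 64-byte line size and the helper names `bytes_in_flight` and `lines_in_flight` are assumptions for illustration, and it says nothing about how many outstanding misses a particular core can actually track.

```c
#include <assert.h>

/* Bytes that must be in flight to sustain `bytes_per_cycle`
 * when each request takes `latency_cycles` to complete (Little's law). */
static unsigned bytes_in_flight(unsigned latency_cycles, unsigned bytes_per_cycle)
{
    return latency_cycles * bytes_per_cycle;
}

/* Same quantity expressed in cache lines, rounded up.
 * line_bytes is assumed to be 64 on the CPU in question. */
static unsigned lines_in_flight(unsigned latency_cycles,
                                unsigned bytes_per_cycle,
                                unsigned line_bytes)
{
    assert(line_bytes != 0); /* guard the division below */
    return (bytes_in_flight(latency_cycles, bytes_per_cycle) + line_bytes - 1)
           / line_bytes;
}
```

With 42 cycles and 32 B/cycle this gives 1344 bytes, that is, 21 lines of 64 bytes. A loop that has only one load outstanding at a time therefore measures latency, not the width of the bus.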
Then I tried AVX2, as listed below. vmovntdq uses a non-temporal hint to prevent caching of the data during the write to memory. Each instruction stores one 256-bit ymm register, i.e. 32 bytes. I also assume that the store uses the bus from L1 to L2, L2 to L3, and L3 to memory; I don't know whether this assumption is reasonable. If it is, the bandwidth can be measured approximately: the smallest bandwidth among the three equals IPC × 32 bytes.
However, the IPC is between 0.09 and 0.10, which means that the CPU executes roughly one vmovntdq every ten cycles. That is too slow to reach the bottleneck of the bus, so this attempt fails as well.
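AvxLoops:
    push rbp
    mov rbp, rsp
    mov rax, 2000               ; loop counter
    vmovaps ymm0, [rsi]         ; load 32 bytes of source data (needs 32-byte alignment)
.loop:
    vmovntdq [rdi], ymm0        ; eight 32-byte non-temporal stores = 256 bytes
    vmovntdq [rdi+32], ymm0
    vmovntdq [rdi+64], ymm0
    vmovntdq [rdi+96], ymm0
    vmovntdq [rdi+128], ymm0
    vmovntdq [rdi+160], ymm0
    vmovntdq [rdi+192], ymm0
    vmovntdq [rdi+224], ymm0
    add rdi, 32                 ; advances only 32 bytes, so successive iterations overlap
    dec rax
    cmp rax, 0
    jge .loop                   ; loops while rax >= 0, i.e. 2001 iterations
    mov rsp, rbp
    pop rbp
    ret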
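As a sanity check on converting the measured IPC into a store bandwidth, here is a minimal sketch of the arithmetic. It assumes the IPC counter reports all retired instructions, so only 8 of the 12 instructions in each loop iteration (8 stores plus add, dec, cmp, jge) are stores, and that each 256-bit store writes 32 bytes. The function name and parameters are hypothetical.

```c
#include <assert.h>

/* Store bandwidth implied by a measured instruction rate.
 * ipc:             retired instructions per cycle (all instructions)
 * stores_per_iter: store instructions in one loop iteration (8 above)
 * insns_per_iter:  total instructions in one loop iteration (12 above)
 * bytes_per_store: bytes written per store, 32 for a ymm register */
static double store_bytes_per_cycle(double ipc,
                                    unsigned stores_per_iter,
                                    unsigned insns_per_iter,
                                    unsigned bytes_per_store)
{
    assert(insns_per_iter != 0);
    assert(stores_per_iter <= insns_per_iter);
    double stores_per_cycle = ipc * stores_per_iter / insns_per_iter;
    return stores_per_cycle * bytes_per_store;
}
```

Under these assumptions an IPC of 0.10 corresponds to roughly 2.1 B/cycle, and even if every retired instruction were a store it would only be 3.2 B/cycle, well below the 32 B/cycle or 64 B/cycle figures quoted from WikiChip.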
Any good ideas? How can I measure the load and store bandwidth of the cache, counting only the transfer on the bus between cache levels?