
The load and store bandwidth here is defined as the amount of data that can be transferred from one cache level to another per cycle. For example, WikiChip claims that the L2 cache has a 64 B/cycle bandwidth to L1$.

This is a good answer for calculating the L3 cache's overall bandwidth, but it assumes that each request involves a 64-byte transfer. I know the figures are 64 B/cycle or 32 B/cycle from WikiChip, but I want to verify them myself.

The first trivial attempt I made is listed below: first flush the cache lines, then try to load them back. Obviously it failed, because it measures the time to transfer data from memory into the cache, not between cache levels.

for (int page = 0; page < length/512; page++)
{
    asm volatile("mfence");
    /* flush one address per cache line: a stride of 8 elements is one
       64-byte line for 8-byte elements, and 512 elements is one 4 KiB page */
    for (int i = 0; i < 64; i++){
        flush(&shm[page*512 + i*8]);
    }
    asm volatile("mfence");

    /* reload the same 64 lines; volatile keeps the otherwise-dead loads
       from being optimized away */
    volatile int temp;
    int l1 = page*512;
    for (int i = 0; i < 64; i++){
        temp = shm[l1];
        l1 += 8;
    }
}
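(The timing code is omitted above; it is the usual serialized rdtsc pattern around the reload loop, roughly like this sketch. The helper name is my own, and the TSC counts reference cycles, not core cycles:)

#include <stdint.h>
#include <x86intrin.h>

/* Read the TSC with lfence on both sides so earlier loads complete before
   the read and later loads don't start early. */
static inline uint64_t rdtsc_fenced(void)
{
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

/* usage around the reload loop:
     uint64_t t0 = rdtsc_fenced();
     ...the 64 loads...
     uint64_t t1 = rdtsc_fenced();
     total_cycles += t1 - t0;
*/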

To fix this problem, I could use eviction sets, which force the data to reside only in the L3 cache. However, the load-to-use latency far exceeds the time spent transferring data on the bus. For example, the fastest load-to-use latency for the L3 cache is 42 cycles, while the L3 cache has a 32 B/cycle bandwidth to the L2 cache, so the bus never becomes the bottleneck. This method seems impractical.
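To make that concrete, here is the back-of-the-envelope bound I have in mind (a sketch: the ~12 L1d fill buffers and the 42-cycle L3 load-to-use latency are assumed Skylake-client figures, not measurements):

#include <stdio.h>

/* Upper bound on single-core demand-load bandwidth from L3 when the core is
   latency-bound: at most lines_in_flight full 64-byte lines can be in flight
   per load-to-use latency period. */
static double latency_bound_bw(int lines_in_flight, int load_to_use_cycles)
{
    return lines_in_flight * 64.0 / load_to_use_cycles;
}

int main(void)
{
    printf("%.1f B/cycle\n", latency_bound_bw(12, 42)); /* ~18.3 B/cycle, below 32 */
    return 0;
}

Even with every fill buffer in flight, demand loads from L3 top out well below the claimed 32 B/cycle, so this experiment would only measure the core's memory-level parallelism, not the bus width.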

Then I tried the AVX2 loop listed below. The vmovntdq instruction uses a non-temporal hint to prevent caching of the data during the write to memory. Every instruction stores 256 bytes. Besides, I assume that it will use the bus from L1 to L2, L2 to L3, and L3 to memory. I don't know if this assumption is reasonable. If it is, we can measure the bandwidth approximately: the smallest bandwidth among the three equals IPC × 256 bytes.

However, the measured IPC is only 0.09 to 0.10, which means the CPU executes roughly one vmovntdq every ten cycles. That is nowhere near saturating the bus. This attempt fails as well.

; AvxLoops(dst /*rdi*/, src /*rsi*/): fill memory at dst with non-temporal
; 32-byte stores of the pattern loaded from src.
; Both pointers must be 32-byte aligned (vmovaps and vmovntdq fault otherwise).
AvxLoops:
    push    rbp
    mov     rbp, rsp
    mov     rax, 2000             ; iteration count
    vmovaps ymm0, [rsi]           ; 32-byte store pattern
.loop:
    vmovntdq [rdi], ymm0          ; eight NT stores of 32 bytes each,
    vmovntdq [rdi+32], ymm0       ; i.e. four full 64-byte lines per iteration
    vmovntdq [rdi+64], ymm0
    vmovntdq [rdi+96], ymm0
    vmovntdq [rdi+128], ymm0
    vmovntdq [rdi+160], ymm0
    vmovntdq [rdi+192], ymm0
    vmovntdq [rdi+224], ymm0
    add     rdi, 256              ; advance to the next 256-byte block
    dec     rax                   ; dec already sets the flags
    jge     .loop
    mov     rsp, rbp
    pop     rbp
    ret
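A minimal driver for measuring the IPC of this loop can look like the following sketch (the buffer size, the repeat count, and getting the IPC from `perf stat -e instructions,cycles` are my own choices, not part of anything prescribed):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Defined in the assembly above: writes roughly 2000 * 256 bytes starting at
   dst with non-temporal stores, loading the 32-byte pattern from src.
   Both pointers must be 32-byte aligned. */
extern void AvxLoops(void *dst, const void *src);

int main(void)
{
    void *dst = aligned_alloc(64, 1 << 20);  /* 1 MiB > the ~512 KiB written */
    void *src = aligned_alloc(64, 64);
    if (!dst || !src)
        return 1;
    memset(src, 1, 64);

    for (int rep = 0; rep < 100000; rep++)   /* repeat so perf gets stable numbers */
        AvxLoops(dst, src);

    /* run:  perf stat -e instructions,cycles ./a.out   and divide */
    free(dst);
    free(src);
    return 0;
}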

Any good ideas? How can I measure the load and store bandwidth of the bus itself between cache levels?

  • Wikichip is mis-stating that. The data-path between L1d and L2 is 64 bytes wide since SKL, but there isn't enough memory-level parallelism to keep the pipeline filled, or something like that. Similar to the effects that limit single-core bandwidth to quite a low [value on big Xeon chips](https://stackoverflow.com/questions/39260020/why-is-skylake-so-much-better-than-broadwell-e-for-single-threaded-memory-throug), especially Skylake-Xeon, lower than Broadwell-Xeon. See also the "latency-bound platforms" section of [Enhanced REP MOVSB for memcpy](https://stackoverflow.com/a/43574756). – Peter Cordes Nov 11 '22 at 07:21
  • *Every instruction stores 256 bytes. Besides, I assume that it will use the bus from L1 to L2, L2 to L3, and L3 to memory*: No, and no. Each instruction stores 256 **bits**, aka 32 bytes. And it only makes an off-core transaction when a write-combining buffer (LFB) completes a full 64 bytes, or if a partial NT store gets evicted to reclaim the LFB for something else. So you get full-line writes (with a no-RFO protocol) when you do contiguous NT stores, the use-case they're designed for. – Peter Cordes Nov 11 '22 at 07:27
  • (Related, [Does L1 cache accept new incoming requests while its Line Fill Buffers (LFBs) are fully exhausted?](https://stackoverflow.com/q/72201697) has an answer where I played around with partial-line NT stores and `perf`. Might be useful for understanding what you're trying to do.) But I don't know how to do anything that proves the data-path between L1d and L2 is 64 bytes in Skylake. Perhaps something with write-backs and loads could end up doing more total cache-line transfers than would be possible with a 32-byte bus? But IDK how to prove the bus is full or half duplex. – Peter Cordes Nov 11 '22 at 07:31
  • I forgot to mention, I wondered if software-prefetch could maybe trigger some transfers between levels of cache without being as limited by the CPU core's max parallelism in waiting for incoming / outgoing cache lines (number of LFBs). Of course SW prefetches are just advisory, not mandatory, so you'd need a perf-counter way of measuring how many transfers per cycle were actually happening. – Peter Cordes Nov 12 '22 at 02:28
  • @PeterCordes Thanks for your valuable comments. Now I understand why NT stores don't work. As for prefetch, I have considered it before. But I found it difficult to measure transfers between levels of cache because it is hard to keep data all in L2 and none in L1. (In fact, the first example in the question description will trigger some hardware prefetches.) – moep0 Nov 12 '22 at 02:48
  • HW prefetch can be disabled via MSRs; sometimes useful for performance experiments. (And very occasionally, for specific workloads, to maybe selectively disable some of the prefetchers. I don't remember a specific example, though.) But yeah, that's the problem, activity of a core is what triggers things in lower levels. I guess SW `prefetcht1` could pull data only as far as L2, perhaps triggering dirty writeback of data from L2 to make room? But the messages to tell cache to do that still have to get there from the core. Another core reading data could also trigger L2 dirty write-back to L3 – Peter Cordes Nov 12 '22 at 02:54
  • I've read that Skylake runs x264 (video encoding) faster than Haswell in part because of the better bandwidth between L2 and L1d. Video encoders are careful about locality, and they write lots of temporary data as well as read. So it makes some sense that if there's anything that would benefit, it could be that kind of workload, with lots of read+write that misses in L1d and hits in L2. IDK if the sustained bandwidth ever gets above 32 bytes per cycle, but articles on tomshardware or anandtech or wherever I read that might well be correct about that being part of the reason. – Peter Cordes Nov 12 '22 at 02:58
  • @PeterCordes Thus I think hardware prefetches might be a better way to measure the bandwidth, because they don't require executing an instruction like `prefetcht1`. However, the Intel manual suggests that data prefetch only happens when 'The bus is not very busy'. Is there any method to measure the degree of busyness? – moep0 Nov 12 '22 at 03:14
  • There are counters for events like `l2_lines_out.non_silent` and `l2_lines_in.all` (but the latter might include lines coming in from L1, I don't know). So if you knew the bus width, those counters could tell you the average utilization... Unfortunately not super helpful if that's what you aim to measure. Or maybe *`offcore_requests.all_requests` - [Any memory transaction that reached the SQ]* (The SuperQueue tracks requests from L2 or the whole core, in the same way that LFBs track requests from L1d/execution units.) – Peter Cordes Nov 12 '22 at 03:18
  • @PeterCordes Sorry that I have spent some time on the links you provided. It seems SW prefetch still needs LFB according to [this question](https://stackoverflow.com/questions/19472036/does-software-prefetching-allocate-a-line-fill-buffer-lfb). Is there any other method to occupy the bus in a short period of time? – moep0 Nov 17 '22 at 02:45
  • Not that I know of. I'd hoped that `prefetcht1` or `t2` could "hand off" the prefetch to an outer level and free up the LFB sooner, but BeeOnRope's testing you linked doesn't seem to support that. :/ Good find, I'd forgotten about that Q&A. – Peter Cordes Nov 17 '22 at 02:51

0 Answers