
I am benchmarking the following copy function (not fancy!) with a high size argument (~1GB):

void copy(unsigned char* dst, unsigned char* src, int count)
{
    for (int i = 0; i < count; ++i)
    { 
         dst[i] = src[i];
    }
}

I built this code with GCC 6.2, with -O3 -march=native -mtune=native, on a Xeon E5-2697 v2.

So you can see what gcc generates on my machine, here is the assembly of the inner loop:

movzx ecx, byte ptr [rsi+rax*1]
mov byte ptr [rdi+rax*1], cl
add rax, 0x1
cmp rdx, rax
jnz 0xffffffffffffffea

Now, as my LLC is ~25MB and I am copying ~1GB, it makes sense that this code is memory bound. perf confirms this with a high number of stalled frontend cycles:

        6914888857      cycles                    #    2,994 GHz                    
        4846745064      stalled-cycles-frontend   #   70,09% frontend cycles idle   
   <not supported>      stalled-cycles-backend   
        8025064266      instructions              #    1,16  insns per cycle        
                                                  #    0,60  stalled cycles per insn

My first question is about 0.60 stalled cycles per instruction. This seems like a very low number for code that accesses LLC/DRAM all the time, as the data is not cached. With LLC latency around 30 cycles and main memory around 100 cycles, how is this achieved?

My second question is related; it seems that the prefetcher is doing a relatively good job (not surprising, it's an array, but still): we hit the LLC instead of DRAM 60% of the time. Still, why does it fail the other times? Which bandwidth/part of the uncore makes the prefetcher fail to accomplish its task?

          83788617      LLC-loads                                                    [50,03%]
          50635539      LLC-load-misses           #   60,43% of all LL-cache hits    [50,04%]
          27288251      LLC-prefetches                                               [49,99%]
          24735951      LLC-prefetch-misses                                          [49,97%]

Last, but not least: I know that Intel can pipeline instructions; is it also the case for such mov with memory operands?

Thank you very much!

AntiClimacus
    For question one: memory latency plays a lessened role when you're writing/reading that much sequential data; the throughput of the RAM is what's important. – IGarFieldI Jan 15 '19 at 00:22
  • I think you need restrict for vectorization to be able to kick in (but with restrict, my gcc recognizes this as memcpy and generates a jump to that function). The prefetch misses are probably due to crossing page boundaries. – Petr Skocik Jan 15 '19 at 00:44
  • @PSkocik: I am very surprised that gcc would recognize this function as equivalent to `memcpy` since the `count` argument has type `int` instead of `size_t`. It should at least generate a test such as `if (count <= 0) return;` or some other code to account for negative size values. – chqrlie Jan 15 '19 at 03:32
  • @IGarFieldI latency does not matter if you can pipeline all your instructions; although in this case on Intel Sandy Bridge there is a 14-stage pipeline, and even assuming everything has been prefetched into the L3 cache, an L3 hit still costs around 30 cycles... well I guess this might be the answer actually, it's not too bad, especially as 30 cycles brings in a whole cache line, which in this case is 64 iterations of the loop. Thanks! – AntiClimacus Jan 15 '19 at 07:59
  • @chqrlie: yes, of course compilers have to emit a test/branch in case `count` is signed negative, and on 64-bit ISAs, zero-extend it to `size_t`. PSKocik just meant that they don't actually auto-vectorize the loop, they use memcpy for the real work instead. https://godbolt.org/z/NGJi1e has gcc and clang output for x86-64, ARM, AArch64 and MIPS. – Peter Cordes Jan 16 '19 at 13:48
  • @AntiClimacus: how did you allocate your src and dst? If they've never been touched, all src pages will be copy-on-write mapped to the same physical zero page. And whether they're static or dynamic can maybe affect whether the OS uses transparent hugepages. – Peter Cordes Jan 16 '19 at 15:30
  • related: [Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?](https://stackoverflow.com/q/39260020): higher uncore latency hurts single-core memory bandwidth on many-core Xeons. I wouldn't have thought it would matter in a loop this slow (front-end bottleneck of 1 byte per 2 clocks), but your loop is 2x slower than *that*. Still, HW prefetching is doing pretty well, and only ~83M out of 1605M = 8025M/5 of your loads even reach LLC. That's a ~5% miss rate from L1d+L2 combined. – Peter Cordes Jan 16 '19 at 15:37
  • BTW, from the same asm loop on desktop Skylake (with both 1G buffers in the BSS so the src gets TLB misses but L1d cache hits from being COW mapped to the zero page), I get about twice the throughput, with an L1d miss rate of only 4%. (Skylake doesn't un-laminate indexed stores, so the loop's only 4 uops and can issue from the front end at 1 iter per clock, vs. 2 on ivB. [Micro fusion and addressing modes](https://stackoverflow.com/q/26046634) and [Is performance reduced when executing loops whose uop count is not a multiple of 4?](https://stackoverflow.com/a/53148682)) – Peter Cordes Jan 16 '19 at 15:39
  • @PeterCordes they are statically allocated and I have a loop that initializes every single element to avoid any CoW during the actual measurement. Could you share more on transparent hugepages and how this differs between dynamic and static allocation? – AntiClimacus Jan 16 '19 at 15:58
  • static data in the BSS can use transparent hugepages, but not `.data` or `.rodata` (non-zero-initialized static data). Those are file-backed (private) mappings, but IIRC transparent hugepages only works for anonymous mappings. You can always check by looking at /proc/<pid>/smaps, and looking for the AnonHuge number in the relevant mapping. (I usually control-Z a microbench while it's running, or stop it at a breakpoint.) You may want to enable always_defrag or something. – Peter Cordes Jan 16 '19 at 16:28

2 Answers


TL;DR: There are a total of 5 uops in the unfused domain (See: Micro fusion and addressing modes). The loop stream detector on Ivy Bridge cannot allocate uops across the loop body boundaries (See: Is performance reduced when executing loops whose uop count is not a multiple of processor width?), so it takes two cycles to allocate one iteration. The loop actually runs at 2.3c/iter on dual socket Xeon E5-2680 v2 (10 cores per socket vs. your 12), so that is close to the best that can be done given the front-end bottleneck.

The prefetchers have performed very well, and most of the time the loop is not memory-bound. Copying 1 byte per 2 cycles is very slow. (gcc did a poor job, and should have given you a loop that could run at 1 iteration per clock. Without profile-guided optimization, even -O3 doesn't enable -funroll-loops, but there are tricks it could have used (like counting a negative index up toward zero, or indexing the load relative to the store and incrementing a destination pointer) that would have brought the loop down to 4 uops.)
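
As a rough sketch of one such 4-uop variant (illustrative only, not what gcc actually emitted): point rsi/rdi one-past-the-end of the buffers and count a negative index up toward zero, so the separate cmp disappears and the add/jnz pair macro-fuses:

; sketch only: rsi/rdi = one-past-the-end of src/dst, rax = -count
.loop:
    movzx ecx, byte [rsi+rax]    ; load: 1 fused-domain uop
    mov byte [rdi+rax], cl       ; indexed store: un-laminates to 2 uops on SnB/IvB
    add rax, 1                   ; macro-fuses with jnz: 1 uop
    jnz .loop                    ; 4 fused-domain uops total, so the front end can issue 1 iteration per clock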

The extra ~0.3 cycles per iteration on average, beyond the 2 c/iter front-end bottleneck, probably comes from stalls when prefetching fails (maybe at page boundaries), or maybe from page faults and TLB misses in this test that runs over statically-initialized memory in the .data section.


There are two data dependencies in the loop. First, the store instruction (specifically the STD uop) depends on the result of the load instruction. Second, both the store and load instructions depend on add rax, 0x1. In fact, add rax, 0x1 depends on itself as well. Since the latency of add rax, 0x1 is one cycle, an upper bound on performance of the loop is 1 cycle per iteration.

Since the store (STD) depends on the load, it cannot be dispatched from the RS until the load completes, which takes at least 4 cycles (in case of an L1 hit). In addition, there is only one port that can accept STD uops, yet up to two loads can complete per cycle on Ivy Bridge (especially when the two loads are to lines that are resident in the L1 cache and no bank conflict occurs), resulting in additional contention. However, RESOURCE_STALLS.ANY shows that the RS actually never gets full. IDQ_UOPS_NOT_DELIVERED.CORE counts the number of issue slots that were not utilized. This is equal to 36% of all slots. The LSD.CYCLES_ACTIVE event shows that the LSD is used to deliver uops most of the time. However, LSD.CYCLES_4_UOPS/LSD.CYCLES_ACTIVE =~ 50% shows that in about 50% of the cycles, fewer than 4 uops are delivered to the RS. The RS will not get full because of the sub-optimal allocation throughput.

The stalled-cycles-frontend count corresponds to UOPS_ISSUED.STALL_CYCLES, which counts allocation stalls due to both frontend stalls and backend stalls. I don't understand how UOPS_ISSUED.STALL_CYCLES is related to the number of cycles and other events.

The LLC-loads count includes:

  • All demand load requests to the L3 irrespective of whether the request hits or misses in the L3 and, in case of a miss, irrespective of the source of data. This also includes demand load requests from the page walking hardware. It's not clear to me whether load requests from the next-page prefetcher are counted.
  • All hardware prefetch data read requests generated by an L2 prefetcher where the target line is to be placed in the L3 (i.e., in the L3 or both in the L3 and L2, but not only in the L2). Hardware L2 prefetcher data read requests where the line is to be placed only in the L2 are not included. Note that the L1 prefetchers' requests go to the L2 and influence and may trigger the L2 prefetchers, i.e., they don't skip the L2.

LLC-load-misses is a subset of LLC-loads and includes only those events that missed in the L3. Both are counted per core.

There is an important difference between counting requests (cache-line granularity) and counting load instructions or load uops (using MEM_LOAD_UOPS_RETIRED.*). Both the L1 and L2 caches squash load requests to the same cache line, so multiple misses in the L1 may result in a single request to the L3.

Optimal performance can be achieved if all stores and loads hit in the L1 cache. Since the buffer size you used is 1GB, the loop can cause a maximum of 1GB/64 =~ 17M L3 demand load requests. However, your LLC-loads measurement, 83M, is much larger, probably due to code other than the loop you've shown in the question. Another possible reason is that you forgot to use the :u suffix to count only user-mode events.
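
For example, a command along these lines restricts counting to user mode (a sketch; ./copybench is just a placeholder for whatever binary was actually measured):

    perf stat -e cycles:u,instructions:u,LLC-loads:u,LLC-load-misses:u ./copybench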

My measurements on both IvB and HSW show that LLC-loads:u is negligible compared to 17M. However, most of the L3 loads are misses (i.e., LLC-loads:u =~ LLC-loads-misses:u). CYCLE_ACTIVITY.STALLS_LDM_PENDING shows that the overall impact of loads on performance is negligible. In addition, my measurements show that the loop runs at 2.3c/iter on IvB (vs. 1.5c/iter on HSW), which suggests that one load is issued every 2 cycles. I think that the sub-optimal allocation throughput is the main reason for this. Note that 4K aliasing conditions (LD_BLOCKS_PARTIAL.ADDRESS_ALIAS) are almost non-existent. All of this means that the prefetchers have done a pretty good job at hiding the memory access latency for most loads.


Counters on IvB that can be used to evaluate the performance of hardware prefetchers:

Your processor has two L1 data prefetchers and two L2 data prefetchers (one of them can prefetch both into the L2 and/or the L3). A prefetcher may not be effective for the following reasons:

  • A triggering condition has not been satisfied. This is typically because an access pattern has not been recognized yet.
  • The prefetcher has been triggered but the prefetch was to a useless line.
  • The prefetcher has been triggered to a useful line but the line got replaced before being used.
  • The prefetcher has been triggered to a useful line but the demand requests have already reached the cache and missed. This means that the demand requests were issued faster than the ability of the prefetcher to react in a timely manner. This can happen in your case.
  • The prefetcher has been triggered to a useful line (that doesn't exist in the cache), but the request had to be dropped because no MSHR was available to hold the request. This can happen in your case.

The number of demand misses at the L1, L2, and L3 are good indicators of how well the prefetchers have performed. All the L3 misses (as counted by LLC-load-misses) are also necessarily L2 misses, so the number of L2 misses is larger than LLC-load-misses. Also all of the demand L2 misses are necessarily L1 misses.

On Ivy Bridge, you can use the LOAD_HIT_PRE.HW_PF and CYCLE_ACTIVITY.CYCLES_* performance events (in addition to the miss events) to know more about how the prefetchers have performed and evaluate their impact on performance. It's important to measure CYCLE_ACTIVITY.CYCLES_* events because even if the miss counts were seemingly high, that doesn't necessarily mean that misses are the main cause of performance degradation.

Note that the L1 prefetchers cannot issue speculative RFO requests. Therefore, most writes that reach the L1 will actually miss, requiring the allocation of an LFB per cache line at the L1 and potentially at other levels.


The code I used follows.

BITS 64
DEFAULT REL

section .data
bufdest:    times COUNT db 1 
bufsrc:     times COUNT db 1

section .text
global _start
_start:
    lea rdi, [bufdest]
    lea rsi, [bufsrc]

    mov rdx, COUNT
    mov rax, 0

.loop:
    movzx ecx, byte [rsi+rax*1]
    mov byte [rdi+rax*1], cl
    add rax, 1
    cmp rdx, rax
    jnz .loop

    xor edi,edi
    mov eax,231
    syscall
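
A possible way to assemble and run it (assuming the file is saved as copy.asm; the 1 GiB COUNT here just matches the question):

    nasm -f elf64 -dCOUNT=1073741824 -o copy.o copy.asm
    ld -o copy copy.o
    perf stat ./copy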
Hadi Brais
  • If you look at **83M LLC-loads total** vs. 6914M cycles, or 8025M instructions (or that / 5 = **1605M loads total**), the vast majority of loads hit in L1d or L2. Of those that get past L2, 60% also miss in L3. IDK if I'd say there are "many" L3 misses. But the loop is running surprisingly slowly, at about 1 store per 4.3 clocks. Which is garbage even compared to the 1 iteration per 2 clock front-end bottleneck (un-lamination of the indexed store makes it 5 uops, and it's IvB where the loop-branch can't issue along with uops from the next iteration.) – Peter Cordes Jan 16 '19 at 14:26
  • I thought IvB's next-page prefetcher would do a much better job here. (It's new in IvB. [When should we use prefetch?](https://stackoverflow.com/a/20758769) mentions this. I don't know if it's based on virtual or physical.) I wonder how much of the total time is in actual page faults here (depending on kernel version / config it might not fault in and map very many pages of destination memory), and/or TLB misses? But with 4096 iterations per page, you'd hope it would amortize better than with an efficient copy. – Peter Cordes Jan 16 '19 at 14:31
  • @PeterCordes I'm not able to reproduce the OP's results on Ivy Bridge. I've allocated the two buffers in the `.data` section (initialized all bytes to 1). In this case, the code runs at 2.3c/iter, which I think is good. Also `CYCLE_ACTIVITY.STALLS_LDM_PENDING` is only 3% of total cycles. That said, we don't know what the OP has actually measured. Dividing 8025M by 5 may not be an accurate estimate of the performance per iteration because the OP might have measured other code not shown in the question. Anyway, I do think that most loads hit in the L1. – Hadi Brais Jan 16 '19 at 16:53
  • @PeterCordes The OP mentioned that the buffer size is 1GB, so there are about a billion 1-byte load and a billion 1-byte store instructions. However, at L1, if there are multiple misses to the same line, they get squashed into a single request to the L2. Yes, the number of LLC load misses is 83M but this is not out of 1 billion (or 1605M), it should be compared to the total number of lines loaded from, which is 1GB/64=~15M. But it doesn't make sense that 83M > 15M. BTW, on my IvB system, 'perf' says that all `MEM_UOPS_RETIRED.*` events are not supported, not sure why. – Hadi Brais Jan 16 '19 at 17:30
  • On HSW, the total number of L3 loads is negligible. – Hadi Brais Jan 16 '19 at 17:31
  • Oh right, good point about the math, I'm off by a factor of 64 because of multiple misses to the same line. :P Re: total number of instructions: [OP says they're initializing the arrays](https://stackoverflow.com/questions/54190958/throughput-analysis-on-memory-copy-benchmark/54206969?noredirect=1#comment95267424_54190958), so maybe that's where some of the total 8025M instructions come from. I was using that because I thought maybe they were re-running the copy in a loop. But I think we just need to know more about the code that feeds that loop. – Peter Cordes Jan 16 '19 at 17:37
  • @PeterCordes BTW, the `perf` `LLC-*` events are mapped to `OFFCORE_RESPONSE.*` events. These events require programming certain MSR registers to be used. Looking at the [source code](https://elixir.bootlin.com/linux/v4.16.3/source/arch/x86/events/intel/core.c#L577). I'm not sure that `LLC-loads` counts only demand loads, and not prefetch loads. My understanding is that these events count requests, unlike `MEM_UOPS_RETIRED.*`, which count uops. On my IvB, `LLC-loads` is about 50% higher than the number of lines loaded and 62% of them are `LLC-load-misses`. – Hadi Brais Jan 16 '19 at 17:58
  • Hmm, I just had an idea for using `perf stat` while still touching memory first to get page faults and COW out of the way: you can init the memory in a separate process, then use some kind of shared-memory technique to have the new process map it or inherit the mapping across a `clone` system call. Or I guess maybe you'd just want to fork+exec a `perf` command that attaches to the current process, because you still need to start `perf` after init. So not actually a new idea, just that the OP should definitely do that here. – Peter Cordes Jan 16 '19 at 23:58
  • @PeterCordes Yea I think doing that may significantly reduce the number of page faults. BTW, I was wrong earlier about the RS getting full. I've edited the answer to correct it. – Hadi Brais Jan 17 '19 at 00:06
  • IDK if the OP tested something other than you did, or if a 12-core Xeon with higher uncore latency made a huge difference (and probably 2 sockets with NUMA, otherwise they'd probably have an E3 Xeon instead of E5), but yeah, coming close to the front-end bottleneck makes a *lot* more sense. More likely the OP's numbers included a lot of init work that threw off my cycles/iter calculation, and in the actual copy loop they're also getting close to 2c per iter for this case where HW prefetch should have the easiest time possible keeping up. – Peter Cordes Jan 17 '19 at 00:36
  • Your IvB test was with a desktop or laptop, I'm guessing? Or did you also have access to a 2-socket Xeon like the OP's https://ark.intel.com/products/75283/Intel-Xeon-Processor-E5-2697-v2-30M-Cache-2-70-GHz- – Peter Cordes Jan 17 '19 at 00:40
  • @PeterCordes It's a dual socket E5-2680 v2. BTW, IACA estimates a throughput of 1.24, way off. The bottomline I guess is that there are a total of 5 uops, the loop stream detector will allocate 4 of them in one cycle and 1 in the next cycle, so an upper bound on performance is 2c/iter, pretty close to 2.3c/iter. – Hadi Brais Jan 17 '19 at 00:52
  • Heh, IACA strikes again. It probably doesn't know about how SnB/IvB's LSD differ from HSW and later, or doesn't model LSD effects in the front-end at all. [Is performance reduced when executing loops whose uop count is not a multiple of processor width?](https://stackoverflow.com/a/53148682) shows that an issue group can't include uops from before and after the loop branch until HSW. i.e. that it was a new feature in HSW to do something equivalent to unrolling tiny loops in the LSD, mitigating the multiple-of-4 effect. – Peter Cordes Jan 17 '19 at 00:57
  • @PeterCordes Do you know of any good Q/A that I can refer to in my answer where it explains the un-lamination of indexed stores? – Hadi Brais Jan 17 '19 at 01:02
  • So the OP has a 12-core vs. your 10-core per socket, so that's probably not a significant difference. Looking more and more likely that their `perf` numbers are misleading, unless initialized static data is way different from the BSS. Or maybe if a snoop-mode setting can make a big difference. Or if they have a BIOS setting that disabled some prefetching. Or possibly if they don't have all their DRAM sockets populated, but that's a stretch. Unless maybe their process was using RAM on the other socket, maybe if they have RAM only on one of their sockets. But still prefetch should hide that. – Peter Cordes Jan 17 '19 at 01:04
  • [Micro fusion and addressing modes](https://stackoverflow.com/q/26046634) is a pretty good canonical including stores as an example for SnB vs. HSW, and I think it's also documented in the optimization manual. – Peter Cordes Jan 17 '19 at 01:04
  • @Peter - one way to exclude warmup is just to use the --delay argument of perf to start after the warmup. I just print out the time since process start with `clock()` or similar so I know what argument to use. It's not 100% precise but if you want you can make it even more precise by putting a hot or cold sleep right before the region of interest (after printing the time) so you have a larger target to "hit" with the delay argument (so you don't miss any events of interest). – BeeOnRope Jan 27 '19 at 19:36
  • @BeeOnRope Have you seen my update to [this](https://stackoverflow.com/questions/25774190/l1-memory-bandwidth-50-drop-in-efficiency-using-addresses-which-differ-by-4096)? I was not able to tag you there. Both loads actually do alias. You can remove it from your performance quirks list on GitHub. – Hadi Brais Jan 27 '19 at 19:46
  • @HadiBrais - I hadn't seen it. So we just got confused about the aliasing math or the simulator had a bug or something? – BeeOnRope Feb 01 '19 at 03:11
  • @BeeOnRope Yes that was just my mistake, I got the addresses wrong. Sorry about the confusion. It's clear now that both loads should alias. I've cleaned up the whole answer accordingly. – Hadi Brais Feb 01 '19 at 16:45

My first question is about 0.60 stalled cycles per instruction. This seems like a very low number for code that accesses LLC/DRAM all the time, as the data is not cached. With LLC latency around 30 cycles and main memory around 100 cycles, how is this achieved?

My second question is related; it seems that the prefetcher is doing a relatively good job (not surprising, it's an array, but still): we hit the LLC instead of DRAM 60% of the time. Still, why does it fail the other times? Which bandwidth/part of the uncore makes the prefetcher fail to accomplish its task?

With prefetchers. Specifically, depending on which CPU it is, there may be a "TLB prefetcher" fetching virtual memory translations, plus a cache line prefetcher that's fetching data from RAM into L3, plus an L1 or L2 prefetcher fetching data from L3.

Note that caches (e.g. L3) work on physical addresses, their hardware prefetchers detect and prefetch sequential accesses to physical addresses, and because of virtual memory management/paging the physical accesses are "almost never" sequential across page boundaries. For this reason the prefetcher stops prefetching at page boundaries and probably takes three "non-prefetched" accesses to start prefetching from the next page.

Also note that if RAM was slower (or the code was faster) the prefetcher wouldn't be able to keep up and you'd stall more. For modern multi-core machines, the RAM is often fast enough to keep up with one CPU, but can't keep up with all CPUs. What this means is that outside of "controlled testing conditions" (e.g. when the user is running 50 processes at the same time and all CPUs are pounding RAM) your benchmark will be completely wrong. There are also things like IRQs, task switches and page faults that can/will interfere (especially when the computer is under load).

Last, but not least: I know that Intel can pipeline instructions; is it also the case for such mov with memory operands?

Yes; but a normal mov involving memory (e.g. mov byte ptr [rdi+rax*1], cl) will also be restricted by the "write ordered with store forwarding" memory ordering rules.

Note that there are many ways to speed up the copy, including using non-temporal stores (to deliberately break/bypass the memory ordering rules), using rep movs (which is specially optimised to work on whole cache lines where possible), using much larger pieces (e.g. AVX2 copying 32 bytes at a time), doing the prefetching yourself (especially at page boundaries), and doing cache flushing (so that caches still contain useful things after the copy has been done).
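
As a rough illustration of the non-temporal-store idea combined with 32-byte chunks (a sketch only: it assumes AVX is available, dst is 32-byte aligned, count is a multiple of 32, and copy_nt is just a made-up name):

#include <immintrin.h>
#include <stddef.h>

/* Sketch: copy 32 bytes per iteration, using non-temporal (streaming) stores
   so the destination data bypasses the caches instead of evicting useful lines. */
void copy_nt(unsigned char *dst, const unsigned char *src, size_t count)
{
    for (size_t i = 0; i < count; i += 32)
    {
        __m256i v = _mm256_loadu_si256((const __m256i *)(src + i)); /* 32-byte load */
        _mm256_stream_si256((__m256i *)(dst + i), v);               /* NT store (needs 32-byte alignment) */
    }
    _mm_sfence(); /* make the NT stores globally visible/ordered before later normal stores */
}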

However, it's far better to do the opposite - deliberately make large copies very slow, so that the programmer notices that they suck and are "forced" to try to find a way to avoid doing the copy. It can cost 0 cycles to avoid copying 20 MiB, which is significantly faster than the "least worst" alternative.

Brendan
  • IvB introduced a next-page prefetcher. Not sure if that's next *virtual* page or next *physical* page, but see also [When should we use prefetch?](https://stackoverflow.com/a/20758769). On IvB, the OP's loop runs at *best* 1 iter per 2 cycles, because it's 5 total uops. (Un-lamination of the indexed store). I expect most of the time it will hit in L1d, and the 60% L3 hit rate is only out of loads that get past L2. – Peter Cordes Jan 16 '19 at 14:16
  • @PeterCordes: The "next page prefetcher" is the "TLB prefetcher" I mentioned - it fetches the virtual to physical translation for the next page (and doesn't fetch any of the contents of the virtual or physical page - the contents are still pre-fetched by "cache line pre-fetcher/s"); and the "next page prefetcher"/"TLB prefetcher" probably also has its own limits (e.g. might not prefetch if you cross a 1 GiB, 512 GiB or something else boundary, where a different page directory or page directory pointer table is involved). – Brendan Jan 17 '19 at 07:00
  • Note that I rarely care about the exact number of cycles on a specific model of CPU from a specific manufacturer under specific conditions. This is almost always pointless (e.g. if your software will be released in 5 years, then maybe you should care about CPUs that won't exist for 5 years); and by optimizing for what you happen to have now (e.g. with profiler) there's a danger of making performance worse (on whatever users have). Unless you're writing a compiler back-end; ignore the irrelevant model specific nonsense and rely on generic information that applies to a large variety of CPUs. – Brendan Jan 17 '19 at 07:11
  • Oh, so there's probably still no mechanism to prefetch *data* from across a 4k boundary (other than SW prefetch or having enough load buffers for a demand-load to execute that far ahead of where data is arriving at the core)? It's been a while since I looked at HW prefetch details, but I seem to recall most of them normally stop at 4k boundaries, rather than optimistically continuing into the next 4k in hopes of being inside a hugepage or contiguous physical pages. – Peter Cordes Jan 17 '19 at 07:32
  • @PeterCordes: I'm not sure; but I suspect not - it'd be hard for "LLC prefetcher shared by many CPUs" to depend on "per logical CPU TLBs", much easier to keep the pieces independent, and not likely to help a lot anyway (with 4 KiB pages and 64 byte cache lines, it'd be less than 5% of "not prefetched"). – Brendan Jan 17 '19 at 14:07
  • Also note that all these prefetchers can over-fetch, and each has pathological cases. For example, imagine a linked list of 192 byte (3 cache line) structures, where hardware prefetcher won't prefetch what you want (because of "3 accesses to start") and may then waste bandwidth trying to prefetch data you don't want (beyond the end of the structure). – Brendan Jan 17 '19 at 14:14
  • The main prefetchers are in private per-core L2 cache on Intel, not in shared L3. Each physical core can track 1 forward and 1 backward stream per 4k page. That doesn't give easy access to L1dTLB either, but it's at least plausible. And there is some prefetching into L1, but using virtual addresses would cost extra TLB lookups which costs power and either read ports or competes with real loads/stores. – Peter Cordes Jan 17 '19 at 21:05