
I am reading about the different prefetchers available in Intel Core i7 systems. I have performed experiments to understand when these prefetchers are invoked.

These are my findings:

  1. The L1 IP prefetcher starts prefetching after 3 cache misses. It only prefetches on a cache hit.

  2. The L2 adjacent-line prefetcher starts prefetching after the 1st cache miss and prefetches on a cache miss.

  3. The L2 H/W (stride) prefetcher starts prefetching after the 1st cache miss and prefetches on a cache hit.

I am not able to understand the behavior of the DCU prefetcher. When does it start prefetching, i.e., when is it invoked? Does it prefetch the next cache line on a cache hit or on a miss?

I have explored the Intel document disclosure-of-hw-prefetcher, which mentions that the DCU prefetcher fetches the next cache line into the L1-D cache, but it gives no clear information on when it starts prefetching.

Can anyone explain when the DCU prefetcher starts prefetching?
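For reference, here is a minimal sketch of the kind of latency probe behind these findings. The 40-cycle threshold is a machine-specific assumption, and `rdtscp` only loosely orders the timed load, so treat it as a sketch rather than a precise measurement:

```c
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

/* Time a single load with rdtscp to classify it as an L1D hit
   (likely prefetched) or a miss. The 40-cycle threshold is a guess
   and must be calibrated per machine. */
static inline uint64_t time_load(volatile char *p)
{
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    (void)*p;                       /* the load under test */
    uint64_t t1 = __rdtscp(&aux);
    return t1 - t0;
}

int main(void)
{
    static char buf[4096] __attribute__((aligned(4096)));

    (void)*(volatile char *)&buf[0];    /* training access to line 0 */
    /* ... more training accesses go here, depending on the experiment ... */

    uint64_t dt = time_load(&buf[64]);  /* probe the next line */
    printf("%llu cycles -> %s\n", (unsigned long long)dt,
           dt < 40 ? "likely in L1D (prefetched)" : "likely not prefetched");
    return 0;
}
```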

bholanath
  • Are you asking about what Intel calls the DCU prefetcher in the manual? There is no such thing as an L1 adjacent-line prefetcher in any Intel processor. – Hadi Brais Nov 29 '18 at 00:00
  • Yes, I am talking about DCU prefetcher. – bholanath Nov 30 '18 at 06:08
  • According to this link https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors , DCU Prefetcher Fetches the next cache line into L1-D cache. – bholanath Nov 30 '18 at 06:51
  • Well that was a little confusing, because the "adjacent line prefetcher" term and the "DCU prefetcher" term have different specific meanings. Peter's answer would have been correct if you meant the adjacent line prefetcher. You should probably edit the question to use the DCU prefetcher term instead. – Hadi Brais Nov 30 '18 at 07:50
  • Can you share your results and tests for the other three prefetchers? – BeeOnRope Nov 30 '18 at 22:18
  • @bholanath Hi, I'm also doing an experiment on the L1 IP prefetcher. What does "It only prefetches on a cache hit" mean? At the same time, I used an access pattern of four cache misses with a constant stride, but I didn't see any more cache lines being prefetched. So can you share your first test code to let me see what happens? Or can you help have a look at my [test code](https://github.com/zyma-Jasper/StridePrefetchTest)? – JasperMa Feb 24 '21 at 23:39

2 Answers


The DCU prefetcher does not prefetch lines in a deterministic manner. It appears to have a confidence value associated with each potential prefetch request; the prefetch is triggered only if the confidence exceeds some threshold. Moreover, it seems that if both L1 prefetchers are enabled, only one of them can issue a prefetch request in the same cycle. Perhaps the prefetch from the one with higher confidence is accepted. The answer below does not take these observations into consideration. (A lot more experimentation work needs to be done. I will rewrite it in the future.)


The Intel manual tells us a few things about the DCU prefetcher. Section 2.4.5.4 and Section 2.5.4.2 of the optimization manual both say the following:

Data cache unit (DCU) prefetcher -- This prefetcher, also known as the streaming prefetcher, is triggered by an ascending access to very recently loaded data. The processor assumes that this access is part of a streaming algorithm and automatically fetches the next line.

Note that Section 2.4.5.4 is part of the section on Sandy Bridge and Section 2.5.4.2 is part of the section on Intel Core. The DCU prefetcher was first supported on the Intel Core microarchitecture, and it's also supported on all later microarchitectures. There is no indication, as far as I know, that the DCU prefetcher has changed over time. So I think it works exactly the same on all microarchitectures, up to Skylake at least.

That quote doesn't really say much. The "ascending access" part suggests that the prefetcher is triggered by multiple accesses with increasing offsets. The "recently loaded data" part is vague. It may refer to one or more lines that immediately precede the line to be prefetched in the address space. It's also not clear whether that refers to virtual or physical addresses. The "fetches the next line" part suggests that it fetches only a single line every time it's triggered and that line is the line that succeeds the line(s) that triggered the prefetch.

I've conducted some experiments on Haswell with all prefetchers disabled except for the DCU prefetcher (a sketch of how the prefetchers can be toggled follows the list below). I've also disabled hyperthreading. This enables me to study the DCU prefetcher in isolation. The results show the following:

  • The DCU prefetcher tracks accesses for up to 4 different 4KB (probably physical) pages.
  • The DCU prefetcher gets triggered when there are three or more accesses to one or more lines within the same cache set. The accesses must be either demand loads or software prefetches (any prefetch instruction including prefetchnta) or a combination of both. The accesses can be either hits or misses in the L1D or a combination of both. When it's triggered, for the 4 pages that are currently being tracked, it will prefetch the immediate next line within each of the respective pages. For example, consider the following three demand load misses: 0xF1000, 0xF2008, and 0xF3004. Assume that the 4 pages being tracked are 0xF1000, 0xF2000, 0xF3000, and 0xF4000. Then the DCU prefetcher will prefetch the following lines: 0xF1040, 0xF2040, 0xF3040, and 0xF4040.
  • The DCU prefetcher gets triggered when there are three or more accesses to one or more lines within two consecutive cache sets. Just like before, the accesses must be either demand loads or software prefetches. The accesses can be either hits or misses in the L1D. When it's triggered, for the 4 pages that are currently being tracked, it will prefetch the immediate next line within each of the respective pages with respect to the accessed cache set that has a smaller physical address. For example, consider the following three demand load misses: 0xF1040, 0xF2048, and 0xF3004. Assume that the 4 pages being tracked are 0xF1000, 0xF2000, 0xF3000, and 0xF4000. Then the DCU prefetcher will prefetch the following lines: 0xF3040 and 0xF4040. There is no need to prefetch 0xF1040 or 0xF2040 because there are already requests for them.
  • The prefetcher will not prefetch into the next 4KB page. So if the three accesses are to the last line in the page, the prefetcher will not be triggered.
  • The pages to be tracked are selected as follows. Whenever a demand load or a software prefetch accesses a page, that page will be tracked and it will replace one of the 4 pages currently being tracked. I've not investigated further the algorithm used to decide which of the 4 pages to replace. It's probably simple though.
  • When a new page gets tracked because of an access of the type mentioned in the previous bullet point, at least two more accesses are required to the same page and same line to trigger the prefetcher to prefetch the next line. Otherwise, a subsequent access to the next line will miss in the L1 if the line was not already there. After that, either way, the DCU prefetcher behaves as described in the second and third bullet points. For example, consider the following three demand load misses: 0xF1040, 0xF2048, and 0xF3004. There are two accesses to the same line and the third one is to the same cache set but different line. These accesses will make the DCU prefetcher track the two pages, but it will not trigger it just yet. When the prefetcher sees another three accesses to any line in the same cache set, it will prefetch the next line for those pages that are currently being tracked. As another example, consider the following three demand load misses: 0xF1040, 0xF2048, and 0xF3030. These accesses are all to the same line so they will not only make the prefetcher track the page but also trigger a next line prefetch for that page and any other pages that are already being tracked.
  • It seems to me that the prefetcher is receiving the dirty flag from the page table entry of the page being accessed (via the TLB). The flag indicates whether the page is dirty or not. If it's dirty, the prefetcher will not track the page, and accesses to the page will not be counted towards the three accesses required for the triggering condition to be satisfied. So it seems that the DCU prefetcher simply ignores dirty pages. That said, the page doesn't have to be read-only to be supported by the prefetcher. However, more thorough investigation is required to understand more accurately how stores may interact with the DCU prefetcher.
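For reference, the prefetcher configuration used in these experiments can be set through MSR 0x1A4, whose bit layout is given in the Intel disclosure document linked in the question (bit 0: L2 HW prefetcher, bit 1: L2 adjacent-line prefetcher, bit 2: DCU prefetcher, bit 3: DCU IP prefetcher; a set bit disables the corresponding prefetcher). Here's a minimal sketch using the Linux msr driver; it assumes root privileges and that the msr module is loaded:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Disable all prefetchers except the DCU prefetcher on logical CPU 0.
   MSR 0x1A4 layout per Intel's disclosure document: bit 0 = L2 HW,
   bit 1 = L2 adjacent line, bit 2 = DCU, bit 3 = DCU IP.
   A set bit disables the corresponding prefetcher. */
int main(void)
{
    int fd = open("/dev/cpu/0/msr", O_RDWR);  /* needs root and the msr module */
    if (fd < 0) { perror("open"); return 1; }

    uint64_t val;
    if (pread(fd, &val, sizeof val, 0x1a4) != sizeof val) { perror("pread"); return 1; }
    val |= 0x3;      /* disable both L2 prefetchers */
    val |= 0x8;      /* disable the DCU IP prefetcher */
    val &= ~0x4ULL;  /* leave the DCU prefetcher enabled */
    if (pwrite(fd, &val, sizeof val, 0x1a4) != sizeof val) { perror("pwrite"); return 1; }

    close(fd);
    return 0;
}
```

The same write has to be repeated for every logical CPU; `wrmsr -a 0x1a4 0xb` from msr-tools does the equivalent in one shot.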

So the accesses that trigger the prefetcher don't have to be "ascending" or follow any order. The cache line offset itself seems to be ignored by the prefetcher. Only the physical page number matters.
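To make the address arithmetic concrete, here's a small sketch that decomposes the example addresses from the list above (assuming 64-byte lines, 4 KB pages, and the 64-set L1D of these parts, so the set index coincides with the line index within a page):

```c
#include <stdint.h>
#include <stdio.h>

/* Decompose the example addresses: 64-byte lines, 4 KB pages, and a
   32 KB, 8-way L1D with 64 sets. All three map to set 0, and the
   "next lines" are 0xF1040, 0xF2040, and 0xF3040, as in the example. */
int main(void)
{
    uint64_t addrs[] = { 0xF1000, 0xF2008, 0xF3004 };
    for (int i = 0; i < 3; i++) {
        uint64_t a    = addrs[i];
        uint64_t page = a >> 12;                /* physical page number */
        uint64_t set  = (a >> 6) & 0x3F;        /* L1D set index */
        uint64_t next = ((a >> 6) + 1) << 6;    /* line the DCU prefetcher fetches */
        printf("0x%llX: page 0x%llX, set %llu, next line 0x%llX\n",
               (unsigned long long)a, (unsigned long long)page,
               (unsigned long long)set, (unsigned long long)next);
    }
    return 0;
}
```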

I think the DCU prefetcher has a fully associative buffer that contains 4 entries. Each entry is tagged with the (probably physical) page number and has a valid bit to indicate whether the entry contains a valid page number. In addition, each cache set of the L1D is associated with a 2-bit saturating counter that is incremented whenever a demand load or a software prefetch request accesses the corresponding cache set and the dirty flag of the accessed page is not set. When the counter reaches a value of 3, the prefetcher is triggered. The prefetcher already has the physical page numbers from which it needs to prefetch; it can obtain them from the buffer entry that corresponds to the counter. So it can immediately issue prefetch requests to the next cache lines for each of the pages being tracked by the buffer. However, if a fill buffer is not available for a triggered prefetch request, the prefetch will be dropped. Either way, the counter is then reset to zero. Since page table mappings might change, it's possible that the prefetcher flushes its buffer whenever the TLB is flushed.
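As an illustration only, here is a toy software model of that hypothesized design; the 4-entry buffer, the per-set 2-bit counters, and the round-robin replacement are all guesses, not a confirmed implementation:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NPAGES 4   /* hypothesized number of tracked pages */
#define NSETS  64  /* L1D sets */

static uint64_t pages[NPAGES];  /* tracked (physical) page numbers */
static bool     valid[NPAGES];
static unsigned victim;         /* round-robin replacement: a pure guess */
static unsigned counter[NSETS]; /* 2-bit saturating counter per set */

/* Model one demand load / SW prefetch to a non-dirty page. */
static void dcu_access(uint64_t paddr)
{
    uint64_t page = paddr >> 12;
    unsigned set  = (paddr >> 6) & (NSETS - 1);

    /* Track the page, replacing an entry if it's not already tracked. */
    bool tracked = false;
    for (int i = 0; i < NPAGES; i++)
        if (valid[i] && pages[i] == page) tracked = true;
    if (!tracked) {
        pages[victim] = page;
        valid[victim] = true;
        victim = (victim + 1) % NPAGES;
    }

    /* Count accesses to this set; trigger on the third. */
    if (counter[set] < 3) counter[set]++;
    if (counter[set] == 3) {
        counter[set] = 0;
        if (set + 1 < NSETS)  /* never prefetch into the next 4 KB page */
            for (int i = 0; i < NPAGES; i++)
                if (valid[i])
                    printf("prefetch 0x%llX\n", (unsigned long long)
                           ((pages[i] << 12) | ((uint64_t)(set + 1) << 6)));
    }
}

int main(void)
{
    /* The example from the list above: three misses in the same set. */
    dcu_access(0xF1000);
    dcu_access(0xF2008);
    dcu_access(0xF3004);  /* triggers prefetches of 0xF1040, 0xF2040, 0xF3040 */
    return 0;
}
```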

It could be the case that there are two DCU prefetchers, one for each logical core. When hyperthreading is disabled, one of the prefetchers would be disabled too. It could also be the case that the 4 buffer entries that contain the page numbers are statically partitioned between the two logical cores and combined when hyperthreading is disabled. I don't know for sure, but such a design makes sense to me. Another possible design would be for each prefetcher to have a dedicated 4-entry buffer. It's not hard to determine how the DCU prefetcher works when hyperthreading is enabled; I just didn't spend the effort to study it.

All in all, the DCU prefetcher is by far the simplest among the 4 data prefetchers that are available in modern high-performance Intel processors. It seems that it's only effective when sequentially, but slowly, accessing small chunks of read-only data (such as read-only files and statically initialized global arrays), or when accessing multiple read-only objects at the same time that may contain many small fields and span a few consecutive cache lines within the same page.

Section 2.4.5.4 also provides additional information on L1D prefetching in general, so it applies to the DCU prefetcher.

Data prefetching is triggered by load operations when the following conditions are met:

  • Load is from writeback memory type.

This means that the DCU prefetcher will not track accesses to the WP and WT cacheable memory types.

  • The prefetched data is within the same 4K byte page as the load instruction that triggered it.

This has been verified experimentally.

  • No fence is in progress in the pipeline.

I don't know what this means. See: https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/805373.

  • Not many other load misses are in progress.

There are only 10 fill buffers that can hold requests that missed the L1D. This raises the question, though: if only a single fill buffer were available, would the hardware prefetcher use it or leave it for anticipated demand accesses? I don't know.

  • There is not a continuous stream of stores.

This suggests that if there is a stream of a large number of stores intertwined with few loads, the L1 prefetcher will ignore the loads and basically temporarily switch off until the stores become a minority. However, my experimental results show that even a single store to a page will turn the prefetcher off for that page.

All Intel Atom microarchitectures have the DCU prefetcher, although it might track fewer than 4 pages on those microarchitectures.

None of the Xeon Phi microarchitectures up to and including Knights Landing have the DCU prefetcher. I don't know about later Xeon Phi microarchitectures.

Hadi Brais
  • *No fence is in progress in the pipeline.* I assume that means no StoreLoad barrier (`mfence` or `lock`ed instruction) is in flight, waiting for all pending stores to commit to L1d. It's maybe not as useful to do load prefetches if there's a StoreLoad barrier pending because the potentially-stale data may have to be re-fetched to satisfy the barrier semantics. And it could cause extra contention; barriers are normally only used in code that interacts with other threads. – Peter Cordes Nov 30 '18 at 08:07
  • Thank you @Hadi Brais for explaining in detail. I am accepting your answer. You said the DCU prefetcher gets triggered when there are three or more accesses to one or more lines within the same cache set, or within two consecutive cache sets. Can you give me some hints or ideas on how to do it so that I can verify this on my system? – bholanath Nov 30 '18 at 15:17
  • I tried this way to verify whether the DCU prefetcher triggers after 3 or more accesses to cache lines in the same cache set. Here is my approach: (i) I created a 4KB array. (ii) Access A[0] once, then check whether A[16] is prefetched or not. (iii) Access A[0] twice consecutively, then check whether A[16] is prefetched or not. (iv) Access A[0] three times consecutively, then check whether A[16] is prefetched or not. I am expecting A[16] to be prefetched at step (iv). – bholanath Nov 30 '18 at 15:19
  • In this link http://www.manualsdir.com/manuals/733523/adlink-atca-6200a.html?page=55 , it says , DCU streamer prefetchers detect multiple reads to a single cache line in a certain period of time and choose to load the following cache line to the L1 data caches. – bholanath Nov 30 '18 at 15:23
  • @PeterCordes I've tried inserting `mfence`, `lfence`, or `lock`ed instructions in the sequence of instructions that train the prefetcher and in the sequence of instructions that test the prefetcher. Their existence in the code doesn't seem to be affecting the behavior of the DCU prefetcher. – Hadi Brais Nov 30 '18 at 16:34
  • @bholanath I will be able to post my code here, after cleaning it up, whenever I get the chance. The basic idea is to have two sequences of instructions: the first is the training seq and the second is the testing seq. The first sequence contains the access pattern for which we want to check whether that pattern triggers the prefetcher. The testing sequence contains the code that checks which lines got prefetched at which cache level by measuring latency accurately. To repeat the experiment, you have to start from a clean prefetching state (e.g., new memory locations). – Hadi Brais Nov 30 '18 at 16:45
  • Using the `clflush` instruction can be misleading. It'd be better to disable all other prefetchers when performing the experiments (although there will still be some L3 prefetching going on, but that's OK). – Hadi Brais Nov 30 '18 at 16:46
  • The statement from the manual you cited doesn't contradict the results. It is indeed correct. – Hadi Brais Nov 30 '18 at 16:49
  • Really interesting results. If you ever want to share your test code, I'd be interested to see it. About physical or virtual, which appears a lot in your answer - I think it is basically moot or "the same either way". That is, since the prefetcher doesn't cross pages, whether you interpret it as a physical or virtual address, you get the same result, since starting from a P or V address and talking about the "next line" (i.e., +64) gives the same result, except in the case of page crossing. FWIW, at a hardware level everything after the L1 is using physical addresses, AFAIK. – BeeOnRope Nov 30 '18 at 22:39
  • The above idea is true for all four prefetchers, BTW - whether you think in P or V you get the same result, of course except for the NPP (not one of the four, and AFAIK it cannot be disabled), which is the guy that bridges the gap, so to speak. – BeeOnRope Nov 30 '18 at 22:41
  • Were you able to determine whether the DCU prefetcher can generate misses all the way to DRAM, or does it operate only to the L2 and stop if it gets a miss there? – BeeOnRope Nov 30 '18 at 22:43
  • @BeeOnRope The L1 only needs the physical page number to access any cache line, so it would be more economic to send only the physical number and not the virtual one. The physical page number can be used for all purposes. But this impacts the behavior of the prefetcher because there can be multiple virtual pages that map to the same physical page or the mapping for a virtual page might change over time. Should aliased virtual pages in different apps be treated the same by the prefetcher? Different virtual pages might exhibit different access patterns... – Hadi Brais Nov 30 '18 at 23:04
  • ...even if they map to the same physical page. A prefetcher can be designed to consider accesses to each virtual page as a separate stream of accesses. That said, this issue is not that important for the simple DCU prefetcher. It will have a small impact on its triggering condition though. I'll update the answer to clarify why. Regarding your question, yes the DCU prefetcher's requests can go all the way to DRAM. Although now I'm thinking of another open question. If the TLB entry of a tracked page got evicted for some reason and a prefetcher request got missed in the TLB, would it be dropped? – Hadi Brais Nov 30 '18 at 23:07
  • I think yes because the TLB entry probably got evicted because it was not being accessed frequently anyway. – Hadi Brais Nov 30 '18 at 23:08
  • @HadiBrais - you are right about physical vs virtual tracking for PF being different when you consider aliasing pages, I hadn't considered that case. So I think you are right that the distinction is relevant and that it is most likely that the L1 PF works in the "physical domain", i.e., all of its inputs and outputs are physical addresses. This is fairly easy to implement, because even at the L1 cache, all addresses are essentially physical (sure, it may be VIPT, but that's largely indistinguishable from PIPT since the index bits are not translated). – BeeOnRope Dec 06 '18 at 22:07
  • @HadiBrais Hi Hadi, are you able to share your test code? I'm doing similar experiments on the L1 prefetcher as you (it also takes 3 misses to trigger the DCU prefetcher), but I can't see the next line being prefetched. – JasperMa Feb 24 '21 at 23:55
  • @JasperMa I've a long list of existing answers to improve or new answers to post, one of them is to post the code for this answer and present experimental results and analysis on several processors, but that takes a lot of time and I've very little free time, so it won't happen any time soon. – Hadi Brais Feb 25 '21 at 10:42
  • @HadiBrais Yeah, I understand that. So if you have some time, could you please have a look at this [code](https://stackoverflow.com/questions/66343997/in-which-conditions-the-l1-ip-based-stride-prefetcher-will-be-triggered)? I organized my code and really have no idea what the problem with it is. Thanks in advance. – JasperMa Feb 25 '21 at 23:51

AFAIK, Intel CPUs don't have an L1 adjacent-line prefetcher.

It has one in L2, though, which tries to complete a 128-byte aligned pair of 64-byte cache lines. (So it's not necessarily next, it could be the previous line if the demand-miss or other prefetch that caused one line to be cached was for the high half of a pair.)
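In a sketch, the pair-completing target is just the line address with bit 6 flipped:

```c
#include <stdint.h>

/* The adjacent line is the other half of a 128-byte aligned pair,
   i.e., the line address with bit 6 flipped. */
static inline uint64_t adjacent_line(uint64_t line_addr)
{
    return line_addr ^ 64;
}
/* adjacent_line(0x1000) == 0x1040 and adjacent_line(0x1040) == 0x1000 */
```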

See also https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/714832, and the many "related" links here on SO, e.g. prefetching data at L1 and L2. I'm not sure if either of those has any more details than the prefetch section of Intel's optimization manual, though: https://software.intel.com/en-us/articles/intel-sdm#optimization

I'm not sure if it has any heuristic to avoid wasting bandwidth and cache footprint when only one of a pair of lines is needed, other than not prefetching when there are enough demand misses outstanding.

Peter Cordes
  • I think the OP is referring to the DCU prefetcher, which is a next-line prefetcher. Otherwise, if the OP means by "adjacent" the other cache line of a pair of consecutive cache lines, then you'd be right. – Hadi Brais Nov 28 '18 at 23:54
  • There are four data prefetchers in total; the OP mentioned three in the numbered list, and so I think they are asking about the fourth. – Hadi Brais Nov 28 '18 at 23:56