I want to prefetch some code into the instruction cache. The code path is used infrequently but I need it to be in the instruction cache or at least in L2 for the rare cases that it is used. I have some advance notice of these rare cases. Does _mm_prefetch work for code? Is there a way to get this infrequently used code in cache? For this problem I don't care about portability so even asm would do.
-
Why would you want this? It sounds like you're slowing every other part of your code for the least likely to be used part. Overall I would think everything would execute slower. – andre Apr 25 '13 at 15:34
-
@andre A function may be rarely used but still very sensitive to latency when it is used. – rob mayoff Apr 25 '13 at 15:39
-
@andre That can be a useful technique when there is *"advance notice of these rare cases"*. Some processors have an instruction specifically for that. – Drew Dormann Apr 25 '13 at 15:40
-
Yeeeeeah, are you sure you understand what you are requesting exactly? Have you determined that this infrequently used code is a performance bottleneck? Or is there another reason you need it prefetched? – TheBuzzSaw Apr 25 '13 at 15:48
-
Yes I do understand what I want. This is for a very latency sensitive operation. Imagine you are trying to identify explosions from some sensor data and you need to react as soon as possible. Reaction is thankfully very infrequent but sensing/checking etc. is very frequent. – Carlos Pinto Coelho Apr 25 '13 at 16:22
-
And I did profile the code using VTune etc., and for the infrequently called function I am getting many icache misses and have a large CPI. – Carlos Pinto Coelho Apr 25 '13 at 16:26
-
Is it for sure that the function cache misses are the cause of the problem and not a symptom? I could think of a host of reasons for cache misses. – andre Apr 25 '13 at 16:32
-
Related: **[Bring code into the L1 instruction cache without executing it](https://stackoverflow.com/questions/48571263/bring-code-into-the-l1-instruction-cache-without-executing-it)** wants the same thing, but for microbenchmarking rather than keeping a latency-sensitive function hot in L1I$. Still, you could use some of those ideas to keep L1I$ primed while waiting for the event you need to react to. – Peter Cordes Apr 05 '18 at 19:59
-
Yeah, FWIW I used the approach described in [this answer](https://stackoverflow.com/a/48572334/149138) and it worked great for me (confirmed by monitoring performance counters). My functions are small and written in assembly, which makes that approach easy. In other cases it would be more difficult. – BeeOnRope Apr 05 '18 at 20:26
2 Answers
The answer depends on your CPU architecture.
That said, if you are using gcc or clang, you can use the `__builtin_prefetch` built-in function to try to generate a prefetch instruction. On Pentium 3 and later x86-type architectures, this will generate a `PREFETCHh` instruction, which requests a load into the data cache hierarchy. Since these architectures have unified L2 and higher caches, it may help.
The function looks like this:

`void __builtin_prefetch(const void *addr, int rw, int locality);`

The `rw` and `locality` arguments are optional; `rw` should be 0 (the default) for a read prefetch, and `locality` should be in the range 0...3. Assuming `locality` maps directly to the *h* part of the `PREFETCHh` instruction, you want to pass 1 or 2, which ask for the data to be loaded into the L2 and higher caches. See Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2B: Instruction Set Reference, M-Z (PDF), page 4-277.
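Putting it together, here is a minimal sketch of how you might warm the cache lines holding a function's machine code; `rare_handler`, the 512-byte span, and the 64-byte line size are placeholders to tune for your own code:

```c
#include <stddef.h>

void rare_handler(void);   /* hypothetical: the cold, latency-sensitive function */

/* Prefetch the bytes of a function's code through the data-prefetch path.
 * Casting a function pointer to a data pointer (as in the usage below) is not
 * strictly portable C, but gcc/clang accept it on common x86 targets. */
static inline void warm_code(const void *fn, size_t bytes)
{
    const char *p = (const char *)fn;
    for (size_t off = 0; off < bytes; off += 64)        /* assume 64-byte cache lines */
        __builtin_prefetch(p + off, 0 /* read */, 2 /* moderate locality */);
}

/* When the "advance notice" arrives: */
/* warm_code((const void *)rare_handler, 512); */
```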
If you're using another compiler that doesn't have `__builtin_prefetch`, see whether it has the `_mm_prefetch` function. You may need to include a header file to get that function. For example, on OS X, that function, and constants for the `locality` argument, are declared in `xmmintrin.h`.
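The same idea with the SSE intrinsic, again only a sketch with `rare_handler` and the 512-byte span as placeholders:

```c
#include <xmmintrin.h>

void rare_handler(void);   /* hypothetical cold function */

static void warm_code_sse(void)
{
    /* Function-pointer-to-data-pointer cast: non-portable but common practice. */
    const char *p = (const char *)rare_handler;
    for (int off = 0; off < 512; off += 64)
        _mm_prefetch(p + off, _MM_HINT_T1);   /* T1: into L2 and outer levels */
}
```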

-
Thanks, this is similar to the _mm_prefetch op for gcc right? I guess I need to prefetch more of my function to actually see any benefits. – Carlos Pinto Coelho Apr 25 '13 at 16:27
-
I believe `_mm_prefetch` is a `#define` for `__builtin_prefetch` under both gcc and clang. The `PREFETCHh` instruction probably requests one cache line, so you will need multiple instructions unless your function fits in one cache line (probably 64 bytes). – rob mayoff Apr 25 '13 at 16:33
-
Note that even when L2 is unified between code and data (and it often is nowadays), the TLB is rarely shared. As such, the prefetch instruction, when present, will most likely use the data TLB (the Intel documentation isn't explicit, but it uses the word 'data' everywhere). So when the code gets to run the prefetched bytes, it might generate a TLB miss, but the entries required to resolve it will already be primed in the L2 (page-walking entries are stored in L2 like normal data). – Nicholas Frechette Jul 22 '15 at 02:47
-
I haven't tried it on gcc, but on clang `locality` 0 maps to `prefetcht0` and 1 to `prefetchw`. Other values are not accepted for that parameter. – Björn Lindqvist Jun 12 '16 at 13:31
There isn't any (official [1] x86) instruction to prefetch code, only data. I find this a rather bizarre use-case, where the code path is known beforehand but executes rarely, and there is a significant benefit in prefetching the code. It would be great to understand how you've come to the conclusion that there is a significant benefit in pre-loading the code for this special case, since it would require not only analyzing that the code is significantly slower when it hasn't been hit for a long time, but also determining that there are spare bus cycles to actually load the code before the processor can prefetch it by its normal mechanism for loading code.
You may be able to use the `prefetch` instructions that fetch into L2, which is typically shared between I-cache and D-cache.
[1] I know there are some "secret" instructions that allow the processor to manipulate cache content, but those would require a lot of extra work, even if you could use them in user-mode code [and I expect this is not some kernel-mode code].
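If you want to go straight to the instruction, here is a sketch with GNU inline asm, since the question allows non-portable solutions; `rare_handler` is a placeholder, and each `prefetcht1` only covers one 64-byte line, so you would repeat it across the function:

```c
void rare_handler(void);   /* hypothetical cold function */

/* Ask the hardware to pull one cache line at addr into L2/L3 via the
 * data-side prefetch path (gcc/clang inline asm, x86). */
static inline void prefetch_line_l2(const void *addr)
{
    __asm__ __volatile__("prefetcht1 (%0)" : : "r"(addr));
}

/* e.g. prefetch_line_l2((const void *)rare_handler); */
```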

-
L2 cache is shared between code and data, so using the same instructions should prefetch it to L2. – interjay Apr 25 '13 at 15:46
-
I did try `_mm_prefetch(&func, _MM_HINT_T1)`; perhaps I need to prefetch further into the function. – Carlos Pinto Coelho Apr 25 '13 at 16:23
-
Have you measured the difference? What effect did it have? Are you sure the memory is present at all? – Mats Petersson Apr 25 '13 at 16:26