
I have a particular function call whose first instruction misses in the instruction cache more often than I'd like. Because of where the function is called from, code-layout optimization is not a good option. These conclusions come from extensive profiling and execution-trace analysis.

Is there any way to prefetch into the instruction cache from software? Something like `__builtin_prefetch(&function)`, but that targets the data cache. Can I trigger a prefetch into the instruction cache from source code?

To be clear, calling the function gives me something like `call 0x555` in the assembly, where 0x555 is the address of the function, and I want to ensure the code at 0x555 is already in the instruction cache before `call 0x555` executes.
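For concreteness, here is a minimal sketch of the workaround I'm describing, assuming GCC on x86-64 (`hot_function` is a made-up placeholder for the real callee):

```c
#include <stdint.h>

/* Placeholder for the real callee whose first instruction keeps
 * missing in the instruction cache. */
void hot_function(void) { }

int main(void) {
    /* __builtin_prefetch issues a *data* prefetch (prefetcht0 on x86 at
     * locality 3), so at best this pulls the function's bytes into the
     * unified L2/L3 caches, not into the L1 instruction cache. */
    __builtin_prefetch((const void *)(uintptr_t)&hot_function, 0, 3);

    /* ... enough other work for the prefetch to complete ... */

    hot_function();   /* the "call 0x555" whose target I want cached */
    return 0;
}
```

As far as I can tell this only warms the unified outer caches, which is exactly what I'm trying to improve on.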

whoami
  • Standard C and C++ don't have any such thing, so you're looking for implementation-dependent functions. What compiler / architecture? – Nate Eldredge Oct 18 '21 at 23:53
  • This would be under QEMU, so actually ideally the architecture would be flexible. But for now, let's say x86. And the compiler is GCC. – whoami Oct 18 '21 at 23:55
  • Note many machines have unified L2/L3 cache, so data prefetch may still be better than nothing. I'm not aware of any icache prefetch functionality on x86. – Nate Eldredge Oct 18 '21 at 23:55
  • That's a good point - I did notice even with the data prefetch (putting it in L3), the miss ratio went down a bit. – whoami Oct 18 '21 at 23:56
  • I'd add the `x86-64` tag to your question; someone like Peter Cordes may have some good insights about this. – WBuck Oct 18 '21 at 23:57
  • Other architectures might, though, e.g. ARM64 has the `PRFM PLI*` family (a sketch of that hint follows these comments). Note that with regard to QEMU, a separate question is whether the emulator actually does any prefetching or if it just ignores the hint. – Nate Eldredge Oct 18 '21 at 23:59
  • Looks like this question has already been asked: https://stackoverflow.com/questions/48571263/bring-code-into-the-l1-instruction-cache-without-executing-it – WBuck Oct 19 '21 at 00:03
  • @WBuck: Yes, but note that the use-case there was to prime things for a microbenchmark, *not* as part of getting an *overall* speedup. If you just want that, prefetch into L2 cache with a standard data prefetch would probably be better, since the answer on the linked duplicate involves intentionally causing a branch mispredict, as well as probably an I-cache miss. (Although the I-cache miss can be in a mis-speculated branch so it doesn't have to stall for it.) I added [X86 prefetching optimizations: "computed goto" threaded code](https://stackoverflow.com/q/46321531) to the dup list – Peter Cordes Oct 19 '21 at 03:24
  • Possibly also related: [Cost of a 64bits jump, always 10-22 cycles the first time?](https://stackoverflow.com/q/39682209) – Peter Cordes Oct 19 '21 at 03:25
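Following up on the `PRFM PLI*` comment above, here is a minimal sketch of how that hint could be emitted with GCC inline asm, assuming an AArch64 target (`hot_function` and `prefetch_insn` are made-up names, and the hint may be a no-op on a given core or under QEMU):

```c
#include <stdint.h>

/* Placeholder for the real callee. */
void hot_function(void) { }

/* AArch64 only: PRFM PLIL1KEEP hints that the line containing 'addr'
 * should be preloaded into the L1 instruction cache. The hint may be
 * ignored entirely (e.g. by QEMU or by cores that don't implement it). */
static inline void prefetch_insn(const void *addr) {
#if defined(__aarch64__)
    __asm__ volatile("prfm plil1keep, [%0]" : : "r"(addr));
#else
    (void)addr;   /* no comparable instruction-prefetch hint assumed here */
#endif
}

int main(void) {
    prefetch_insn((const void *)(uintptr_t)&hot_function);
    /* ... other work ... */
    hot_function();
    return 0;
}
```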

0 Answers