
I'm trying to get an idea of how the instruction cache works.

  • How many extra cache lines get prefetched when a block of code is executed? Does prefetching take branch prediction into account?

  • If a block of code contains a function call, is the function code body loaded sequentially or in a different part of the cache?

For example, are the following code fragments the same?

if (condition) {
    // block of code that handles condition
}

and

if (condition) {
    handle_condition(); // function that handles the condition
}

Which one reduces "holes" in the instruction sequence if the condition is rarely true?

  • If my first example is part of code that runs frequently and the if condition is never true, will the body of the if statement eventually be evicted, leaving the rest of the code as-is?

I'm assuming these questions do not have answers that depend on specific micro-architectures. But in case they do, I have an x86-64 Intel Sandy Bridge.

Remus
  • Modern profiling compilers definitely try to pack the hot path in as few cache blocks as possible. – Ben Voigt Apr 10 '15 at 20:19
  • I suggest you read about function inlining. – EOF Apr 10 '15 at 20:42
  • @EOF - I'm talking about the case where the function does not get inlined. – Remus Apr 10 '15 at 20:45
  • @BenVoigt - In this case, the compiler wouldn't know whether the block is part of the hot path right? – Remus Apr 10 '15 at 20:48
  • You could use `gcc` and `clang` extensions to flag such conditions as `unlikely`. Check this post: http://stackoverflow.com/questions/10922607/gcc-likely-unlikely-macro-usage – chqrlie Apr 11 '15 at 10:10

1 Answer

Actually, the answer is very much micro-architectural dependent. This is not something defined by the x86 or any other architecture, but rather left for the designers to implement and improve over the various generations.

For Sandy Bridge, you can find an interesting description here. The most relevant part is:

The instruction fetch for Sandy Bridge is shown above in Figure 2. Branch predictions are queued slightly ahead of instruction fetch so that the stalls for a taken branch are usually hidden, a feature earlier used in Merom and Nehalem. Predictions occur for 32B of instructions, while instructions are fetched 16B at a time from the L1 instruction cache.

Once the next address is known, Sandy Bridge will probe both the uop cache (which we will discuss in the next page) and the L1 instruction cache. The L1 instruction cache is 32KB with 64B lines, and the associativity has increased to 8-way, meaning that it is virtually indexed and physically tagged. The L1 ITLB is partitioned between threads for small pages, with dedicated large pages per thread. Sandy Bridge added 2 entries for large pages, bringing the total to 128 entries for 4KB pages (for both threads) and 16 fully associative entries for large pages (for each thread).

In other words, and as the diagram also shows, branch prediction is the first stage of the pipeline and precedes the instruction-cache access. The cache therefore holds the address "trace" as predicted by the branch predictor. If a certain code snippet is hardly ever reached, the predictor will steer fetch away from it and it will age out of the I-cache over time. Since the branch predictor handles function calls as well, there shouldn't be a fundamental difference between your two code snippets.

This of course breaks down due to alignment issues (the I-cache has 64B lines and can't hold partial lines, so inlined code may actually incur more useless overhead than a function call, although both are bounded), and of course due to branch mispredictions. It's also possible that other HW prefetchers are at work and may fetch code into other cache levels, but that isn't officially disclosed (the guides only mention some L2 prefetching that may reduce latency without thrashing your L1). Also note that Sandy Bridge has a uop cache that may add further caching (but also more complexity).

Leeor