Summary: I don't think this will be any worse than you'd predict (no special penalties due to this corner case), but it's still usually not worth doing, because of split caches / TLBs and few other benefits. To optimize for the cold-cache case, consider using immediate data (e.g. store the LUT to the stack before use, or there's a lot you can do with an immediate bitmap and a bt instruction).
Ideally, you can place your constants with others that are used by code that often runs before or after this code. Compilers can use profile-guided optimization to find "hot" data and cluster it together (at least Intel's VTune help suggests that this can help reduce overall dTLB miss rate).
Possible benefits: L2 cache hit for data loads, or at least DRAM-page locality if the function isn't tiny and the caches were cold to start with.
The main downside is cache/TLB efficiency. Data in code lines / pages is pollution of the L1I cache and iTLB, and code in data lines / pages is pollution of the L1D cache and dTLB.
The first rule of caching is that caches work. Code that runs frequently (and its data) will often be hot in cache. Code that doesn't run frequently is often not important for performance. Trying to optimize for the worst case this way could just end up making the best case less likely (more L1I / L1D misses and/or more TLB misses from including more lines/pages in both code and data footprints).
L2 and outer caches are unified, but L1 is split, and so are the L1 TLBs on any sane microarchitecture, for multiple reasons (physical proximity on chip to the front-end or execution units, total number of read/write ports, etc.), but especially on x86, where there is near-zero overlap between code and data in compiler-generated code. All modern x86 designs also use an L2TLB that handles misses from either the L1iTLB or the L1dTLB. (Intel's optimization manual calls it the STLB, for Second-level TLB.)
But unlike the L2 cache, I think Intel's STLB is a victim cache for both the iTLB and dTLB. (I don't remember where I read this, and can't find a source for it, though.) If my memory is correct, an L1TLB miss that hits in the STLB exchanges entries, so nothing is evicted or duplicated. On a miss in both levels, a page-walk only loads the L1iTLB with the new entry. (I think the evicted entry goes into the STLB, and the LRU entry from that set in the STLB is evicted.)
Thus, if I'm right about TLB behaviour on Intel CPUs, the dTLB miss from movzx eax, byte [lut + eax] will miss in the STLB (if caches were cold to start with), triggering another page walk even though the same page must already be hot in the iTLB (the load instruction itself was fetched from that page). At least the page-table entries will be hot in L1D cache, and in any internal page-walker caches.
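For concreteness, here's a minimal NASM-style sketch of the layout being discussed (hypothetical names; 64-bit code, so the table address is materialized with a RIP-relative lea, since RIP-relative addressing can't also use an index register):

section .text
lut_lookup:                        ; hypothetical: uint8_t lut_lookup(unsigned idx), SysV x86-64
    lea    rcx, [rel lut]          ; RIP-relative address of the data below
    movzx  eax, byte [rcx + rdi]   ; data load from the same page (maybe same line) as the code
    ret
lut:                               ; read-only table living in the code page, right after the ret
    db 0, 1, 1, 0, 1, 0, 0, 1      ; made-up contents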
It would be possible to test this behaviour with code that jumps from page to page, loading itself as data, e.g. repeating this block: here: mov eax, [rip+0] / jmp here+4096 / align 4096. Then look at perf counters for STLB misses from data loads (not code-fetch). This would make code/data locality in terms of 4k or 2M pages much less valuable than it otherwise would be, but still no worse than totally separate (except for the pollution issue: that space could instead have held useful code, reducing the total number of code pages touched).
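A rough NASM sketch of such a test (untested; the repeat count and names are arbitrary, and a macro-local label is just one way to stamp out identical per-page blocks):

; each block loads from the page it is executing in, then jumps to the
; identical block on the next 4 KiB page
%macro PAGE_HOP 0
%%here:
    mov   eax, [rel %%here]    ; demand data load from the current code page
    jmp   %%here + 4096        ; tail-jump to the copy of this block one page later
    align 4096                 ; pad so each copy occupies exactly one page
%endmacro

section .text
global page_hop_test
align 4096
page_hop_test:
%rep 64
    PAGE_HOP
%endrep
    ret                        ; the last block's jump lands here

Counting something like dtlb_load_misses.miss_causes_a_walk (exact event names vary by microarchitecture) over a run of this would show whether the code-fetch TLB fills primed anything the data loads can reuse.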
If your function+data isn't all contained in the same cache line, a load early in the function could result in outstanding misses (to L2) for the same line from both L1D (demand load) and L1I (speculative code fetch). I don't know if there's any problem with that on any x86 uarches. I'd guess it's probably no worse than usual, and hopefully better than outstanding misses for two different lines. I'd guess that hardware doesn't trigger any kind of slow corner-case handling for this, but I haven't tested.
If the end of the function+data is on the next page, you could even get an iTLB and a dTLB miss in parallel for the same page, from a demand load plus a code-fetch.
However, I think data loaded from the same cache line as the current code will typically hit in L2 (although it's possible that it's still hot in L1I but evicted from L2, and possibly even from L3 on CPUs like Skylake-AVX512 where L3 isn't inclusive). This may sometimes be worth the inflation of both the data and code working sets that comes from mixing the two in the same line.
Non-x86:
ARM compilers (and assemblers, for ldr r0, =constant pseudo-instructions) use literal pools to load constants larger than 16 bits with small PC-relative displacements. I think these often end up on the same page, and sometimes in the same cache line, as code. Obviously ARM microarchitectures are designed to run such code efficiently (except for the wasted I-cache / D-cache space, which is unavoidable). But presumably the code-size / instruction-count benefit is usually worth it. I'm not sure why this is common on ARM but not other RISC ISAs (at least I think it's not common on MIPS / PowerPC). Modern ARM has good support for creating arbitrary 32-bit constants with 2 instructions, and many bit-patterns can be created with a single instruction (using the barrel shifter with an immediate mov or mvn).
But there's no reason to expect that x86 microarchitectures take any special care to handle such cases more efficiently than they would by default, because this pattern is not common in x86 code. It's not used by compilers, and the only RIP-relative addressing mode uses a rel32 displacement, so there isn't even a code-size advantage from placing data very near code. Only the locality benefit for L3/L2 (and DRAM pages) remains.
That doesn't mean we should expect it to be slow, only that we can't infer x86 behaviour from the fact that ARM CPUs need to support it efficiently. ARM CPUs that do use / support paging may have a TLB allocation policy that favours this pattern, like allocating an L2TLB entry on an iTLB miss. (If they use a multi-level TLB at all).
For example, can the instruction fetch mechanism of the CPU get confused by fetching beyond the ret in my example and trying to interpret the lookup table as (nonsense) instructions?
Speculative execution beyond a ret is usually not a problem, it seems.
I tried to test this once (with a fairly poor test that I didn't put much effort into), but I couldn't find any effect. Perhaps I didn't have a big enough test to defeat branch prediction for the ret, or speculative execution doesn't continue beyond ret the way it does for other indirect jumps. (Even if a call instruction is the first instruction of a function, and the callee is contiguous with that function, the correct return address is after the call, not the call itself again. So speculative execution of the instructions following a ret can only be useful in cases where something put a fake return address on the stack.)
If the last instruction before your data was an indirect jmp, then it would make sense to worry about how the data decodes as instructions. You can block speculative execution down the fall-through path by placing an int3 or ud2 after it, before your data, as recommended by Intel's optimization manual (a sketch follows the quoted section below):
3.4.1.6 Branch Type Selection
blah blah fall-through is the fall-back default prediction for indirect jumps (but they don't mention ret). Bogus instructions can slow down branch recovery.
Also, data immediately following indirect branches may appear as branches to the branch predication hardware, which can branch off to execute other data pages. This can lead to subsequent self-modifying code problems.
Assembly/Compiler Coding Rule 14. (M impact, L generality) When indirect branches are present, try to put the most likely target of an indirect branch immediately following the indirect branch. Alternatively, if indirect branches are common but they cannot be predicted by branch prediction hardware, then follow the indirect branch with a UD2 instruction, which will stop the processor from decoding down the fall-through path.
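A minimal sketch of that advice applied to data placed after an indirect jump (hypothetical names; non-PIC, NASM syntax):

dispatch:                            ; assume rdi is an already range-checked index
    lea    rax, [rel jump_table]
    jmp    [rax + rdi*8]             ; indirect jump through the table
    ud2                              ; stop the decoders from running into the data below
jump_table:                          ; read-only data living in the code page
    dq case0, case1
case0:
    xor    eax, eax
    ret
case1:
    mov    eax, 1
    ret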
Read-only access should be fine for the out-of-order pipeline itself. Only write accesses near (maybe within 2k or 4k of) EIP/RIP cause self-modifying-code machine nukes / pipeline clears. (So obviously do not use this for non-const static data, which you normally can't anyway, because code pages are normally mapped read/exec but not write.)
If your LUT is small enough, use it as immediate data instead of a load
If cold-cache performance is important, you can store your LUT to the stack with a couple of mov r64, imm64 / mov [m64], r64 instructions (or maybe mov r/m32, imm32).
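A sketch of that, assuming a 16-byte LUT with made-up contents, the SysV x86-64 calling convention, and a leaf function that can use the 128-byte red zone below rsp:

lut_lookup_imm:                        ; hypothetical: uint8_t f(unsigned idx), idx < 16
    mov    rax, 0x0706050403020100     ; low 8 LUT bytes as an immediate
    mov    rcx, 0x0f0e0d0c0b0a0908     ; high 8 LUT bytes
    mov    [rsp-16], rax               ; materialize the LUT in the red zone
    mov    [rsp-8],  rcx
    movzx  eax, byte [rsp-16 + rdi]    ; index into the freshly stored table
    ret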
Immediate bitmaps are great to set up for a bt instruction. As @Ross points out, you could do:
mov eax, 0x5555
bt eax, edi
setc al
Or with the bitmap as an immediate operand to a test instruction:
xor eax, eax
bts eax, edi ; 1U << arg
test eax, 0x5555
setnz al        ; test clears CF, so check ZF: non-zero means the selected bit was set
Compilers will use this trick for switch when a lot of case labels all run the same code, like in this case (on Godbolt with gcc and clang).
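Roughly the shape such a switch can compile to (a hand-written, hedged sketch rather than actual gcc/clang output; the bitmap value and names are made up):

in_grouped_cases:                 ; returns 1 if edi is one of the case labels sharing a body
    cmp    edi, 31
    ja     .no                    ; out-of-range values take the default path
    mov    eax, 0x00015505        ; hypothetical bitmap: bit n set if case n is in the group
    bt     eax, edi
    setc   al
    movzx  eax, al
    ret
.no:
    xor    eax, eax
    ret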
Another example: a vowel/consonant bitmap in a code-golf answer for classifying strings according to whether their vowel/consonant pattern is palindromic.
In truly hot functions, it is often better to load instead of using mov-immediate, especially if it saves multiple mov instructions. But even saving one fused-domain uop by using a memory operand for an ALU instruction can be worth it, so there is a tradeoff between cold-cache and hot-cache performance. (But never use bt with a memory operand: its performance is garbage because of the crazy-CISC semantics for indexing a bit-string, instead of wrapping to the dword or qword selected by the addressing mode the way it wraps with a register destination.)
Or simply compute instead of using a LUT at all: test eax,eax / setp al works because parity (of the low byte only) is supported in hardware on x86. Other architectures with hardware popcnt could use that and take the low bit for even / odd parity.
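For example, a branchless byte-parity function using the PF flag (hypothetical name; assumes the argument's low byte is in dil per the SysV convention):

has_even_parity:              ; returns 1 if the low 8 bits of edi have an even number of set bits
    test   dil, dil           ; PF is computed from the low byte of the result
    setp   al                 ; PF = 1 means an even count of 1 bits
    movzx  eax, al
    ret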
But other problems can save a lot of work with a LUT or a small vector constant. (Maybe compressed for loading with a broadcast-load like movddup or vpbroadcastd, or a per-element expansion like pmovsx/pmovzx.)