Summary: I don't think this will be any worse than you'd predict (no special penalties due to this corner case), but it's still usually not worth doing, because of split caches / TLBs and few other benefits. To optimize for the cold-cache case, consider using immediate data (e.g. store the LUT to the stack before use, or there's a lot you can do with an immediate bitmap and a bt instruction).
Ideally, you can place your constants with others that are used by code that often runs before or after this code. Compilers can use profile-guided optimization to find "hot" data and cluster it together (at least Intel's VTune help suggests that this can help reduce overall dTLB miss rate).
Possible benefits: L2 cache hit for data loads, or at least DRAM-page locality if the function isn't tiny and the caches were cold to start with.
The main downside is cache/TLB efficiency. Data in code lines / pages is pollution of the L1I cache and iTLB, and code in data lines / pages is pollution of the L1D cache and dTLB.
The first rule of caching is that caches work. Code that runs frequently (and its data) will often be hot in cache. Code that doesn't run frequently is often not important for performance. Trying to optimize for the worst case this way could just end up making the best case less likely (more L1I / L1D misses and/or more TLB misses from including more lines/pages in both code and data footprints).
L2 and outer caches are unified, but L1 is split, and so are the L1 TLBs on any sane microarchitecture, for multiple reasons (physical proximity on chip to the front-end or execution units, total number of read/write ports, etc.), but especially on x86, where there is near-zero overlap between code and data in compiler-generated code. All modern x86 designs also use an L2TLB that handles misses from either the L1iTLB or the L1dTLB. (Intel's optimization manual calls it the STLB, for Second-level TLB.)
But unlike the L2 cache, I think Intel's STLB is a victim cache for both the iTLB and dTLB. (I don't remember where I read this, and can't find a source for it, though.) If my memory is correct, an L1TLB miss that hits in the STLB exchanges entries, so nothing is evicted or duplicated. On a miss in both levels, a page-walk only loads the L1iTLB with the new entry. (I think the evicted entry goes into the STLB, and the LRU entry from that set in the STLB is evicted.)
Thus, if I'm right about TLB behaviour on Intel CPUs, the dTLB miss from movzx eax, byte [lut + eax] will miss in the STLB (if caches were cold to start with), triggering another page walk even though the same page must already be hot in the iTLB (the load instruction itself was fetched from that page). At least the page-table entries will be hot in L1D cache, and in any internal page-walker caches.
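For concreteness, here's a minimal NASM-style sketch of the layout being discussed (hypothetical names; 64-bit code, so the table address is materialized with a RIP-relative lea, since RIP-relative addressing can't also use an index register):

section .text
lut_lookup:                        ; hypothetical: uint8_t lut_lookup(unsigned idx), SysV x86-64
    lea    rcx, [rel lut]          ; RIP-relative address of the data below
    movzx  eax, byte [rcx + rdi]   ; data load from the same page (maybe same line) as the code
    ret
lut:                               ; read-only table living in the code page, right after the ret
    db 0, 1, 1, 0, 1, 0, 0, 1      ; made-up contents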
It would be possible to test this behaviour with code that jumps from page to page, loading itself as data, e.g. repeating this block: here: mov eax, [rip+0] / jmp here+4096 / align 4096. Then look at perf counters for STLB misses from data loads (not code-fetch). This would make code/data locality in terms of 4k or 2M pages much less valuable than it otherwise would be, but still no worse than totally separate (except for the pollution issue: that space could instead have held useful code, reducing the total number of code pages touched).
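A rough NASM sketch of such a test (untested; the repeat count and names are arbitrary, and a macro-local label is just one way to stamp out identical per-page blocks):

; each block loads from the page it is executing in, then jumps to the
; identical block on the next 4 KiB page
%macro PAGE_HOP 0
%%here:
    mov   eax, [rel %%here]    ; demand data load from the current code page
    jmp   %%here + 4096        ; tail-jump to the copy of this block one page later
    align 4096                 ; pad so each copy occupies exactly one page
%endmacro

section .text
global page_hop_test
align 4096
page_hop_test:
%rep 64
    PAGE_HOP
%endrep
    ret                        ; the last block's jump lands here

Counting something like dtlb_load_misses.miss_causes_a_walk (exact event names vary by microarchitecture) over a run of this would show whether the code-fetch TLB fills primed anything the data loads can reuse.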
If your function+data isn't all contained in the same cache line, a load early in the function could result in outstanding misses (to L2) for the same line from both L1D (demand load) and L1I (speculative code fetch). I don't know if there's any problem with that on any x86 uarches. I'd guess it's probably no worse than usual, and hopefully better than outstanding misses for two different lines. I'd guess that hardware doesn't trigger any kind of slow corner-case handling for this, but I haven't tested.
If the end of the function+data is on the next page, you could even get an iTLB and a dTLB miss in parallel for the same page, from a demand load plus a code-fetch.
However, I think data loaded from the same cache line as the current code will typically hit in L2 (although it's possible that it's still hot in L1I but evicted from L2, and possibly even from L3 on CPUs like Skylake-AVX512 where L3 isn't inclusive). This may sometimes be worth the inflation of both the data and code working sets that comes from mixing the two in the same line.
Non-x86:
ARM compilers (and assemblers, for ldr r0, =constant pseudo-instructions) use literal pools to load constants larger than 16 bits with small PC-relative displacements. I think these often end up on the same page, and sometimes in the same cache line, as code. Obviously ARM microarchitectures are designed to run such code efficiently (except for the wasted I-cache / D-cache space, which is unavoidable). But presumably the code-size / instruction-count benefit is usually worth it. I'm not sure why this is common on ARM but not other RISC ISAs (at least I think it's not common on MIPS / PowerPC). Modern ARM has good support for creating arbitrary 32-bit constants with 2 instructions, and many bit-patterns can be created with a single instruction (using the barrel shifter with an immediate mov or mvn).
But there's no reason to expect that x86 microarchitectures take any special care to handle such cases more efficiently than they would by default, because this pattern is not common in x86 code. It's not used by compilers, and the only RIP-relative addressing mode uses a rel32 displacement, so there isn't even a code-size advantage from placing data very near code. Only the locality benefit for L3/L2 (and DRAM pages) remains.
That doesn't mean we should expect it to be slow, only that we can't infer x86 behaviour from the fact that ARM CPUs need to support it efficiently. ARM CPUs that do use / support paging may have a TLB allocation policy that favours this pattern, like allocating an L2TLB entry on an iTLB miss. (If they use a multi-level TLB at all).
For example, can the instruction fetch mechanism of the CPU get confused by fetching beyond the ret in my example and trying to interpret the lookup table as (nonsense) instructions?
Speculative execution beyond a ret is usually not a problem, it seems.
I tried to test this once (with a fairly poor test that I didn't put much effort into), but I couldn't find any effect. Perhaps I didn't have a big enough test to defeat branch prediction for the ret, or speculative execution doesn't continue beyond ret the way it does for other indirect jumps. (Even if a call instruction is the first instruction of a function, and the callee is contiguous with that function, the correct return address is after the call, not the call itself again. So speculative execution of the instructions following a ret can only be useful in cases where something put a fake return address on the stack.)
If the last instruction before your data was an indirect jmp, then it would make sense to worry about how the data decodes as instructions. You can block speculative execution down the fall-through path by placing an int3 or ud2 after it, before your data, as recommended by Intel's optimization manual (a sketch follows the quoted section below):
3.4.1.6 Branch Type Selection
blah blah fall-through is the fall-back default prediction for indirect jumps (but they don't mention ret). Bogus instructions can slow down branch recovery.
Also, data immediately following indirect branches may appear as branches to the branch predication hardware, which can branch off to execute other data pages. This can lead to subsequent self-modifying code problems.
Assembly/Compiler Coding Rule 14. (M impact, L generality) When indirect branches are present, try to put the most likely target of an indirect branch immediately following the indirect branch. Alternatively, if indirect branches are common but they cannot be predicted by branch prediction hardware, then follow the indirect branch with a UD2 instruction, which will stop the processor from decoding down the fall-through path.
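A minimal sketch of that advice applied to data placed after an indirect jump (hypothetical names; non-PIC, NASM syntax):

dispatch:                            ; assume rdi is an already range-checked index
    lea    rax, [rel jump_table]
    jmp    [rax + rdi*8]             ; indirect jump through the table
    ud2                              ; stop the decoders from running into the data below
jump_table:                          ; read-only data living in the code page
    dq case0, case1
case0:
    xor    eax, eax
    ret
case1:
    mov    eax, 1
    ret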
Read-only access should be fine for the out-of-order pipeline itself. Only write accesses near (maybe within 2k or 4k of) EIP/RIP cause self-modifying-code machine nukes / pipeline clears. (So obviously do not use this for non-const static data, which you normally can't anyway, because code pages are normally mapped read/exec but not write.)
If your LUT is small enough, use it as immediate data instead of a load
If cold-cache performance is important, you can store your LUT to the stack with a couple of mov r64, imm64 / mov [m64], r64 instructions (or maybe mov r/m32, imm32).
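A sketch of that, assuming a 16-byte LUT with made-up contents, the SysV x86-64 calling convention, and a leaf function that can use the 128-byte red zone below rsp:

lut_lookup_imm:                        ; hypothetical: uint8_t f(unsigned idx), idx < 16
    mov    rax, 0x0706050403020100     ; low 8 LUT bytes as an immediate
    mov    rcx, 0x0f0e0d0c0b0a0908     ; high 8 LUT bytes
    mov    [rsp-16], rax               ; materialize the LUT in the red zone
    mov    [rsp-8],  rcx
    movzx  eax, byte [rsp-16 + rdi]    ; index into the freshly stored table
    ret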
Immediate bitmaps are great to set up for a bt instruction. As @Ross points out, you could do:
mov eax, 0x5555
bt eax, edi
setc al
Or with the bitmap as an immediate operand to a test instruction:
xor eax, eax
bts eax, edi ; 1U << arg
test eax, 0x5555
setnz al        ; test clears CF, so check ZF: non-zero means the selected bit was set
Compilers will use this trick for switch when a lot of case labels all run the same code, like in this case (on Godbolt with gcc and clang).
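Roughly the shape such a switch can compile to (a hand-written, hedged sketch rather than actual gcc/clang output; the bitmap value and names are made up):

in_grouped_cases:                 ; returns 1 if edi is one of the case labels sharing a body
    cmp    edi, 31
    ja     .no                    ; out-of-range values take the default path
    mov    eax, 0x00015505        ; hypothetical bitmap: bit n set if case n is in the group
    bt     eax, edi
    setc   al
    movzx  eax, al
    ret
.no:
    xor    eax, eax
    ret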
Another example: a vowel/consonant bitmap in a code-golf answer for classifying strings according to whether their vowel/consonant pattern is palindromic.
In truly hot functions, it is often better to load instead of using mov-immediate, especially if it saves multiple mov instructions. But even saving one fused-domain uop by using a memory operand for an ALU instruction can be worth it, so there is a tradeoff between cold-cache and hot-cache performance. (But never use bt with a memory operand: its performance is garbage because of the crazy-CISC semantics for indexing a bit-string, instead of wrapping to the dword or qword selected by the addressing mode the way it wraps with a register destination.)
Or simply compute instead of using a LUT at all: test eax,eax / setp al works because parity (of the low byte only) is supported in hardware on x86. Other architectures with hardware popcnt could use that and take the low bit for even / odd parity.
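For example, a branchless byte-parity function using the PF flag (hypothetical name; assumes the argument's low byte is in dil per the SysV convention):

has_even_parity:              ; returns 1 if the low 8 bits of edi have an even number of set bits
    test   dil, dil           ; PF is computed from the low byte of the result
    setp   al                 ; PF = 1 means an even count of 1 bits
    movzx  eax, al
    ret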
But other problems can save a lot of work with a LUT or a small vector constant. (Maybe compressed for loading with a broadcast-load like movddup or vpbroadcastd, or a per-element expansion like pmovsx/pmovzx.)