
Dynamically generating code is a pretty well-known technique, used for example to speed up interpreted languages, domain-specific languages and so on. Whether you want to work low-level (close to 1:1 with assembly) or high-level, you can find libraries to help you out.

Note the distinction between self-modifying code and dynamically generated code. The former means that some code that has executed will be modified in part and then executed again. The latter means that some code that doesn't exist statically in the process binary on disk is written to memory and then executed (but will not necessarily ever be modified). The distinction might be important below, or simply because people treat self-modifying code as a smell but dynamically generated code as a great performance trick.

The usual use-case is that the generated code will be executed many times. This means the focus is usually on the efficiency of the generated code, and to a lesser extent the compilation time, and least of all the mechanics of actually writing the code, making it executable and starting execution.

Imagine, however, that your use case is generating code that will execute exactly once, and that this is straight-line code without loops. The "compilation" process that generates the code is very fast (close to memcpy speed). In this case, the actual mechanics of writing the code to memory and executing it once become important for performance.

For example, the total amount of code executed may be tens of GBs or more. Clearly you don't want to just write it all out to a giant buffer without any reuse: this would imply writing 10 GB to memory and perhaps also reading 10 GB (depending on how generation and execution are interleaved). Instead you'd probably want to use some reasonably sized buffer (say, one that fits in the L1 or L2 cache): write out a buffer's worth of code, execute it, then overwrite the buffer with the next chunk of code, and so on.
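
To make that concrete, here is a minimal sketch of the generate/execute/overwrite loop, assuming POSIX `mmap` and a hypothetical `emit_chunk()` generator (the buffer size, the RWX mapping, and each chunk ending in `ret` are all assumptions; real code might prefer a W^X scheme with `mprotect` plus error handling):

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical generator: writes up to `cap` bytes of straight-line code
 * ending in a `ret`, returns the number of bytes written, or 0 when done. */
size_t emit_chunk(unsigned char *buf, size_t cap);

void generate_and_run(void) {
    const size_t CAP = 32 * 1024;   /* roughly L1-sized; tune to taste */
    unsigned char *buf = mmap(NULL, CAP, PROT_READ | PROT_WRITE | PROT_EXEC,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return;

    for (;;) {
        size_t n = emit_chunk(buf, CAP);   /* overwrite the same buffer */
        if (n == 0)
            break;
        ((void (*)(void))buf)();           /* execute the chunk exactly once */
    }
    munmap(buf, CAP);
}
```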

The problem is that this seems to raise the spectre of self-modifying code. Although the "overwrite" is complete, you are still overwriting memory that was at one point already executed as instructions. The newly written code has to somehow make its way from the L1D to the L1I, and the associated performance hit is not clear. In particular, there have been reports that simply writing to the code area that has already been executed may suffer penalties of 100s of cycles and that the number of writes may be important.

What's the best way of generating a large amount of dynamically generated straight-line code on x86 and executing it?

BeeOnRope
  • Probably you'd want to aim for a fraction of L2 cache size, because it's a unified cache that L1I can hit in. It's probably not ideal if the code is in M state in L1D. Depending on how much data access the generated code does, there's probably a sweet spot like 1/2 or 1/4 of L2 so that most of the code you stored can be fetched from L2, without being evicted before you reach it. And hopefully *will* have been written-back from L1D; IDK how much this matters. It might even take a round-trip to L3 to sort this out. – Peter Cordes Oct 11 '17 at 22:55
  • I'd guess that generating large amounts is easier than only a small block, because writing more than L1D capacity means that write-back to L2 will already be done for the first instructions. (If you write a small loop, I wonder if `clwb` helps? But probably not, that forces write-back all the way to DRAM.) – Peter Cordes Oct 11 '17 at 22:58
  • I suspect the answer is not to generate 10GB of code. Either put a loop in there so it's much shorter or turn your code generator into an interpreter. – Ross Ridge Oct 12 '17 at 02:02
  • @RossRidge - well I'm not always generating 10 GB of code. More like an amount of code roughly proportional to the input problem size, kind of in the spirit of [this question](https://stackoverflow.com/q/17738154/149138) - although I'm generating `mov` instructions directly, not `call`. Imagine a case where, on an x86 platform without SIMD scatter (essentially everything in the wild other than a little bit of Skylake-X), you get an array of tightly packed lengths and want to do a series of stores at the offsets implied by the lengths - using SIMD to generate the load/store code seems reasonable. – BeeOnRope Oct 12 '17 at 02:51
  • @Bee: I'm leaning towards Ross's point that generating + executing the code only once is likely to be at least as expensive as just "interpreting" (i.e. not using dynamic code-gen at all). Even if you simplify the crap out of everything so you're only generating a few fixed-format instructions (because x86 machine-code encoding isn't exactly simple in the general case), you're still doing a store for every instruction you generate, and that's pure overhead. If you're going to reuse the generated code, it could be worth it. – Peter Cordes Oct 12 '17 at 04:31
  • @PeterCordes No, it would be a SIMD store of 32 bytes of instruction bytes at once. The instructions are fixed and the encodings hardcoded; only a bit of bit bashing is needed to "encode" the offset in each store. A store per instruction would definitely kill the idea though... – BeeOnRope Oct 12 '17 at 04:32
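
As a scalar illustration of the hardcoded-encoding idea in the comment above (the real version would patch several offsets per SIMD store), here is a sketch; the choice of `rbx` as the base register, `al` as the stored value, and the trailing `ret` are assumptions:

```c
#include <stddef.h>
#include <stdint.h>

/* Emit one fixed 3-byte instruction per offset: `mov [rbx + disp8], al`
 * encodes as 88 43 xx, and only the disp8 byte (xx) varies. The caller is
 * expected to point rbx at the base address and put the value in al. */
size_t emit_scatter_code(uint8_t *out, const uint8_t *offsets, size_t n) {
    size_t pos = 0;
    for (size_t i = 0; i < n; i++) {
        out[pos + 0] = 0x88;        /* MOV r/m8, r8 */
        out[pos + 1] = 0x43;        /* ModRM: mod=01 (disp8), reg=al, rm=rbx */
        out[pos + 2] = offsets[i];  /* the only byte that changes */
        pos += 3;
    }
    out[pos++] = 0xC3;              /* ret */
    return pos;
}
```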

2 Answers


I think you're worried unnecessarily. Your case is more like when a process exits and its pages are reused for another process (with different code loaded into them), which shouldn't cause self-modifying code penalties. It's not the same as when a process writes into its own code pages.

The self-modifying code penalties are significant when the overwritten instructions have been prefetched or decoded to the trace cache. I think it is highly unlikely that any of the generated code will still be in the prefetch queue or trace cache by the time the code generator starts overwriting it with the next bit (unless the code generator is trivial).

Here's my suggestion: Allocate pages up to some fraction of L2 (as suggested by Peter), fill them with code, and execute them. Then map the same pages at the next higher virtual address and fill them with the next part of the code. You'll get the benefit of cache hits for the reads and the writes but I don't think you'll get any self-modifying code penalty. You'll use 10s of GB of virtual address space, but keep using the same physical pages.
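
A minimal sketch of that aliasing scheme, assuming Linux `memfd_create` as the mechanism for mapping the same physical pages at successive virtual addresses (error handling and the code generator itself are omitted):

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

void run_chunks(size_t chunk_size, size_t nchunks) {
    /* One set of physical pages, reused for every chunk. */
    int fd = memfd_create("codegen", 0);
    ftruncate(fd, chunk_size);

    /* Reserve a large contiguous range of virtual addresses up front. */
    uint8_t *base = mmap(NULL, chunk_size * nchunks, PROT_NONE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    for (size_t i = 0; i < nchunks; i++) {
        /* A fresh, never-executed virtual address aliasing the same pages. */
        uint8_t *code = mmap(base + i * chunk_size, chunk_size,
                             PROT_READ | PROT_WRITE | PROT_EXEC,
                             MAP_SHARED | MAP_FIXED, fd, 0);
        /* ... fill `code` with the next chunk of instructions, ending in ret ... */
        ((void (*)(void))code)();   /* execute it once */
    }
}
```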

Use a serializing operation such as CPUID before each time you start executing the modified instructions, as described in sections 8.1.3 and 11.6 of the Intel SDM.
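
For example, a minimal way to issue that serializing `CPUID` from C with GCC/Clang inline assembly (leaf 0 is an arbitrary choice; the outputs are discarded):

```c
#include <stdint.h>

static inline void serialize_pipeline(void) {
    uint32_t a = 0, b, c = 0, d;
    __asm__ volatile("cpuid"
                     : "+a"(a), "=b"(b), "+c"(c), "=d"(d)
                     :
                     : "memory");
}
```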

prl
  • Are you talking about P4 (netburst) here? It's the only uarch with a trace cache. Sandybridge-family's decoded-uop cache is also virtually addressed, but it's *not* a trace cache. (Apparently the same code may get cached multiple times with multiple entry points to the same block, but as I understand it that usually sorts itself out because of the limit of 3 lines per 32B block. An unconditional branch always ends a uop cache line, and IIRC a line can only cache uops from instructions that are contiguous. So it doesn't trace through predicted-taken conditional branches either.) – Peter Cordes Oct 12 '17 at 02:20
  • About mapping the "same" pages - you mean using `mmap` on file-backed space, or something like `shmat`, to arrange that the new virtual mapping is backed by the same physical pages? Since cache indexing is effectively physical (L1 is VIPT, but that's effectively physical from a functionality point of view), I won't thrash the cache as much. I had thought of it, but I thought this would break the rules about SMC; after re-reading 11.6, though, it seems that it will work: the cache coherency check is on physical address and the prefetch/pipeline issues are avoided by a serializing instruction. – BeeOnRope Oct 12 '17 at 03:34
  • You'll get a TLB miss per-page, but that's probably fine. – BeeOnRope Oct 12 '17 at 03:34
  • @BeeOnRope: calling `mmap` is probably more expensive than one SMC machine-clear. It has to do significant work in the kernel to modify your page tables, as well as invalidating the relevant TLB entries. – Peter Cordes Oct 12 '17 at 03:54
  • @Peter, I was just quoting what's in the book. (I did at least check the latest instead of relying on my paper copy, like I usually do.) But I didn't read it carefully enough to realize that it doesn't say anything about current processors. Still, I'd expect that the behavior of the decoded uop cache wrt self-modifying code is similar to the trace cache, wouldn't you? – prl Oct 12 '17 at 04:34
  • @prl: not really for 2 reasons: First because SnB-family doesn't fall flat on its face when running from the decoders instead of uop cache. P4 had very low decode throughput, and was only good if the trace cache was hot. So invalidating traces was Bad. – Peter Cordes Oct 12 '17 at 04:37
  • Second: The trace cache packs instructions together in execution order, so a single trace might contain instructions from multiple pages. It's probably easy for a store near something in the trace cache to invalidate other parts of it. (i.e. trace caches are problematic for x86 because I-cache needs to be coherent with D-cache). But the uop cache is *not* a trace cache, and each "way" caches contiguous instructions from within a 32B chunk. It's much easier for HW to snoop the uop cache and only safely invalidate the instructions necessary without clobbering too much. – Peter Cordes Oct 12 '17 at 04:41
  • @Peter, by always using fresh virtual addresses, TLB invalidations aren't actually needed, though the module that does them may not know that. – prl Oct 12 '17 at 04:44
  • There might well be an SMC machine clear if stores hit an address that's actually being prefetched as code, though. Or maybe it only snoops later in the pipeline for instructions that are already in the OOO core? Or at least decoded? Anyway, I don't expect that the entire uop cache is a danger zone, although it might be. – Peter Cordes Oct 12 '17 at 04:44
  • You shouldn't assume that the CPU doesn't do some kind of negative caching, or caching of blocks of PDEs / PTEs inside the page-walk hardware. See [Andy Glew's write-up](https://sites.google.com/site/paulclaytonplace/andy-glew-s-comparch-wiki/caching-invalid-entries-in-tlb) about why it's not safe to make that kind of assumption about not needing TLB invalidation. And since a TLB invd for that entry might invalidate internal HW page-walker caches, it could make the resulting TLB miss when you jump to the block more expensive than if you didn't invalidate. So it's expensive or unsafe. – Peter Cordes Oct 12 '17 at 04:49
  • I'd guess that mmap picks expensive over unsafe. – Peter Cordes Oct 12 '17 at 04:50
  • @Peter, Intel processors don't cache not-present entries. The page you referred to is speculative. See 4.10.2.3. – prl Oct 12 '17 at 05:08
  • I'm quite sure that invalidation isn't needed when adding page-table entries (e.g., on `mmap` and/or a subsequent fault that brings in a page), except in special cases like overwriting an existing mapping. It's only `munmap` and other calls like `mprotect` with a reduction in permissions that need invalidation. That doesn't mean `mmap` is super cheap though. You still pay the price to set up the PTEs, at fault time or greedily if `MAP_POPULATE`. I would probably try to use 2 MB pages to cut down on the overhead. – BeeOnRope Oct 12 '17 at 05:18
  • Yeah, operating systems always get in the way; that's why I avoid them. Setting up a PTE is just a single qword write in my world. – prl Oct 12 '17 at 05:22
  • @prl: thanks, I hadn't realized Intel's current manuals gave strong guarantees that that would be safe in the future. 4.10.4.3 is also explicit about this: *If a paging-structure entry is modified to change the P flag from 0 to 1, no invalidation is necessary.* Anyway, neat trick if you're in ring0, but a Linux `mmap` system call + a TLB miss (for the new page) is definitely not cheap. – Peter Cordes Oct 12 '17 at 05:44
  • Thinking about it a bit more, I don't follow how the virtual mapping "trick" will help. If the penalty is really about already-in-pipeline SMC detection, then that would either break (because the wrong code really is in the pipeline or was not detected) or will suffer the same SMC penalty anyway if the hardware is [more conservative](https://stackoverflow.com/q/17395557/149138) than the spec. If the penalty is related to coherence between L1I and L1D and other involved caches, the penalty has to be paid in any case. Am I missing something? – BeeOnRope Oct 13 '17 at 00:36

I'm not sure you'll stand to gain much performance by using a gigantic amount of straight-line code instead of much smaller code with loops, since there's significant overhead in continually thrashing the instruction cache for so long, and the overhead of conditional jumps has gotten much better over the past several years. I was dubious when Intel made claims along those lines, and some of their statements were rather hyperbolic, but it has improved a lot in common cases. You can still always avoid call instructions if you need to for simplicity, even for tree recursive functions, by effectively simulating "the stack" with "a stack" (possibly itself on "the stack"), in the worst case.
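
As a rough sketch of that "simulate the stack with a stack" idea, here is a hypothetical tree-sum written with an explicit stack instead of recursive calls (the node type and fixed stack depth are assumptions):

```c
#include <stddef.h>

struct node { long value; struct node *left, *right; };

long sum_tree(struct node *root) {
    struct node *stack[64];              /* explicit stack replaces call frames */
    int top = 0;
    long total = 0;
    if (root)
        stack[top++] = root;
    while (top > 0) {
        struct node *n = stack[--top];
        total += n->value;
        if (n->left)  stack[top++] = n->left;   /* push instead of recursing */
        if (n->right) stack[top++] = n->right;
    }
    return total;
}
```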

That leaves two reasons I can think of that you'd want to stick with straight-line code that's only executed once on a modern computer: 1) it's too complicated to figure out how to express what needs to be computed with less code using jumps, or 2) it's an extremely heterogeneous problem being solved that actually needs so much code. #2 is quite uncommon in practice, though possible in a computer theoretical sense; I've just never encountered such a problem. If it's #1 and the issue is just how to efficiently encode the jumps as either short or near jumps, there are ways. (I've also just recently gotten back into x86-64 machine code generation in a side project, after years of not touching my assembler/linker, but it's not ready for use yet.)

Anyway, it's a bit hard to know what the stumbling block is, but I suspect that you'll get much better performance if you can figure out a way to avoid generating gigabytes of code, even if it may seem suboptimal on paper. Either way, it's usually best to try several options and see what works best experimentally if it's unclear. I've sometimes found surprising results that way. Best of luck!

Neil Dickson
  • Thanks Neil. The problem isn't jumps or loops per se, it's that it may be hard to reach maximum performance for a certain type of decoding operation using a loop: not because of the loop, but because an extra layer of indirection is needed, or conditional execution is too small. – BeeOnRope Oct 12 '17 at 03:37
  • From my comments on the OP, consider this case: Imagine a case where, on an x86 platform without SIMD scatter (essentially everything in the wild other than a little bit of Skylake-X), you get an array of tightly packed lengths and want to do a series of stores at the offsets implied by the lengths - using SIMD to generate the load/store code seems reasonable. – BeeOnRope Oct 12 '17 at 03:38
  • You can use SIMD to decode the offsets in either case (codegen or not), but now let's say you have a `ymm` vector of 8-bit offsets, and you want to store something to each of those 8-bit offsets (indexing from a common base address). How do you go about it? One way is to `movd` from that `ymm` into a GP register, and then `and rax, 0xFF` and `shr rax, 8` and so on to extract the address, and then do the actual store, `mov [rdx + rax], whatever`, but this costs a few instructions per store. You could try to store the `ymm` to memory and then read it out 1 byte at a time, but this costs an extra load (see the sketch after this thread). – BeeOnRope Oct 12 '17 at 03:41
  • Similarly if the `whatever` you are going to store is available in a SIMD reg, you need to propagate that to scalar code as well. On the other hand, you could just generate the `mov [rbx + offset], whatever` instruction directly, in a vectorized way (if whatever is a register this is only 3 bytes for a `disp8` offset) and then execute it, taking only one instruction. – BeeOnRope Oct 12 '17 at 03:43
  • (Sorry, replied just to your first comment so far.) Have you actually measured overhead from that? I would be very surprised if address computation added so much overhead over constant addresses. If the code isn't memory-bound, there probably aren't many address computations, and if the code is memory-bound, the address computation probably is masked by the memory access time. If you have found a case where the addressing overhead is significant, though, particularly more significant than the overhead of thrashing the instruction cache, I'd be interested to hear about it. – Neil Dickson Oct 12 '17 at 03:46
  • It's not the addressing overhead, it's the effective IPC limit of 4 on Intel x86 that you hit. Intel CPUs can do 1 store per cycle and (to a first order approximation), 4 total simple instructions per cycle. In the scenario described above you often hit the limit of 4 (when you consider the instructions needed to also load the value to store) before you hit the 1 store/cycle (i.e., you only have a "budget" of 3 instructions to do _everything_ else besides the store) limit. Yup, I've measured it. – BeeOnRope Oct 12 '17 at 03:50
  • What you've described doesn't appear to necessitate more than a few hundred bytes of code, in which case, that changes the tradeoffs significantly. If however, you're repeatedly generating, writing to memory, and then executing gigabytes of code only once, it'll likely hit some serious bottlenecks. – Neil Dickson Oct 12 '17 at 03:56
  • It requires a few bytes of code to decode a few bytes of input, yup. If you have gigabytes of input to decode, it will require gigabytes of code. Of course, the gigabytes of code don't need to actually be in memory at once, so "thrashing" isn't inherent to that approach: you can simply alternate between generating code of size N and then executing that same code, into the same buffer over and over, where N is appropriately sized to fit in some cache level. So the footprint is fixed even as the problem size grows. How to do this efficiently is the crux of the question. – BeeOnRope Oct 12 '17 at 04:02
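
For reference, a minimal sketch of the spill-and-reload "interpreter" alternative discussed in this thread (the stored value and the 32-offset batch size are assumptions); the cost is the extra 32-byte store, a reload per element, and a couple of instructions per scattered byte, which is where the roughly 4 instructions/cycle budget goes:

```c
#include <immintrin.h>
#include <stdint.h>

/* Spill the ymm of 8-bit offsets to memory, then do one scalar store per
 * element, indexing from a common base address. */
void scatter_bytes(uint8_t *base, __m256i offsets, uint8_t value) {
    uint8_t off[32];
    _mm256_storeu_si256((__m256i *)off, offsets);   /* the extra store */
    for (int i = 0; i < 32; i++)
        base[off[i]] = value;                       /* reload offset + store */
}
```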