I am reading the Intel architecture documentation, Vol. 3, section 8.1.3:

> Self-modifying code will execute at a lower level of performance than non-self-modifying or normal code. The degree of the performance deterioration will depend upon the frequency of modification and specific characteristics of the code.

So, if I respect the rules:

(* OPTION 1 *)
Store modified code (as data) into code segment;
Jump to new code or an intermediate location;
Execute new code;

(* OPTION 2 *)
Store modified code (as data) into code segment;
Execute a serializing instruction; (* For example, CPUID instruction *)
Execute new code;

AND modify the code once a week, I should only pay the penalty the next time the modified code is about to be executed. After that, the performance should be the same as for unmodified code (plus the cost of a jump to that code).
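In GNU C, OPTION 2 might look roughly like this (just a sketch under my assumptions: x86-64, `buf` already points into a page mapped readable+writable+executable, and `store_and_run` / `new_code` are illustrative names):

```c
#include <string.h>

/* Sketch of OPTION 2: store the new code as data, execute a serializing
   instruction (CPUID), then execute the new code. */
void store_and_run(unsigned char *buf, const unsigned char *new_code, size_t len)
{
    memcpy(buf, new_code, len);    /* store modified code (as data) */

    unsigned a = 0, b, c = 0, d;   /* CPUID leaf 0 serializes the pipeline */
    __asm__ volatile("cpuid"
                     : "+a"(a), "=b"(b), "+c"(c), "=d"(d)
                     :: "memory");

    ((void (*)(void))buf)();       /* execute new code */
}
```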

Is my understanding correct?

Amy Lindsen

2 Answers

"The next time" is probably not the case; caching algorithms take into account accesses beyond the first one (not doing so would be rather naive). However, soon after the first few accesses the penalty should be gone. ("Few" might be two or thousands, but for a computer even a million is nothing.)

Even the code that is currently executing was written into memory at some point (perhaps even recently due to paging), so it experiences similar penalties initially, but that quickly dies down too, so you need not worry.

user541686
  • @AmyLindsen: Glad it helped! :) – user541686 Dec 01 '15 at 10:14
  • turning these comments into an answer, since they ended up being a near-complete answer. – Peter Cordes Dec 01 '15 at 11:57
  • @PeterCordes: Note that the question said *"modify the code once a week"*... I wasn't sure if she intended it to be a *self-modification* per se; I thought she was concerned in general about modifying code that was already in cache/memory. But yeah, if she intended it to be modifying code that was currently executing then you're right. – user541686 Dec 01 '15 at 20:22
  • @Mehrdad: yup, I wondered the same thing. Like maybe this was a program that got re-compiled and re-execced once a week. – Peter Cordes Dec 01 '15 at 23:30

There's a difference between code that's simply not yet cached and code that modifies instructions that are already speculatively in flight (fetched, maybe decoded, maybe even sitting in the scheduler and re-order buffer of the out-of-order core). Writes to memory that the CPU is already treating as instructions force it to fall back to very slow operation; this is what's usually meant by self-modifying code. Avoiding this slowdown, even when JIT-compiling, is not hard: just don't jump to your buffer until after it's all written.
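A minimal sketch of that write-everything-then-jump pattern, assuming x86-64 Linux with POSIX `mmap`/`mprotect` (the six code bytes, encoding `mov eax, 42; ret`, are purely illustrative):

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    /* x86-64 machine code for:  mov eax, 42 ; ret */
    static const unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };

    /* 1. Allocate a writable, not-yet-executable page. */
    unsigned char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    /* 2. Store the new code as data.  Nothing in this page can be
          in flight as instructions yet, so no SMC machine clear. */
    memcpy(buf, code, sizeof code);

    /* 3. Only now make the page executable and jump to it. */
    if (mprotect(buf, 4096, PROT_READ | PROT_EXEC) != 0)
        return 1;
    int (*fn)(void) = (int (*)(void))buf;
    printf("%d\n", fn());   /* prints 42 */

    munmap(buf, 4096);
    return 0;
}
```

Keeping the buffer non-executable while the stores happen guarantees none of those bytes are being looked at as instructions when they're written.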

Modified once a week means you might pay a one-microsecond penalty once a week, if you do it wrong. It's true that frequently-used data is less likely to be evicted from the cache (that's why reading something multiple times makes it more likely to "stick"), but the self-modifying-code pipeline flush should only apply the very first time, if you encounter it at all. After that, the cache lines being executed are probably still hot in the L1 I-cache (and the uop cache) if the second run happens without much intervening code, and they're no longer in a Modified state in the L1 D-cache.

I forget whether http://agner.org/optimize/ talks about self-modifying code and JIT. Even if not, you should read Agner's guides if you're writing anything in asm. Some of the advice in the main "Optimizing Assembly" guide is getting out of date, though, and not really relevant for Sandy Bridge and later Intel CPUs: alignment / decode issues matter less thanks to the uop cache, and alignment issues can differ for the SnB microarchitecture family.

Peter Cordes
  • I added a +1 for you (it will count once I have enough rep). So one microsec/week is absolutely acceptable. I read Agner's assembly guide already; very valuable indeed. The code change will happen when I can take control and the load on the machine is reduced. Good to know, Peter :) – Amy Lindsen Dec 01 '15 at 17:09
  • @AmyLindsen if this answer is helpful it's totally okay to unaccept mine so you can accept it :) +1 – user541686 Dec 01 '15 at 20:19
  • @Mehrdad: Are you sure about it? I wouldn't want to upset you, Mehrdad. A virtual kiss in exchange, then? :) – Amy Lindsen Dec 01 '15 at 21:22
  • @AmyLindsen: Yes I'm definitely sure about it haha :) – user541686 Dec 01 '15 at 21:49
  • Something I'm struggling to understand is how many different performance categories there are. You mention, for example, "code that's simply not yet cached" vs "code that modifies instructions that are already speculatively in-flight", but those categories don't seem complete: what about code that is cached in L1I, but not "in flight"? When things like SMC clears and "1K subpages" are discussed, does that apply to the tougher "in flight" case or the cached case? What are the penalties for the cached-but-not-in-flight case? What happens at the uop cache level? – BeeOnRope Oct 11 '17 at 18:58
  • Agner does mention SMC in **Special Topics: 17.10 Self-modifying code (All processors)**, but the treatment is brief. The only numbers given are for older architectures: _The penalty for executing a piece of code immediately after modifying it is approximately 19 clocks for P1, 31 for PMMX, and 150-300 for PPro, P2, P3, PM._ No definition of "immediately after modifying" is given, which is kind of the problem I am discussing. If you wait "long enough" before executing the modified code, do you still pay this cost? – BeeOnRope Oct 11 '17 at 19:08
  • @BeeOnRope: [According to Andy Glew](https://stackoverflow.com/questions/17395557/observing-stale-instruction-fetching-on-x86-with-self-modifying-code/18388700#18388700), modern Intel CPUs snoop the pipeline based on physical address (which ends up giving a stronger guarantee than older CPUs, without needing a jump anymore). L1I is coherent with other caches, so code writes will simply invalidate L1I. Intel's optimization manual has some specifics about writing close to the current EIP/RIP; I think writing within the same page is to be avoided, or maybe within 2k. (This answer is a bit sloppy :/) – Peter Cordes Oct 11 '17 at 20:29
  • Yes, I have read all the documentation, but the costs still aren't clear (btw, the physical/virtual and `jmp` stuff is more or less orthogonal IMO, since it deals mostly with correctness). For example, there is clearly a cost to SMCing code that is "about to be executed" in terms of a machine clear. If the code is not "about to be executed" but still cached, is there the same cost, a lower cost, or almost no cost? The L1I is somehow "coherent" with the rest of the caches - but at what cost? Clearly it is not _common_ for writes to L1D to hit lines in L1I, so the mechanism might be slow. – BeeOnRope Oct 11 '17 at 20:32
  • BTW, I read all Intel's stuff. A lot of that language has been around for a while, so it isn't even clear how much applies only to older archs (Intel doesn't have a great track record at updating the entire massive guide on every arch to make clear what has changed - it is full of stale advice that would plainly be read as applying up until today). There is this tidbit: _Dynamic code need not cause the SMC condition if the code written fills up a data page before that page is accessed as code._ That seems to indicate that at least this "SMC condition" is fairly easy to trigger... – BeeOnRope Oct 11 '17 at 20:37
  • ... since the conditions for avoiding it are stricter than just not writing to the same 1KB/2KB subpage as the _currently executing_ code. Mostly I'm interested to know how to execute _dynamically generated code_ with a minimum of overhead. This isn't strictly SMC, since the code is freshly generated from scratch, not a modification of existing code; so one approach is just to never re-use memory at all: always write code to fresh pages, then execute it. That might be slower than re-using the same pages, however, due to caching - which is where "SMC-but-not-really" comes in. – BeeOnRope Oct 11 '17 at 20:39
  • @BeeOnRope: It's not common to hit, but it does *always* have to snoop, and to have L1I participate in MESIF. I guess it's possible L1I might be slower than usual to respond to MESIF Invalidate requests from other caches. I really don't know. Agreed that Intel's stuff is hard to sort out because of all the stale info. – Peter Cordes Oct 11 '17 at 20:39
  • @Bee: I'd guess that actual high-performance JIT engines have figured out something that works without causing a lot of SMC machine clears. I wonder if there are useful comments in the source of OpenJDK or something? – Peter Cordes Oct 11 '17 at 20:41
  • Well, for modern JIT engines this isn't actually too important, since they (a) usually execute the compiled code many times, so any generation costs are amortized, and (b) keep _most_ or at least a large fraction of the code they generate around indefinitely, so recycling isn't as much of a concern. Most just `mmap` and `munmap`, so they never have to overwrite code, and this works fine for them. – BeeOnRope Oct 11 '17 at 21:27
  • @BeeOnRope: Ah right. I guess it would only come up in more specialized use-cases like JITing a loop to inline different loop-invariants every time, maybe with LLVM used manually. – Peter Cordes Oct 11 '17 at 21:29