I'm unsure how much of an impact the distance of the jump instruction has on its performance. Let's say it's called every 10 milliseconds. What would the difference be in performance, between say a jump 100 or 1000 bytes forward and say 7.5 million bytes back?

swaggg
    Likely the dominating factor is not the distance but whether the target code section is in the cache. – 500 - Internal Server Error Oct 14 '21 at 15:12
  • Oh, most definitely. Still, that's not what the question is about. – swaggg Oct 14 '21 at 15:32
  • Jump instructions don't really have latency; branch prediction + speculative execution hide branch latency in pipelined out-of-order exec CPUs, which all modern x86 CPUs are. Do you actually mean branch latency on an in-order pipeline like Atom?? The number of cycles from execution of a `jmp` to when the execution of the branch target can happen? I think that's 1 cycle, because of branch prediction in the fetch stage. Or are you just mis-using "latency" to talk about "cost in cycles", [which isn't a thing on modern CPUs](https://stackoverflow.com/a/44980899/224132). – Peter Cordes Oct 14 '21 at 20:37
  • Thanks for your input, Peter. Why would branch prediction apply to an unconditional jump? I don't think I'm "misusing" :P the term "latency", but either way the question was very simple. In an isolated environment, or considered separately, what impact on performance does the distance of the jump have, taking into account all the factors. Whether speculative execution etc makes it difficult to pinpoint the exact duration doesn't seem to be that relevant; the question was about the distance as a single factor. – swaggg Oct 14 '21 at 22:09

1 Answer


The distance by itself has no effect. A jump consists of loading a new address into the program counter (IP/EIP/RIP on x86), or in case of a relative jump, adding the desired displacement to the program counter. Load and add are both constant-time operations whose speed does not depend on the value(s) involved.
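To make that concrete, here is a sketch of the arithmetic a relative jump performs; the addresses are made up for illustration:

```python
# Sketch of what a relative jmp does to the program counter.
# The addresses below are hypothetical, chosen only for illustration.
jmp_address  = 0x800000     # where the jmp instruction itself sits
jmp_length   = 5            # 0xE9 form: 1 opcode byte + 4 displacement bytes
displacement = -7_500_000   # "7.5 million bytes back", as in the question

# Relative displacements are measured from the end of the jmp instruction,
# so the new program counter is just one addition away.
new_rip = jmp_address + jmp_length + displacement
print(hex(new_rip))  # 0xd8f25
```

The addition takes the same time whether `displacement` is 100 or -7,500,000.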

There might be a slight effect if the distance means using a longer or shorter encoding. For instance, x86 in 32- or 64-bit mode has two encodings for relative jump: opcode 0xEB with an 8-bit displacement (total size 2 bytes, so-called "short jump"), and opcode 0xE9 with a 32-bit displacement (total size 5 bytes). Your 100-byte jump could use the shorter form and thus use 3 fewer bytes of code, which will tend to be slightly faster to fetch and leave more space for other code to fit in cache. The longer jumps would need the longer form.
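A small sketch of an encoder makes the size difference visible; the helper name is made up, but the opcodes and displacement widths are the real x86 ones:

```python
import struct

def encode_rel_jump(displacement):
    """Pick the shortest x86 relative-jump encoding for a signed displacement.
    (Illustrative helper; displacement is measured from the end of the jmp.)"""
    if -128 <= displacement <= 127:
        # "Short jump": opcode 0xEB + signed 8-bit displacement (2 bytes total)
        return bytes([0xEB]) + struct.pack('<b', displacement)
    # "Near jump": opcode 0xE9 + signed 32-bit displacement (5 bytes total)
    return bytes([0xE9]) + struct.pack('<i', displacement)

# 100 bytes forward fits the 2-byte short form...
print(encode_rel_jump(100).hex())         # eb64
# ...but 7.5 million bytes back needs the 5-byte near form.
print(encode_rel_jump(-7_500_000).hex())  # e9208f8dff
```

So the only distance-dependent cost is those 3 extra bytes of code, not the jump itself.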

The CPU then fetches its next instruction from the new address and continues execution. The whole point of random-access memory is that (ignoring caching) any part of it may be accessed in the same amount of time. Fetching bytes from an address 100 bytes away is no different than fetching from an address 100 megabytes away. It's not like a disk drive that must mechanically move a head to a new physical position, which may take longer if the distance is greater, nor a tape drive that has to traverse all the tape between the current position and the desired one. So there's no essential difference there either.

Of course, caching effects do come into play. Since modern CPUs do prefetching, an address that's a short positive distance away may have a better chance of already having been loaded into cache. On the other hand, with branch prediction and speculative execution, the CPU may have seen the jump instruction coming and started caching and fetching the instructions on the other side. It may make more of a difference which region of memory has been accessed more recently. (10 ms is not very recent - it's practically forever on a CPU timescale. Indeed, any instruction that you execute only once every 10 ms is, for practical purposes, so rare that there's not much need to even think about its performance.)

Nate Eldredge
  • Thanks for the in-depth explanation! Much appreciated. – swaggg Oct 14 '21 at 17:26
  • Fun fact: on Silvermont-family, jumping across a 4GiB boundary ([or jumping more than 4GiB away?](https://stackoverflow.com/a/18279617/224132) Not sure which) is slower than jumping nearby. If it's jumping more than 4GiB away which is slow, that's only possible via indirect branches. (So on that old CPU, it's best to map shared libraries near the main executable's code.) @swaggg – Peter Cordes Oct 14 '21 at 21:18
  • On all CPUs, staying within the same 1G or at least 512G region can make page walks faster on a TLB miss, since the page walker caches upper levels of the page tables. Staying within the same part of that radix tree has an advantage. (But only on TLB miss; with hot TLBs then yeah hit-time is constant and it only matters whether the code is hot in whatever level of cache or not.) – Peter Cordes Oct 14 '21 at 21:23