
Which of the following two snippets of x86_64 code should be fastest? Or no difference at all?

; #1
    bsf    rax, rdi
    mov    rdx, -1
    cmove  rax, rdx

vs.

; #2
    mov    rdx, -1
    bsf    rax, rdi
    cmove  rax, rdx

(Or an alternative to #1, more economical with registers:

; #1a
    bsf    rax, rdi
    mov    rdi, -1
    cmove  rax, rdi

)

And yes, I know I should just benchmark them, but I don't have the tools, and due to a current long-term disabling illness I'm unable to set things up right now.

Peter Cordes
Cecil Ward
  • Questions like this are often best answered by simply writing the code and timing it over many iterations. – David Hoelzer Aug 15 '16 at 15:55
  • @DavidHoelzer: I disagree: microbenchmarking is hard, and it's easily possible that one version would look faster than the other for some unrelated reason. It's also easy to get it wrong when latency and throughput are different for a sequence. The microbench might test throughput, while the actual use is latency-sensitive. It's not a great question, but the answer is just "read Agner Fog's stuff", not "try to time it yourself". If you don't know the answer I gave, you won't be able to write a good microbenchmark other than by luck. – Peter Cordes Aug 15 '16 at 16:18

1 Answer


See also the performance links in the tag wiki, especially Agner Fog's microarch pdf and his Optimizing Assembly guide.


Unless decode / frontend effects come into play, they're all basically equal because of out-of-order execution. (Otherwise it depends on surrounding code, and is different for different microarchitectures.)

They all have the same amount of parallelism (two chains: independent mov (no inputs) and bsf (one input), plus a dependent cmov). It's small enough that it's trivial for out-of-order execution to find this parallelism. If you care about in-order Atom, then either way the bsf and mov can probably pair.
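
To make that concrete, the dependency structure is identical in #1, #1a and #2; only the register name and the textual order differ:

;   (whatever produces rdi) -> bsf rax, rdi ---\
;                                               >-> cmove rax, rdx   (rdi in #1a)
;   mov rdx, -1   (no inputs) -----------------/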

Any difference will depend on surrounding code.

If I had to pick, I might choose #1a, because that reduces the chance of the mov stealing an execution port from bsf. mov r64, imm32-sign-extended can run on any port on most CPUs, but bsf usually can't. Putting instructions on the critical path ahead of instructions that aren't reduces resource conflicts, at least outside of loops where non-critical instructions from the previous iteration can delay the critical path. (The mov is sort of on the critical path, but it has no input deps, so out-of-order execution can run it at any point after it's issued, probably before the instructions that produce bsf's input.)
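
To illustrate that last point, here's a hypothetical bit of surrounding code (the load producing rdi is my assumption, not part of the question; I've used the #1 register choice so the load result isn't overwritten):

; sketch only: assumes bsf's input comes from a load that might cache-miss
    mov    rdi, [rsi]      ; produces bsf's input; latency could be many cycles
    bsf    rax, rdi        ; has to wait for the load
    mov    rdx, -1         ; no input deps: can execute while the load is still in flight
    cmove  rax, rdx        ; waits for both bsf and the mov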

I'd probably use #1a over #1 to make that snippet use fewer registers, for future-proofing. I'd use #1 if I had a specific use for starting a new dependency chain for some register, like if a later instruction had a false dependency on it, and the register's old value depended on a long dependency chain (or a load which could cache-miss): e.g. if I wanted to write an 8-bit or 16-bit register, or use it as an output register for popcnt.
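
For instance (a sketch of my own, not from the question), if later code wants to run popcnt into rdx on an Intel CPU where popcnt has a false dependency on its destination, having rdx last written by the cheap independent mov makes that false dependency harmless:

; sketch: hypothetical follow-up code, assuming popcnt's output false dependency
    mov    rdx, -1         ; starts a fresh dependency chain for rdx
    bsf    rax, rdi
    cmove  rax, rdx
    ; ... later, unrelated code ...
    popcnt rdx, rcx        ; false dep on rdx now only couples to the cheap mov,
                           ; not to some long chain or cache-miss load that wrote rdx earlier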

Speaking of which, bsf probably also has an output dependency on Intel CPUs. If the input value is 0, Intel CPUs leave the destination unchanged. (The ISA says the destination is undefined in that case, but this is what Core 2 actually does, for example. Preserving the old value requires a dependency on the destination register, as well as on the source.) I suspect this is why lzcnt / tzcnt / popcnt have a (false) dependency on the output register.
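
That unchanged-on-zero behaviour is what enables a pattern like the following (a sketch relying on behaviour the ISA documents as undefined, so treat it as an observation about existing Intel hardware rather than guaranteed-portable code):

; sketch only: relies on bsf leaving the destination unchanged when the source is 0
    mov    rax, -1
    bsf    rax, rdi        ; rax = index of lowest set bit, or stays -1 if rdi == 0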

Speaking of false dependencies: fun fact, you can set a register to all-ones with fewer bytes of machine code by doing or rdx, -1 (or r64, imm8), at the cost of a false dependency on the destination register. It's normally a bad idea; don't do this.
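
For reference, the encodings I have in mind (NASM-style, 64-bit mode; worth double-checking against your assembler's listing output):

    mov    rdx, -1         ; 48 C7 C2 FF FF FF FF   7 bytes: mov r64, sign-extended imm32
    or     rdx, -1         ; 48 83 CA FF            4 bytes: or r64, imm8, but reads rdx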

Peter Cordes
  • I am familiar with Agner Fog's outstanding work. He's a hero, strongly recommended. – Cecil Ward Aug 17 '16 at 00:05
  • @PeterCordes would instruction order come into play if the code size > instruction fetch block size (I think 16 bytes on Skylake)? i.e. ```[16 byte aligned]; mov rdx, -1; [filler instructions to 16 bytes]; bsf rax, rdi; cmove rax, rdx``` would perform worse than ```[16 byte aligned]; mov rdx, -1; bsf rax, rdi; [filler instructions to 16 bytes]; cmove rax, rdx``` – Noah Dec 02 '20 at 16:32
  • @Noah: *possibly*, but still usually not. If latency of the critical path is the bottleneck (not front-end throughput), the front-end will typically have already issued instructions into the OoO back-end well ahead of where the critical path is executing (e.g. scheduler size of 97 uops in SKL, ROB of 224 uops). Also, most of the time code is running from the uop cache, not from legacy decode. (But blocks that only run once per whatever might always be cold in the uop cache every time they run. If you have a lot of such functions, tuning them for legacy fetch/decode could be relevant.) – Peter Cordes Dec 02 '20 at 19:46
  • @Noah: [Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths](https://stackoverflow.com/q/51986046) shows how OoO exec can overlap two long `imul` dep chains with no source / machine-code interleaving, up to a fairly large limit. (As long as you don't use `lfence` to block OoO exec.) – Peter Cordes Dec 02 '20 at 19:49