See also the performance links in the x86 tag wiki, especially Agner Fog's microarch pdf and his Optimizing Assembly guide.
Unless decode / frontend effects come into play, they're all basically equal because of out-of-order execution. (Otherwise it depends on surrounding code, and is different for different microarchitectures.)
They all have the same amount of parallelism: two chains (an independent `mov` with no inputs, and a `bsf` with one input), plus a dependent `cmov`. It's small enough that it's trivial for out-of-order execution to find this parallelism. If you care about in-order Atom, then either way the `bsf` and `mov` can probably pair.
Any difference will depend on surrounding code.
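To make the two chains concrete, here is a minimal C sketch of the operation these snippets appear to compute. The function name `first_set_or_minus1` is hypothetical, `__builtin_ctzll` assumes GCC or Clang, and the instruction mapping in the comments is only the shape a compiler might produce, not guaranteed output.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: index of the lowest set bit, or -1 when x == 0.
 * Assumed mapping to the instructions discussed above:
 *   fallback -> mov  reg, -1   (no inputs, can execute at any time)
 *   idx      -> bsf  reg, x    (one input: x)
 *   select   -> cmov           (waits for both of the above)
 */
static int64_t first_set_or_minus1(uint64_t x)
{
    const int64_t fallback = -1;   /* chain 1: a constant with no dependencies */

    /* chain 2: depends only on x. __builtin_ctzll(0) is undefined,
     * so the zero case is guarded explicitly. */
    const int64_t idx = (x != 0) ? (int64_t)__builtin_ctzll(x) : 0;

    /* the select joins both chains, like the dependent cmov */
    return (x != 0) ? idx : fallback;
}

/* Small self-check; call it from main() if you want to run the sketch. */
static void first_set_self_check(void)
{
    assert(first_set_or_minus1(0) == -1);
    assert(first_set_or_minus1(8) == 3);
}
```

Nothing in the source order of `fallback` and `idx` matters for correctness, which is the same reason the instruction order barely matters to an out-of-order core.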
If I had to pick, I might choose #1a, because that reduces the chance of the `mov` stealing an execution port from `bsf`. `mov r64, imm32-sign-extended` can run on any port on most CPUs, but `bsf` usually can't. Putting instructions on the critical path ahead of insns that aren't reduces resource conflicts, at least outside of loops where non-critical instructions from the previous iteration can delay the critical path. (The `mov` is sort of on the critical path, but it has no input deps, so out-of-order execution can run it at any point after it's issued, probably before the instructions that produce `bsf`'s input.)
I'd probably use #1a over #1 to make that snippet use fewer registers for future-proofing. I'd use #1 if I had a specific use for starting a new dependency chain for some register, like if a later instruction had a false dependency on it, and the register's value depended on a long dependency chain (or a load which could cache miss). e.g. if I wanted to use an 8-bit or 16-bit register, or an output register for `popcnt`.
Speaking of which, `bsf` probably also has a false dependency on Intel CPUs. If the input value is 0, Intel CPUs leave the destination unchanged. (The ISA says the dest is undefined, but this is what Core2 actually does, for example. This requires a dependency on the destination register, as well as the source.) I suspect this is why `lzcnt` / `tzcnt` / `popcnt` have a dependency on the output register.
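A software model may make the dependency easier to see. This is only a sketch of the behaviour described above (destination left unchanged on a zero input), not something the ISA guarantees; `bsf_model` is a hypothetical name and `__builtin_ctzll` assumes GCC or Clang.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the observed bsf behaviour described above.
 * old_dst is the value the destination register held before the
 * instruction. Because it can become the result, the hardware has to
 * treat the destination as an input as well as an output. */
static uint64_t bsf_model(uint64_t old_dst, uint64_t src)
{
    if (src == 0)
        return old_dst;                     /* destination left unchanged */
    return (uint64_t)__builtin_ctzll(src);  /* index of the lowest set bit */
}

/* Small self-check; call it from main() if you want to run the sketch. */
static void bsf_model_self_check(void)
{
    assert(bsf_model(42, 0) == 42);   /* zero input: old value survives */
    assert(bsf_model(42, 16) == 4);   /* non-zero input: old value ignored */
}
```

The function takes two arguments even though only one is "interesting", which mirrors why the instruction has to wait for whatever last wrote its destination register.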
Speaking of false dependencies: fun fact, you can set a register to all-ones with fewer bytes of machine code by doing `or rdx, -1` (`or r64, imm8`), at the cost of a false dependency on the destination register. That's normally a bad idea; don't do this.
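For reference, the size difference can be checked from the encodings. The sketch below stores the machine-code bytes for both instructions as I'd expect an assembler to emit them (REX.W C7 /0 id for the `mov`, REX.W 83 /1 ib for the `or`); the array and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* mov rdx, -1 : REX.W + C7 /0 + imm32 (sign-extended to 64 bits) */
static const uint8_t mov_rdx_m1[] = {0x48, 0xC7, 0xC2, 0xFF, 0xFF, 0xFF, 0xFF};

/* or rdx, -1  : REX.W + 83 /1 + imm8 (sign-extended to 64 bits) */
static const uint8_t or_rdx_m1[] = {0x48, 0x83, 0xCA, 0xFF};

/* The or form produces all-ones whatever the old value was, but it
 * still has to read the old value first: that read is the false
 * dependency. */
static uint64_t or_all_ones(uint64_t old_rdx)
{
    return old_rdx | UINT64_MAX;
}

/* Small self-check; call it from main() if you want to run the sketch. */
static void encoding_self_check(void)
{
    assert(sizeof mov_rdx_m1 == 7);
    assert(sizeof or_rdx_m1 == 4);
    assert(or_all_ones(0x1234) == UINT64_MAX);
}
```

So the `or` form saves 3 bytes, and the price is that it can't execute until the previous writer of `rdx` has finished.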