I have two 8-bit registers and have to check if one of them is 0.
My solution so far is:
cmp $0, %r10b
je end
cmp $0, %r11b
je end
Is there any other way to do it?
regards
Performance discussions in this answer are for recent Intel CPUs (Sandybridge, Haswell). Mostly applicable to at least as far back as Pentium M, or even earlier P6 (Pentium Pro / Pentium II). See http://agner.org/optimize/ for microarch docs. Performance considerations should be similar on AMD, except they don't macro-fuse test&branch instructions into a single macro-op the way Intel macro-fuses them into a single uop.
Branch predictors exist on every pipelined design, but are more important on something like Haswell than old pre-Silvermont Atom. Still, that part is pretty universal.
Small tweak to your version:
test %r10b, %r10b ; test is shorter than cmp with an immediate, but no faster
jz end
test %r11b, %r11b
jz end
Probably only one of the test/jz pairs will macro-fuse on Intel, because they'll probably both hit the decoders in the same cycle. Also, if either value was the output of an ALU op, it probably already set the zero flag. So arrange your code so that one of the branches doesn't need a separate test; see the sketch below.
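For instance, a minimal sketch, assuming %r10b was just produced by an ALU instruction (the sub below is a hypothetical stand-in) whose flags are still live:

sub %cl, %r10b      # an ALU op writing %r10b sets ZF from its result
jz end              # branch straight on those flags; no separate test
test %r11b, %r11b   # %r11b wasn't just written, so it still needs a test
jz end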
You can save a branch at the cost of an extra uop. Throughput of even not-taken branches can be a bottleneck in a really tight loop: Sandybridge can only sustain one branch per 1-2 cycles. So this idea might help:
test %r10b, %r10b
setz %r15b          # 1 if %r10b == 0, else 0
dec %r15b           # 0 if %r10b == 0, else 0xFF
test %r11b, %r15b   # ZF set iff %r10b == 0 or %r11b == 0
je end
This is one more instruction (though they're all single-uop instructions with 1-cycle latency). It adds more latency before the branch instruction can retire, increasing the mispredict penalty by 3 cycles, but it could increase performance:
If a && b is predictable, but it's unpredictable which of a or b will actually be zero, this can reduce the number of branch mispredicts. Benchmark / perf-counter test it, though: programmers are notoriously bad at guessing which branches in their code will be predictable. CPUs have a limited-size branch-history buffer, so using one fewer entry can help a tiny bit.
If latency isn't critical, just throughput (i.e. mispredicts are infrequent):
# mov %r10b, %al # uncomment this mov unless one of your values is already in %al
imul %r11b # Intel: 3 cycle latency, 1/cycle throughput.
# ZF is undefined, not set according to the result, unfortunately
test %ax, %ax # No imm16, so no Intel length-changing-prefix stall for a 16bit insn
je end
Total of 2 uops (test/je can macro-fuse, even on AMD). If you need to save the old value of %al, or you can't get one of your values in %al for free, then that's an extra mov.
If the upper bytes of your registers are zeroed, you might be able to gain speed: if you got your byte values into registers using byte operations, imul %r10d, %r11d would create a partial-register stall (or an extra uop to merge). If you wrote the full 32bit register (e.g. with movzx), then you can use the 2-operand form of imul and test the 32bit result. (The upper 16 will be all zero, which is fine.) There's no 2-operand form of imul r8, r8, and you need the full 16 bits of the result anyway, because imul doesn't set the Zero flag according to the result. If it did, there might be a compare instruction that tested the right combination of the Zero and Carry or Overflow flags. The manual says ZF is undefined after imul, so don't rely on what your current CPU happens to do. This is one case where you do need the upper bytes to be zero.
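A sketch of that zero-extended path (the loads from (%rdi) and (%rsi) are hypothetical stand-ins for however your bytes actually arrive):

movzbl (%rdi), %r10d   # movzx writes the full 32bit register: no partial-register stall
movzbl (%rsi), %r11d
imul %r10d, %r11d      # 2-operand imul; the 16-bit product can't wrap in 32 bits
test %r11d, %r11d      # test the result itself, since imul leaves ZF undefined
je end                 # taken iff either byte was zero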
The operand-size prefix that makes test %ax, %ax operate on 16bit registers should not cause a decoding stall on Intel, because it doesn't change the length of the rest of the instruction. The dreaded LCP stall happens with 16bit immediates, like test $0xffff, %ax, so avoid those unless you're targeting only AMD.
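Side by side (instruction encodings shown for illustration):

test %ax, %ax       # 66 85 c0: the operand-size prefix doesn't change the length
test $0xffff, %ax   # 66 a9 ff ff: the prefix shrinks the immediate from 32 to 16 bits,
                    # so the pre-decoders mis-guess the length -> LCP stall on Intel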
@Brett Hale's comment on the OP: you only get partial-flag stalls (or, on later CPUs, an extra uop inserted to merge the flags, which is much more efficient) if your branch instruction depends on flag bits that weren't modified by the last flag-setting instruction.
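The classic instance of that, as a sketch (the above label is hypothetical): inc/dec leave CF unmodified, so a branch that reads CF right after one of them depends on two different flag-writing instructions:

cmp %rbx, %rax   # writes all flags, including CF
dec %rcx         # writes ZF/SF/OF/AF/PF but leaves CF alone
ja above         # reads CF (still from cmp) and ZF (from dec): flag merge / stall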
You could and them together first, and do a single test? Or you could also try to do the multiplication as suggested by @Peter Cordes, but instead of using imul, do a lea?
But I would advise keeping your current code; just use test instead of cmp, so as not to obfuscate it.
And actually, since test does an and, just do a test between your two registers, and then either jz or jnz, or even a cmov.