AMD64 Assembler: if- Statement

Question

I have two 8-bit registers and have to check, if one of them is 0.

My solution by now is:

cmp $0, %r10b
je end
cmp $0, %r11b
je end

Is there any other way to do it?

regards

Yes, but it is unlikely to be more efficient, due to speculative execution on any x86-64 implementation. `cmp` updates *all* the flags, so there are no partial flags register stalls either. — Brett Hale, Mar 26 '14 at 13:10

score 1 · Answer 1 · edited May 23 '17 at 12:31

Performance discussions in this answer are for recent Intel CPUs (Sandybridge, Haswell). Mostly applicable to at least as far back as Pentium M, or even earlier P6 (Pentium Pro / Pentium II). See http://agner.org/optimize/ for microarch docs. Performance considerations should be similar on AMD, except they don't macro-fuse test&branch instructions into a single macro-op the way Intel macro-fuses them into a single uop.

Branch predictors exist on every pipelined design, but are more important on something like Haswell than old pre-Silvermont Atom. Still, that part is pretty universal.

Small tweak to your version:

test %r10b, %r10b   ; test is shorter than cmp with an immediate, but no faster
jz   end
test %r11b, %r11b
jz   end

Probably only one of the test/jz pairs will macro-fuse on Intel, because they'll probably both hit the decoders in the same cycle. Also, if either value was the output of an ALU op, it probably already set the zero-flag. So arrange your code so one of the branches doesn't need to separately test.

You can save a branch (at the expense of an extra uop). Throughput of even non-taken branches could be a bottleneck in a really tight loop. Sandybridge can only sustain 1 branch per 1-2 cycles. So this idea might possibly help for:

test  %r10b, %r10b
setnz %r15b          # 1 if %r10b == 0, else 0
dec   %r15b          # 0 if %r10b == 0, else 0xFF
test  %r11b, %r15b
je    end

This is one more instruction (all single-uop instructions with 1 cycle latency, though.) It adds more latency before the branch instruction can be retired (increasing the mispredict penalty by 3 cycles), but it could increase performance:

why a single branch is good:

If a && b is predictable, but it's unpredictable which of a or b will actually be zero, this can reduce the amount of branch mispredicts. Benchmark / perf-counter test it, though: programmers are said to be notoriously bad at guessing which branches will be predictable in their code. CPUs have a limited size branch-history-buffer, so using one fewer entry can help a tiny bit.

Optimized for throughput with slightly worse latency:

If latency isn't critical, just throughput (i.e. mispredicts are infrequent):

# mov     %r10b, %al    # have the byte you want already stored in %r10b
imul    %r11b           # Intel: 3 cycle latency, 1/cycle throughput.
# ZF is undefined, not set according to the result, unfortunately
test    %ax, %ax        # No imm16, so no Intel length-changing-prefix stall for a 16bit insn
je    .end

Total of 2 uops (test/je can macro-fuse, even on AMD). If you need to save the old value of %al, or you can't get one of your values in %al for free, then that's an extra mov.

If the upper bytes of your registers are zeroed, you might be able to gain speed: If you got your byte values into registers using byte operations, using imul %r10d, %r11d would create a partial-register stall (or extra uop to merge). If you wrote the full 32bit register (e.g. movzx), then you can use the 2-operand form of imul, and test the 32bit result. (The upper 16 will be all zero, which is fine.) There's no 2-operand form of imul r8, r8, and you need a full 16b of the result anyway, because it doesn't set the Zero flag according to the result. If it did, there might be a compare instruction that tested the right combination of Zero and Carry or Overflow flags. The manual says ZF is undefined after imul, so don't rely on what your current CPU happens to do. This is one case where you do need the upper bytes to be zero.

The operand-size prefix that makes test %ax, %ax operate on 16bit registers should not cause a decoding stall on Intel, because it doesn't change the length of the rest of the instruction. The dreaded LCP stall happens with 16bit immediates, like test $0xffff, %ax, so avoid those unless you're targeting only AMD.

@Brett Hale's comment on the OP: You only get partial flag stalls (or on later CPUs, an extra uop added to merge the flags (much more efficient)) if your branch instruction depends on flag bits that weren't modified by the last instruction to set the flags.

hmm, IDK what I was thinking with my first attempt at an answer. Thanks for catching my silly mistake. Fixed now. — Peter Cordes, Jul 20 '15 at 17:06
Yeah. @10K, I get to see 2 deleted answers that made the same initial error - and it's real easy to make trying to keep test/cmp/sub/jcc, etc., straight. I'm always sceptical of micro-optimizations, since they're a sliding target with each architecture. I tend to go with the most straightforward case being the best candidate for future optimizations. But this +1 answer is full of interesting details. Maybe you could further qualify them with processor architectures? — Brett Hale, Jul 20 '15 at 17:42
Good point, I sometimes forget to mention that I'm mostly looking at timings and throughput on recent Intel (SnB and onwards). I added that and some other thoughts. And yeah, it's too much work to write much code in asm. I'd only use tricks like this if I was already optimizing a hot loop, and it needed this operation. It's fun to play around with performance ideas, though. — Peter Cordes, Jul 20 '15 at 18:19
Don't get me wrong. I think there's plenty of room for this sort of expertise. The [GMP](https://gmplib.org/) project is one example that could *always* use your talents. Not to mention compiler technology. Beyond that, I love the idea that people still care about an extra cycle or two:) — Brett Hale, Jul 20 '15 at 18:36

Benoît · Answer 2 · 2016-02-29T19:12:41.700

0

You could and then first, and do a single test? Or you can also try to do the multiplication as suggested by @Peter Cordes, but instead of using imul, do a lea ?

But I would advise to keep your current code, just use test instead of cmp, do obfuscate it.

And actually, since test does a and, just do a test between your two registers, and then either jz or jnz, or even a cmov.

edited Feb 29 '16 at 19:12

answered Feb 29 '16 at 11:52

Benoît

3,355
2
29
34

Testing for both-zero is tricky. `or` doesn't preserve that information. Neither does `lea`, since it shifts and adds, but doesn't multiply two registers together. The problem is similar to trying to test that *all* of a certain set of bits are set. You typically have to mask, then check that the result == the mask. Or invert and `test` with the mask, and check `ZF`. – Peter Cordes Feb 29 '16 at 12:04
You are right, but then I meant "and". If one of the 2 registers is 0, the output is 0, and you just have to test it. – Benoît Feb 29 '16 at 18:41
1

It's not that easy, either. If it was, I'd use `test`, not `and`, so it didn't destroy either input. `and` / `test` is bitwise `&`, not C's `&&`. `1 & 2` is zero, but both are non-zero. Their set bits don't intersect. – Peter Cordes Feb 29 '16 at 23:08

AMD64 Assembler: if- Statement

2 Answers2

why a single branch is good:

Optimized for throughput with slightly worse latency: