
I have this memchr code that I'm trying to make non-branching:

```
.globl memchr
memchr:
        mov %rdx, %rcx
        mov %sil, %al
        cld
        repne scasb
        lea -1(%rdi), %rax
        test %rcx, %rcx
        cmove %rcx, %rax
        ret
```

I'm unsure whether or not cmove is a branching instruction. Is it? If so, how do I rearrange my code so it doesn't branch?

asked by S.S. Anne; edited by Peter Cordes
  • You don't need `cld`; all the standard calling conventions guarantee/require DF=0 on call/ret. Also, `movzbl %sil, %eax` would be more efficient than merging into the low byte of RAX. Or just `mov %esi, %eax` is good except if your caller only wrote AL on a P6-family CPU. – Peter Cordes Aug 16 '19 at 14:29
  • I assume it was downvoted for lack of research effort; e.g. googling `is cmov a branch` turns up several hits that all make it obvious, including [Why is a conditional move not vulnerable for Branch Prediction Failure?](//stackoverflow.com/q/14131096) (which is a possible duplicate). I don't think there's any real way to improve the question; including any specific wrong claims or misleading sources would just lead to a more bloated answer that refutes them. – Peter Cordes Aug 17 '19 at 00:19

1 Answer


No, it's not a branch; that's the whole point of cmovcc.

It's an ALU select that has a data dependency on both inputs, not a control dependency. (With a memory source, it unconditionally loads the memory source, unlike ARM predicated load instructions that are truly NOPed. So you can't use it with maybe-bad pointers for branchless bounds or NULL checks. That's maybe the clearest illustration that it's definitely not a branch.)

But anyway, it's not predicted or speculated in any way; as far as the CPU scheduler is concerned it's just like an adc instruction: 2 integer inputs + FLAGS, and 1 integer output. (The only difference from adc/sbb is that it doesn't write FLAGS, and of course it runs on an execution unit with different internals.)

Whether that's good or bad depends entirely on the use-case. See also gcc optimization flag -O3 makes code slower than -O2 for much more about the upsides and downsides of cmov.
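To see the "ALU select" behavior from the compiler's side, here's a minimal C sketch (my own illustration, not from the answer; `select_min` is a made-up name): with gcc or clang at -O2 on x86-64, a ternary like this typically compiles to cmp + a cmovcc rather than a conditional branch.

```c
#include <stdint.h>

/* Illustration only: gcc/clang -O2 on x86-64 typically compile this
   ternary to cmp + cmovcc. Both inputs already exist in registers;
   the condition is consumed as FLAGS *data* that selects one of them,
   so there is nothing for the branch predictor to predict. */
int64_t select_min(int64_t a, int64_t b)
{
    return (a < b) ? a : b;
}
```

Whether the compiler actually picks cmov over a branch is a heuristic decision, which is exactly the upside/downside trade-off the linked question discusses.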


Note that repne scasb is not fast. "Fast Strings" only works for rep stos / movs.

repne scasb runs about 1 count per clock cycle on modern CPUs, i.e. typically about 16x worse than a simple SSE2 pcmpeqb/pmovmskb/test+jnz loop. And with clever optimization you can go even faster, up to 2 vectors per clock saturating the load ports.

(e.g. see glibc's memchr for ORing pcmpeqb results for a whole cache line together to feed one pmovmskb, IIRC. Then go back and sort out where the actual hit was.)

repne scasb also has startup overhead, but microcode branching is different from regular branching: it's not branch-predicted on Intel CPUs. So this can't mispredict, but is total garbage for performance with anything but very small buffers.

SSE2 is baseline for x86-64, and efficient unaligned loads + pmovmskb make it a no-brainer for memchr, where you can check for length >= 16 to avoid crossing into an unmapped page.
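The simple one-vector-per-iteration loop described above can be sketched in C with SSE2 intrinsics (a sketch under the stated assumptions; `memchr_sse2` and its structure are my own, not glibc's actual code): pcmpeqb/pmovmskb map to `_mm_cmpeq_epi8`/`_mm_movemask_epi8`, and a scalar tail avoids touching memory past the buffer once fewer than 16 bytes remain.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Sketch: search 16 bytes per iteration while len >= 16 (so an
   unaligned 16-byte load can't cross past the buffer), then fall
   back to a byte loop for the tail. */
void *memchr_sse2(const void *s, int c, size_t n)
{
    const unsigned char *p = s;
    __m128i needle = _mm_set1_epi8((char)c);      /* broadcast the byte */

    while (n >= 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)p);
        /* pcmpeqb: 0xFF in matching byte lanes; pmovmskb packs the
           high bit of each lane into a 16-bit integer mask. */
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
        if (mask)                                  /* some byte matched */
            return (void *)(p + __builtin_ctz(mask));
        p += 16;
        n -= 16;
    }
    while (n--) {                                  /* scalar tail */
        if (*p == (unsigned char)c)
            return (void *)p;
        p++;
    }
    return NULL;
}
```

(`__builtin_ctz` is the GCC/clang intrinsic for bit-scan; it's only called on a nonzero mask here.) The clever-optimization version mentioned above would additionally OR several pcmpeqb results together per pmovmskb, then re-scan to locate the exact hit.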

Fast strlen:

answered by Peter Cordes
  • *"**Note that `repne scasb` is not fast.**"* -- I know. I'm planning on replacing it with something faster later, but for now it's small and it works. – S.S. Anne Aug 16 '19 at 13:24
  • I also don't have any experience with any SSE/2/3/4.1 stuff. – S.S. Anne Aug 16 '19 at 14:58
  • @JL2210: `memchr` is a pretty good way to learn about using SSE2 for simple element-match searching; knowing the size as a function arg (instead of implicit length) makes it significantly simpler than `strchr`. – Peter Cordes Aug 16 '19 at 15:07
  • @JL2210 You might be interested in [this answer](https://codereview.stackexchange.com/a/213558), which discusses ways of optimizing `strlen` using basic x86 instructions (i.e., without using any SSE). A similar approach could be used for `memchr`, except that you can't optimize so heavily around searching for a 0-byte. Certainly, as Peter says, SSE is the way to make it *really* fast. I've written and benchmarked that code, too. If you want to know more, ask more questions about it. I could expand that answer another 10 pages, but I'd eventually run out of room and folks would tire of reading. – Cody Gray - on strike Oct 22 '19 at 17:27
  • @CodyGray Thanks for the offer. – S.S. Anne Oct 22 '19 at 19:14
  • @JL2210: Updated my answer with some links to SIMD strlen as described in this answer. `memchr` using the zero-byte-in-word bithack usually just does an `xor` with the byte being searched for. You can broadcast it with `c * 0x01010101UL` or whatever. – Peter Cordes Oct 22 '19 at 19:26
  • @JL2210: x86-64 guarantees availability of SSE2 for `pcmpeqb`. If you need 4-byte or 8-byte chunks, `movd` or `movq`+`pcmpeqd` should beat a scalar bithack even in the no-false-positive case. It's also pointless if you know SSE2 is available in 32-bit code, so updated my answer to say "with SSE2". – Peter Cordes Nov 04 '19 at 17:34
  • @PeterCordes Say I made a C version of that same code. Would it still be reasonably more efficient than a byte-for-byte loop? – S.S. Anne Nov 04 '19 at 17:38
  • @JL2210: Yes, for all but the shortest of strings, because current compilers (except ICC) can't auto-vectorize search loops. But this is an assembly-language question. To make an `unsigned long` version without violating strict aliasing in C you'd need `memcpy` for the aliasing loads. Or you could use Intel SIMD intrinsics. Or GNU C `typedef unsigned long aliasing_long __attribute((may_alias))`. Although that would be useful for a *portable* version: [Why does glibc's strlen need to be so complicated to run quickly?](//stackoverflow.com/a/57676035) – Peter Cordes Nov 04 '19 at 18:06
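Putting the pieces from these comments together, here is a hedged C sketch of the scalar word-at-a-time approach (`memchr_swar` and `word_has_byte` are my own naming, not glibc's code): broadcast the search byte with `c * 0x0101...UL`, XOR so matching bytes become zero, then apply the classic `(v - 0x01...01) & ~v & 0x80...80` zero-byte test, using `memcpy` for the aliasing-safe loads as suggested above.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Nonzero iff some byte of w equals c. XOR with the broadcast byte
   turns matching bytes into 0x00, then the classic haszero bithack
   detects any zero byte (no false positives for this exact test). */
static uint64_t word_has_byte(uint64_t w, unsigned char c)
{
    uint64_t bcast = (uint64_t)c * 0x0101010101010101ULL; /* broadcast */
    uint64_t v = w ^ bcast;          /* matching bytes become zero */
    return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
}

void *memchr_swar(const void *s, int c, size_t n)
{
    const unsigned char *p = s;
    while (n >= 8) {
        uint64_t w;
        memcpy(&w, p, 8);            /* aliasing-safe unaligned load */
        if (word_has_byte(w, (unsigned char)c))
            break;                   /* hit somewhere in this word */
        p += 8;
        n -= 8;
    }
    while (n--) {                    /* locate exact hit / scan tail */
        if (*p == (unsigned char)c)
            return (void *)p;
        p++;
    }
    return NULL;
}
```

Per the discussion above, this is the portable fallback; an SSE2 pcmpeqb version should still beat it even for 4- or 8-byte chunks.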