
How can I coax the GCC compiler to emit the REPE CMPSB instruction from plain C, without using the "asm" and "_emit" keywords, calls to an included library, or compiler intrinsics?

I tried C code like the function listed below, but without success:

unsigned int repe_cmpsb(unsigned char *esi, unsigned char *edi, unsigned int ecx) {
    for (; (*esi == *edi) && (ecx != 0); esi++, edi++, ecx--)
        ;
    return ecx;
}

See how GCC compiles it at this link:
https://godbolt.org/g/obJbpq

P.S.
I realize that there is no guarantee the compiler will compile C code in any particular way, but I'd like to coax it anyway, for fun and to see how smart it is.

George Robinson
  • Without using `asm` presumably? – user253751 Mar 20 '18 at 01:46
  • Correct - without using the asm directive ;) – George Robinson Mar 20 '18 at 01:56
  • `rep cmpsb` isn't fast; it's >= 2 cycles per byte throughput on Haswell, for example. (http://agner.org/optimize). Only `rep movs` and `rep stos` have "fast strings" support in current x86 CPUs. The "smart" thing for modern CPUs is to use SSE2 `pcmpeqb`. (But gcc and clang don't know how to vectorize loops with an iteration count that isn't known before loop entry; i.e. they can't vectorize search loops. ICC can, though.) – Peter Cordes Mar 20 '18 at 02:37
  • @PeterCordes interestingly instlatx64 [says](http://users.atw.hu/instlatx64/GenuineIntel00306C3_HaswellXeon_InstLatX64.txt) Haswell can do 1 byte / cycle, I wonder which is right? – harold Mar 20 '18 at 03:04
  • @harold: That says *bandwidth in L1D*. So they mean 2 cycles for 1 byte from each string. Agner describes it as cycles per `n = ecx`, i.e. cycles per compare (and that's what I meant by per byte; I didn't realize the ambiguity). InstLatx64 lists the `rep lodsb` bandwidth as 0.5 bytes / cycle, which is consistent with Agner's `~2n` number for `rep lods` throughput. – Peter Cordes Mar 20 '18 at 03:19
  • @PeterCordes oh yes, that makes sense – harold Mar 20 '18 at 03:23

1 Answer


rep cmps isn't fast; it's >= 2 cycles per count throughput on Haswell, for example, plus startup overhead. (http://agner.org/optimize). You can get a regular byte-at-a-time loop to go at 1 compare per clock (modern CPUs can run 2 loads per clock) even when you have to check for a match and for a 0 terminator, if you write it carefully.

InstLatx64 numbers agree: Haswell can manage 1 cycle per byte for rep cmpsb, but that's total bandwidth (i.e. 2 cycles to compare 1 byte from each string).
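
A minimal sketch of the kind of carefully-written byte-at-a-time loop meant above (illustrative only, not from the original answer): one load from each buffer and one compare per iteration, which a modern out-of-order CPU can sustain at roughly 1 compare per clock.

#include <stddef.h>

/* Sketch: compare n bytes; returns nonzero if all equal.
   One load from each buffer + one compare per iteration. */
static int bytes_equal(const unsigned char *a, const unsigned char *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (a[i] != b[i])
            return 0;   /* mismatch */
    }
    return 1;           /* all n bytes matched */
}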


Only rep movs and rep stos have "fast strings" support in current x86 CPUs. (i.e. microcoded implementations that internally use wider loads/stores when alignment and lack of overlap allow.)
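
As an illustrative aside (my addition, not from the original answer): plain memcpy is the sort of call gcc may expand inline with rep movs, and the expansion choice can be steered with -mstringop-strategy.

#include <string.h>

/* Depending on size and tuning (e.g. -mstringop-strategy=rep_byte),
   gcc may expand this inline as rep movsb instead of calling libc memcpy. */
void copy_block(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}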

The "smart" thing for modern CPUs is to use SSE2 pcmpeqb / pmovmskb. (But gcc and clang don't know how to vectorize loops with an iteration count that isn't known before loop entry; i.e. they can't vectorize search loops. ICC can, though.)


However, gcc will for some reason inline repz cmpsb for strcmp against short fixed strings. Presumably it doesn't know any smarter patterns for inlining strcmp, and the startup overhead may still be better than the overhead of a function call to a dynamic library function. Or maybe not, I haven't tested. Anyway, it's not horrible for code size in a block of code that compares something against a bunch of fixed strings.

#include <string.h>

int string_equal(const char *s) {
    return 0 == strcmp(s, "test1");
}

gcc 7.3 -O3 output, from Godbolt:

.LC0:
    .string "test1"
string_equal:
    mov     rsi, rdi
    mov     ecx, 6
    mov     edi, OFFSET FLAT:.LC0
    repz cmpsb
    sete    al
    movzx   eax, al
    ret

If you don't booleanize the result somehow, gcc generates a -1 / 0 / +1 result with seta / setb / sub / movzx. (Causing a partial-register stall on Intel before IvyBridge, and a false dependency on other CPUs, because it uses 32-bit sub on the setcc results, /facepalm. Fortunately most code only needs a 2-way result from strcmp, not 3-way).
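
For comparison (my addition, not from the original answer), a form that doesn't booleanize the result, and so would trigger that seta / setb / sub / movzx sequence:

#include <string.h>

/* Returning the raw 3-way strcmp result instead of comparing it
   against 0 makes gcc materialize -1 / 0 / +1 from the flags. */
int string_cmp3(const char *s)
{
    return strcmp(s, "test1");
}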

gcc only does this with fixed-length string constants, otherwise it wouldn't know how to set rcx.
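
The flip side, again as my own illustration: with a non-constant second string there's no known length to put in rcx, so gcc falls back to a plain library call.

#include <string.h>

/* Both strings unknown at compile time: no repz cmpsb; gcc emits
   a call to the libc strcmp and tests its return value. */
int string_equal_var(const char *s, const char *t)
{
    return 0 == strcmp(s, t);
}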


The results are totally different for memcmp: gcc does a pretty good job, in this case using a DWORD and a WORD cmp, with no rep string instructions.

int cmp_mem(const char *s) {
    return 0 == memcmp(s, "test1", 6);
}

    cmp     DWORD PTR [rdi], 1953719668  # 0x74736574 = "test", the first 4 bytes
    je      .L8
.L5:
    mov     eax, 1
    xor     eax, 1          # missed optimization here after the memcmp pattern; should just xor eax,eax
    ret
.L8:
    xor     eax, eax
    cmp     WORD PTR [rdi+4], 49     # 0x0031 = '1', '\0': check the last 2 bytes
    jne     .L5
    xor     eax, 1
    ret

Controlling this behaviour

The manual says that -mstringop-strategy=libcall should force a library call, but it doesn't work: there's no change in the asm output.

Neither does -mno-inline-stringops-dynamically -mno-inline-all-stringops.

It seems this part of the GCC docs is obsolete. I haven't investigated further with larger string literals, or fixed size but non-constant strings, or similar.

Peter Cordes
  • Thanks for the long answer. It is interesting to note that repe cmpsb can be slower than some other code in a loop (e.g.: ?) but I don't think anything can beat repe cmpsb in terms of code size. Do you concur? – George Robinson Mar 20 '18 at 07:43
  • @KarolaN: For short compile-time-constant strings, gcc's `memcmp` strategy (`cmp [mem], imm32`) is competitive, if you're counting code + rodata size. There's more overhead per string byte with immediates, though. It avoids the setup code, e.g. `mov ecx, 6` is 5 bytes. (If you're optimizing for code-size, you'd `xor eax,eax` / `lea ecx, [rax+6]` (5 bytes), and that avoids the `movzx` after the `setcc al`. Of course, normally you'd just branch on flags directly instead of using setcc, so you might `push 6` / `pop rcx` (3 bytes) if optimizing for code-size without caring about performance.) – Peter Cordes Mar 20 '18 at 09:05
  • Peter, I'd like to mark your answer as THE ANSWER, but it is missing a crucial piece, namely any plain C code (without opaque libcalls) that will compile to repe cmpsb. It doesn't have to be repe cmpsb alone. A repe cmpsw or repe cmpsd or even a different compiler than GCC is fine, too. – George Robinson Mar 22 '18 at 00:15
  • I think the part at the top explaining why gcc/clang wouldn't want to do that for any of the current CPUs it cares about tuning for, for performance reasons, is the real answer, especially for non-tiny strings. Maybe clang `-Oz` (code-size regardless of performance) would benefit from that, if anyone's bothered to implement that pattern. But I wouldn't expect that anyone *has* implemented it, because it's a lot of code (in the compiler) to find such patterns but it's only useful for rare cases like code-size over speed. – Peter Cordes Mar 22 '18 at 01:45
  • Update: [Why is this code using strlen heavily 6.5x slower with GCC optimizations enabled?](https://stackoverflow.com/q/55563598) is a case where GCC's inline expansion using `rep scasb` at `-O1` is disastrous for performance. That's since been fixed; GCC doesn't do that anymore. – Peter Cordes Aug 23 '22 at 11:21