Why is memcmp so much faster than a for loop check?

Question

Why is memcmp(a, b, size) so much faster than:

for(i = 0; i < nelements; i++) {
    if a[i] != b[i] return 0;
}
return 1;

Is memcmp a CPU instruction or something? It must be pretty deep because I got a massive speedup using memcmp over the loop.

Compile with -S to see the assembly-language output and find out. On `x86`, as others have mentioned, there are instructions for this, but often these can be vectorized. — Davislor, Apr 21 '18 at 03:31
But what optimization level are you using? Many compilers can unroll that loop. — Davislor, Apr 21 '18 at 03:39

Jonathon Reinhart · Accepted Answer · 2017-06-17T19:30:08.000

memcmp is often implemented in assembly to take advantage of a number of architecture-specific features, which can make it much faster than a simple loop in C.

As a "builtin"

GCC supports memcmp (as well as a ton of other functions) as builtins. In some versions / configurations of GCC, a call to memcmp will be recognized as __builtin_memcmp. Instead of emitting a call to the memcmp library function, GCC will emit a handful of instructions to act as an optimized inline version of the function.

On x86, this leverages the use of the cmpsb instruction, which compares a string of bytes at one memory location to another. This is coupled with the repe prefix, so the strings are compared until they are no longer equal, or a count is exhausted. (Exactly what memcmp does).

Given the following code:

int test(const void* s1, const void* s2, int count)
{
    return memcmp(s1, s2, count) == 0;
}

gcc version 3.4.4 on Cygwin generates the following assembly:

; (prologue)
mov     esi, [ebp+arg_0]    ; Move first pointer to esi
mov     edi, [ebp+arg_4]    ; Move second pointer to edi
mov     ecx, [ebp+arg_8]    ; Move length to ecx

cld                         ; Clear DF, the direction flag, so comparisons happen
                            ; at increasing addresses
cmp     ecx, ecx            ; Special case: If length parameter to memcmp is
                            ; zero, don't compare any bytes.
repe cmpsb                  ; Compare bytes at DS:ESI and ES:EDI, setting flags
                            ; Repeat this while equal ZF is set
setz    al                  ; Set al (return value) to 1 if ZF is still set
                            ; (all bytes were equal).
; (epilogue)

Reference:

cmpsb instruction

As a library function

Highly-optimized versions of memcmp exist in many C standard libraries. These will usually take advantage of architecture-specific instructions to work with lots of data in parallel.

In Glibc, there are versions of memcmp for x86_64 that can take advantage of the following instruction set extensions:

The cool part is that glibc will detect (at run-time) the newest instruction set your CPU has, and execute the version optimized for it. See this snippet from sysdeps/x86_64/multiarch/memcmp.S:

ENTRY(memcmp)
    .type   memcmp, @gnu_indirect_function
    LOAD_RTLD_GLOBAL_RO_RDX
    HAS_CPU_FEATURE (SSSE3)
    jnz 2f
    leaq    __memcmp_sse2(%rip), %rax
    ret 

2:  HAS_CPU_FEATURE (SSE4_1)
    jz  3f  
    leaq    __memcmp_sse4_1(%rip), %rax
    ret 

3:  leaq    __memcmp_ssse3(%rip), %rax
    ret 

END(memcmp)

In the Linux kernel

Linux does not seem to have an optimized version of memcmp for x86_64, but it does for memcpy, in arch/x86/lib/memcpy_64.S. Note that is uses the alternatives infrastructure (arch/x86/kernel/alternative.c) for not only deciding at runtime which version to use, but actually patching itself to only make this decision once at boot-up.

@cHao Wow, thanks I almost continued writing my whole answer as if this were `memcpy`. — Jonathon Reinhart, Jan 14 '14 at 05:46
It would be interesting to profile the builtin version with non builtin version (`-fno-builtin`). At some point the builtin version was much slower. I don't know if it has improved. — Z boson, Jan 14 '14 at 13:12
I was reading the bug report, and it appears that it's quite complicated and there are lots of corner cases. (including constant lengths, etc.) — Jonathon Reinhart, Jan 14 '14 at 13:52
IIRC instructions like `rep cmpsb` are actually quite slow. gcc now generates a call to the libc version of memcmp, which (in glibc) has an optimized asm implementation (using SIMD, not `rep cmpsb`). — Marc Glisse, Nov 12 '14 at 15:09
That is not universally true. Modern CPUs have a "fast string operations" feature that puts the rep * versions back at the top. The Linux kernel detects your whether your CPU supports this feature and live patches the appropriate code in place. (Although that may be only for movsb and friends) — Jonathon Reinhart, Nov 12 '14 at 16:23
I'll try and dig up some documentation, but Googling for that quoted string above should turn up something good. — Jonathon Reinhart, Nov 12 '14 at 16:24
As mentioned by @MarcGlisse , 4.8 with `-O3` makes a `jmp memcmp`. This is also mentioned at: http://stackoverflow.com/a/6334452/895245, which links to: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052 — Ciro Santilli OurBigBook.com, Apr 24 '15 at 20:38
All: I have thoroughly updated my answer with the current state of affairs. Feedback is welcome. — Jonathon Reinhart, Jan 28 '16 at 12:31
Marc Glisse is correct; but "fast strings" only applies to `rep`, not `repz`/`repnz`. `rep movsb` / `rep stosb` are fast (especially with ERMSB on Ivybridge+), but `repz cmpsb` isn't. See http://agner.org/optimize/ for instruction tables: On Skylake, `repz cmps` has a runtime of `>=2n` cycles, taking `>= 8n` uops. (Where `n` is the element count, `rcx` if it goes to the end, i.e. 1 byte per 2 cycles for `cmpsb`.) But `rep movs` has a best-case of `1/32B` (copy 32 bytes per cycle). — Peter Cordes, Nov 27 '17 at 10:08
Inlining `rep cmpsb` is a poor choice except for very short blocks to save the function-call overhead. It's *not* fast. It can also avoid a branch mispredict, though, again a possible advantage for short blocks. — Peter Cordes, Nov 27 '17 at 10:09

user207421 · Answer 2 · 2014-01-14T06:00:06.590

0

Is memcmp a CPU instruction or something?

It is at least a very highly optimized compiler-provided intrinsic function. Possibly a single machine instruction, or two, depending on the platform, which you haven't specified.

edited Jan 14 '14 at 06:00

answered Jan 14 '14 at 05:44

user207421

305,947
44
307
483

score 0 · Answer 3 · edited May 23 '17 at 11:46

0

It's usually a compiler intrinsic that is translated into fast assembly with specialized instructions for comparing blocks of memory.

intrinsic memcmp

edited May 23 '17 at 11:46

Community

1
1

answered Jan 14 '14 at 05:45

a_mole

505
4
11

`memcmp` is a GCC *builtin*, not intrinsic. *intrinsic* typically refers to C-level access to specific CPU instructions. – Jonathon Reinhart Jan 14 '14 at 06:07
and intrinsic is what they're called in Visual C++ – a_mole Jan 14 '14 at 15:42

Why is memcmp so much faster than a for loop check?

3 Answers3

As a "builtin"

As a library function

In the Linux kernel

Linked

Related