71

Given this code:

#include <string.h>

int equal4(const char* a, const char* b)
{
    return memcmp(a, b, 4) == 0;
}

int less4(const char* a, const char* b)
{
    return memcmp(a, b, 4) < 0;
}

GCC 7 on x86_64 introduced an optimization for the first case (Clang has done it for a long time):

    mov     eax, DWORD PTR [rsi]
    cmp     DWORD PTR [rdi], eax
    sete    al
    movzx   eax, al

But the second case still calls memcmp():

    sub     rsp, 8
    mov     edx, 4
    call    memcmp
    add     rsp, 8
    shr     eax, 31

Could a similar optimization be applied to the second case? What's the best assembly for this, and is there any clear reason why it isn't being done (by GCC or Clang)?

See it on Godbolt's Compiler Explorer: https://godbolt.org/g/jv8fcf

Jonas Schäfer
John Zwinck
  • 1
    What I find interesting is the casual disregard for alignment; this may work with x86, but on other CPUs the optimization may not be valid. – Matthieu M. Jul 12 '17 at 15:02
  • 19
    @MatthieuM. it only has to be valid on the target architecture – Caleth Jul 12 '17 at 15:21
  • @Caleth: Agree, but it makes me wonder at what stage the transformation is done. That is, whether gcc uses target-specific optimization in its middle-end (abstracted, perhaps), or if it's part of the lowering. – Matthieu M. Jul 12 '17 at 16:31
  • 9
    @MatthieuM. You can find out by compiling with `-fdump-tree-all -fdump-rtl-all` (in addition to all the other switches). That will dump the intermediate representation after each stage of optimization to a file in the current working directory, numbered so you can read through them in sequence. (You will get roughly 300 files if you do this. The "tree" dumps are much easier to read than the "RTL" dumps. You'll probably want to skim the "RTL" and "machine description" chapters of the [internals manual](https://gcc.gnu.org/onlinedocs/gccint) before trying to read the RTL dumps.) – zwol Jul 12 '17 at 18:05

4 Answers

73

If you generate code for a little-endian platform, optimizing four-byte memcmp for inequality to a single DWORD comparison is invalid.

When memcmp compares individual bytes it goes from low-addressed bytes to high-addressed bytes, regardless of the platform.

In order for memcmp to return zero, all four bytes must be identical, so the order in which they are compared does not matter. Therefore, the DWORD optimization is valid for the == 0 case: only equality is tested, and the sign of the result is ignored.

However, when memcmp returns a nonzero value, byte ordering matters: the sign of the result is determined by the first differing byte in memory order. Hence, implementing the same comparison with a single 32-bit DWORD comparison requires a specific endianness: the platform must be big-endian, otherwise the result of the comparison would be incorrect.
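As a concrete illustration (a minimal sketch added here, with arbitrarily chosen byte values), a native-integer comparison on a little-endian machine agrees with memcmp about equality but not about ordering:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const unsigned char a[4] = { 0x01, 0x00, 0x00, 0x00 };  // memcmp: a > b (first byte 0x01 > 0x00)
    const unsigned char b[4] = { 0x00, 0x00, 0x00, 0x02 };

    uint32_t ua, ub;
    memcpy(&ua, a, 4);   // little-endian load: ua = 0x00000001
    memcpy(&ub, b, 4);   // little-endian load: ub = 0x02000000

    printf("memcmp says a > b: %d\n", memcmp(a, b, 4) > 0);  // 1
    printf("uint32 says a > b: %d\n", ua > ub);              // 0 on little-endian: wrong order
    printf("equality agrees:   %d\n", (memcmp(a, b, 4) == 0) == (ua == ub));  // 1
    return 0;
}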

Sergey Kalinichenko
  • 12
    There's a `bswap` instruction in x86, and ARM has `rev`. One extra instruction, though. – MSalters Jul 12 '17 at 11:12
  • I understand your point, and it is a good one, but as @MSalters says, it's simple to reverse the bytes before doing the comparison as integers. Surely that would be better than calling `memcmp()`. So I'm not sure this really answers my question, which was "Could a similar optimization be applied to the second case? What's the best assembly for this?" – John Zwinck Jul 12 '17 at 12:38
  • 4
@CodyGray: as dasblinkenlight points out, that's sufficient to tell `<0` and `>0` apart. Arithmetically, CMP looks for the most significant bit difference, while `memcmp` looks for the first differing byte in memory order. On big-endian systems, the first byte holds the MSB. `bswap` turns the native little-endian bit pattern into big-endian, which is why it works. – MSalters Jul 12 '17 at 13:14
  • Ah, sorry. It was failing my test cases because I was doing a *signed* comparison, rather than an unsigned comparison. Dumb mistake, but I didn't catch it even after going over my implementation a couple of times. – Cody Gray - on strike Jul 12 '17 at 13:31
  • 13
    @Kevin: You wouldn't want to swap the bytes in memory (and then restore them) anyway! The optimal asm might be something like: load both 4B blocks into registers, byte-swap them both. So it's extra instructions for the decoders and extra fused-domain uops for the front-end to issue, because you can't use a memory operand with `cmp` like for the `== 0` case. e.g. `mov edi, [rdi]` / `mov esi, [rsi]` / `bswap edi` /`bswap esi` / `cmp edi, esi` / seta and movzx. Those are all single-uop instructions on all recent Intel and AMD CPUs (http://agner.org/optimize/) – Peter Cordes Jul 12 '17 at 17:40
  • 5
    @Kevin and const-correct only applies to the source. The CPU can do whatever it likes as long as the end result is the same. – OrangeDog Jul 12 '17 at 18:11
  • @Peter You mean `setb` for the `< 0` case discussed here, not `seta`. :-) I wonder, though, if you might be better off with `cmp edi, esi` + `sbb eax, eax` + `neg eax`. The dependency chain is the same length, but the latter two instructions are 2 bytes, rather than 3 bytes. `SBB` is still 2 cycles pre-Broadwell (I thought `SETcc` was still high latency—ICC *always* uses `CMOV`), but it can run on more ports than `SETcc`. There's no elision of `MOVZX`, is there? (Although I guess you could trivially rewrite with a 2-byte `XOR` instead, which could be renamed, and may be the fastest.) – Cody Gray - on strike Jul 12 '17 at 19:05
  • @CodyGray: xor/cmp/setcc is best, unless you really need to avoid a tmp register. Skylake eliminates `movzx r32, r8-low`, except for extending in-place like here. :/ `sbb`/`neg` is only better on ancient CPUs where `setcc` is slow. IDK what ICC is smoking when it puts 0 and 1 into two registers and uses `cmov`. I have a whole section about setcc in my [xor-zeroing answer :P](https://stackoverflow.com/questions/33666617/what-is-the-best-way-to-set-a-register-to-zero-in-x86-assembly-xor-mov-or-and/33668295#33668295). But luckily this usually goes away when inlining into an `if()`: `cmp/jcc` – Peter Cordes Jul 12 '17 at 19:12
  • 2
@OrangeDog - not really. If arguments are declared `const char *` they could very well have been defined as `const` as well, which means they could be read-only and trying to modify them would cause a fault. In the real world, this is exactly what happens for things declared `const char *`: they are placed in the `.rodata` segment which is loaded without write-permissions. Working at the asm level does nothing to mitigate this. – BeeOnRope Jul 13 '17 at 00:31
  • 2
    @BeeOnRope you're assuming writes to main memory, whereas the CPU can still do whatever it wants internally. – OrangeDog Jul 13 '17 at 07:50
@OrangeDog - yes I'm assuming writes to the _array_ (not necessarily to "main memory"). I assume also Kevin was talking about writes to the array, since that's the thing that `const` protects at the source level, after all. Are you talking about something different? My point is that unlike many other languages, `const` in C is often enforced at the hardware level by putting const-defined data in read-only memory. Even at the assembly level you can't get around that: the CPU is required to adhere to the memory protection settings (and of course in the case of real ROM, writes are impossible). – BeeOnRope Jul 13 '17 at 17:49
  • 1
@BeeOnRope you made the same false assumptions as Kevin. You can mutate the array in registers or cache, and there is a level below assembly where the CPU can do whatever it wants. `const` is almost never enforced at the hardware level, because half the time it applies to parameters and local variables. Only your `const char *` string literals usually end up in read-only memory. For anything that doesn't, you can change them there as long as you put it back afterwards (though that would be pointlessly slow). – OrangeDog Jul 14 '17 at 08:07
  • 1
You _cannot_ mutate the array in cache; the same protections are applied to memory that happens to be in cache as if it were not (after all, it would be insane if the caches somehow bypassed memory protection). _Of course_ you can mutate the array in a register, _and you can ask for the same at the source level_ (e.g., by assigning to a temporary). The bottom line is that you can't write to a `const` object at the source level and often, neither can the CPU. Pointing out that you can read from it and manipulate the values in registers is simply obvious. @OrangeDog – BeeOnRope Jul 14 '17 at 16:40
I have no idea where you got your idea that this somehow applies only to `const char *` string literals. Modern compilers like `gcc` and `clang` put essentially _all_ objects with static storage duration defined as `const` into read-only memory which cannot be written even at the assembly level (and there is no "even lower level magic" that lets you bypass that). You are right that often `const` is not enforced when it is applied to a local or heap allocated object, but when you write a function you need it to work for all inputs, right? It's probably not OK if it crashes once in a while... – BeeOnRope Jul 14 '17 at 16:44
  • It doesn't matter if an L1 cache bypasses memory protection as long as it doesn't affect anything else. Bottom line is the CPU can do whatever it wants as long as you cannot tell what it did. And the only person who ever mentioned the source level is Kevin. – OrangeDog Jul 14 '17 at 16:44
@BeeOnRope there certainly are lower levels. Are you completely unaware of microarchitectures? You think when you run x86 machine code the CPU will only use four of its general purpose registers? You think it does exactly the main memory stores you tell it to, exactly when you tell it? You're arguing about the source level again when you discuss function contracts. – OrangeDog Jul 14 '17 at 16:47
  • Of course there is a hardware implementation with a lot of complexity, but it is generally invisible to the programmer and certainly doesn't let you bypass memory protection. Assembly (and related things like the layout and behavior of machine specific registers) is the lowest level you can reason about and the contract that the CPU provides. When you set up an area as read-only, userland code won't be able to bypass that without a serious flaw in the CPU despite any lower levels. – BeeOnRope Jul 14 '17 at 16:50
  • 1
    Listen - let me state what I understood to be your claim: "`const` is only a source level construct, but if you work at a lower level (like assembly) you can bypass it". While this is true in some other languages, it is not true in general in C or C++, where the compiler, linker, runtime loader, and _CPU_ all conspire to prevent you from modifying some objects defined as const. – BeeOnRope Jul 14 '17 at 16:52
24

Endianness is the problem here. Consider this input:

a = 01 00 00 03
b = 02 00 00 02

If you compare these two arrays by loading them as 32-bit integers on a little-endian machine, you'll find that a appears larger (because 0x03000001 > 0x02000002), even though memcmp reports a as smaller (its first byte, 0x01, is less than b's first byte, 0x02). On a big-endian machine, this test would work as expected.

r3mainer
  • 3
    That's true, but the question was about optimizing away the `memcmp()` call. It could still be done by emitting byte-swap instructions before doing the comparison, right? – John Zwinck Jul 12 '17 at 12:36
  • 3
    @JohnZwinck I think byte-swapping for this would be handling of a _very special_ case, which the compiler writers don't bother with. – Ruslan Jul 12 '17 at 13:07
  • 13
    @Ruslan: Compilers are full of such small optimizations; I'm pretty sure the compiler developers would be happy to receive a patch to cover this... if it really works. – Matthieu M. Jul 12 '17 at 15:01
  • 1
    @MatthieuM.: if this is important in real code, you'll get better results from current compilers by using `endian.h` or similar byte-swap functions to get two `uint32_t` integers to compare. [See my answer](https://stackoverflow.com/questions/45052282/why-is-memcmpa-b-4-only-sometimes-optimized-to-a-uint32-comparison/45066631#45066631) for an example. But then you have to worry about misaligned loads and stuff if writing portable code, so it would be nice if you could compare unknown-alignment big-endian integers using `memcmp` and get optimal code. – Peter Cordes Jul 12 '17 at 20:08
16

As discussed in other answers/comments, using memcmp(a,b,4) < 0 is equivalent to an unsigned comparison between big-endian integers. It couldn't inline as efficiently as == 0 on little-endian x86.

More importantly, the current version of this behaviour in gcc7/8 only looks for memcmp() == 0 or != 0. Even on a big-endian target where this could inline just as efficiently for < or >, gcc won't do it. (Godbolt's newest big-endian compilers are PowerPC 64 gcc6.3 and MIPS/MIPS64 gcc5.4; mips is big-endian MIPS, while mipsel is little-endian MIPS.) If testing this with future gcc, use a = __builtin_assume_aligned(a, 4) to make sure gcc doesn't have to worry about unaligned-load performance/correctness on non-x86. (Or just use const int32_t* instead of const char*.)
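For example, a minimal sketch of that alignment hint (the wrapper name less4_hinted is invented here for illustration; it assumes a GNU-compatible compiler):

#include <string.h>

// Hypothetical wrapper: promise gcc that both pointers are 4-byte aligned,
// so an inlined memcmp wouldn't have to worry about unaligned loads on non-x86.
int less4_hinted(const char* a, const char* b)
{
    a = __builtin_assume_aligned(a, 4);
    b = __builtin_assume_aligned(b, 4);
    return memcmp(a, b, 4) < 0;
}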

If/when gcc learns to inline memcmp for cases other than EQ/NE, maybe gcc will do it on little-endian x86 when its heuristics tell it the extra code size will be worth it. e.g. in a hot loop when compiling with -fprofile-use (profile-guided optimization).


If you want compilers to do a good job for this case, you should probably assign to a uint32_t and use an endian-conversion function like ntohl. But make sure you pick one that can actually inline; apparently Windows has an ntohl that compiles to a DLL call. See other answers on that question for some portable-endian stuff, and also someone's imperfect attempt at a portable_endian.h, and this fork of it. I was working on a version for a while, but never finished/tested it or posted it.

Pointer-casting to const uint32_t* would be Undefined Behaviour if the bytes were written as anything but aligned uint32_t or through char*. If you're not sure about strict-aliasing and/or alignment, memcpy into a uint32_t (like abytes below) or use GNU C attributes: see another Q&A about alignment and strict-aliasing for workarounds. Most compilers are good at optimizing away small fixed-size memcpy.

// I know the question just wonders why gcc does what it does,
// not asking for how to write it differently.
// Beware of alignment performance or even fault issues outside of x86.

#include <endian.h>
#include <stdint.h>

static inline
uint32_t load32_native_endian(const void *vp){
   typedef uint32_t unaligned_aliasing_u32 __attribute__((aligned(1),may_alias));
   const unaligned_aliasing_u32 *up = vp;
   return *up;   // #ifndef __GNUC__  then use memcpy
}


int equal4_optim(const char* a, const char* b) {
    uint32_t abytes = load32_native_endian(a);
    uint32_t bbytes = load32_native_endian(b);

    return abytes == bbytes;
}


int less4_optim(const char* a, const char* b) {
    uint32_t a_native = be32toh(load32_native_endian(a));
    uint32_t b_native = be32toh(load32_native_endian(b));

    return a_native < b_native;
}

I checked on Godbolt, and that compiles to efficient code (basically identical to what I wrote in asm below), especially on big-endian platforms, even with old gcc. It also makes much better code than ICC17, which inlines memcmp but only to a byte-compare loop (even for the == 0 case).
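As an aside, on compilers without GNU attributes, a fixed-size memcpy load (a fallback sketch added here, not part of the original code) typically compiles to the same single 32-bit load:

#include <stdint.h>
#include <string.h>

// Portable fallback for load32_native_endian(): compilers generally optimize
// this fixed-size memcpy into one (possibly unaligned) 32-bit load.
static inline uint32_t load32_memcpy(const void *vp)
{
    uint32_t v;
    memcpy(&v, vp, sizeof v);
    return v;
}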


I think this hand-crafted sequence is an optimal implementation of less4() (for the x86-64 System V calling convention, as used in the question, with const char *a in rdi and b in rsi).

less4:
    mov   edi, [rdi]
    mov   esi, [rsi]
    bswap edi
    bswap esi
    # data loaded and byte-swapped to native unsigned integers
    xor   eax,eax    # solves the same problem as gcc's movzx, see below
    cmp   edi, esi
    setb  al         # eax=1 if *a was Below(unsigned) *b, else 0
    ret

Those are all single-uop instructions on Intel and AMD CPUs since K8 and Core2 (http://agner.org/optimize/).

Having to bswap both operands has an extra code-size cost vs. the == 0 case: we can't fold one of the loads into a memory operand for cmp. (That saves code size, and uops thanks to micro-fusion.) This is on top of the two extra bswap instructions.

On CPUs that support movbe, it can save code size: movbe ecx, [rsi] is a load + bswap. On Haswell, it's 2 uops, so presumably it decodes to the same uops as mov ecx, [rsi] / bswap ecx. On Atom/Silvermont, it's handled right in the load ports, so it's fewer uops as well as smaller code-size.

See the setcc part of my xor-zeroing answer for more about why xor/cmp/setcc (which clang uses) is better than cmp/setcc/movzx (typical for gcc).

In the usual case where this inlines into code that branches on the result, the setcc + zero-extend are replaced with a jcc; the compiler optimizes away creating a boolean return value in a register. This is yet another advantage of inlining: the library memcmp does have to create an integer boolean return value which the caller tests, because no x86 ABI/calling convention allows for returning boolean conditions in flags. (I don't know of any non-x86 calling conventions that do that either). For most library memcmp implementations, there's also significant overhead in choosing a strategy depending on length, and maybe alignment checking. That can be pretty cheap, but for size 4 it's going to be more than the cost of all the real work.
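To make that concrete, here is a hypothetical caller (copy_smaller4 is invented for illustration and assumes less4_optim() from the code above): when less4_optim() inlines, the compiler can branch directly on the flags from the cmp instead of materializing a 0/1 value with setb/movzx.

#include <string.h>

// Hypothetical caller: after inlining less4_optim(), the branch can use the
// flags from cmp directly (e.g. jb / jae), with no setb + movzx needed.
void copy_smaller4(char* dst, const char* a, const char* b)
{
    if (less4_optim(a, b))
        memcpy(dst, a, 4);
    else
        memcpy(dst, b, 4);
}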

Peter Cordes
  • 4
    I dug into the GCC source code a bit. This optimization is implemented by [`handle_builtin_memcmp` in the somewhat inaccurately named `tree-ssa-strlen.c`](https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/tree-ssa-strlen.c#l2078) and if I'm reading it right, it only implements the `==` and `!=` cases: the checks on lines 2102-3 and 2108-9 cause it to bail out without doing anything if the comparison isn't `EQ_EXPR` or `NE_EXPR`, which mean what they sound like. Later it also bails out if `!SLOW_UNALIGNED_ACCESS (mode, align)` which means "can we do this load without worrying about alignment?" – zwol Jul 12 '17 at 21:49
  • 1
    @zwol: thanks! I'm not surprised the first implementation of this new feature only handles `==` / `!=` comparisons. It's too bad there aren't *truly* portable endian functions that would make it easy to write the `less4` in a compiler-friendly way without a mess of `#ifdef`s. – Peter Cordes Jul 12 '17 at 21:54
  • 2
    Incidentally, the aliasing rules are asymmetric: `char *` can alias anything, but `int *` officially _can't_ alias `char`, at least when the `char` is declared as such; see https://stackoverflow.com/questions/30967447/undefined-behavior-on-reading-object-using-non-character-type-when-last-written – zwol Jul 12 '17 at 23:29
  • 1
FWIW the portable snippets library has an [endian](https://github.com/nemequ/portable-snippets/tree/master/endian) module that seems to do a reasonable job at doing this "fast" (as does the rest of the library), and seems to be of high quality and actively maintained. – BeeOnRope Jul 13 '17 at 17:59
-2

Endianness is one problem, but signed char is another. For example, suppose the four bytes you compare are 0x20 0x7f 0x20 0x20 and 0x20 0x80 0x20 0x20. As signed char, the 0x80 byte is -128 and the 0x7f byte is +127, so a byte-wise comparison done with signed char would order them the opposite way from the unsigned ordering that memcmp is supposed to produce.

Of course you can do an xor with 0x80808080 and then you can just use an unsigned compare.

gnasher729
  • 11
    `memcmp` is required to compare as `unsigned char`, regardless of whether `char` is signed. –  Jul 12 '17 at 21:58