clang vs gcc for copying 3 bytes on x86_64 - number of mov's

Question

What should optimized compiled code for copying 3 bytes from one place to another, say, using memcpy(,,3) look like, in terms of assembly instructions?

Consider the following program:

#include <string.h>
int main() {
  int* p = (int*) 0x10;
  int x = 0;
  memcpy(&x, p, 4);
  x = x * (x > 1 ? 2 : 3);
  memcpy(p, &x, 4);  
  return 0;
}

it's a bit contrived a will cause a segmentation violation, but I need those instructions so that compiling with -O3 doesn't make all of it go away. When I compile this (GodBolt, GCC 6.3 -O3), I get:

main:
        mov     edx, DWORD PTR ds:16
        xor     eax, eax
        cmp     edx, 1
        setle   al
        add     eax, 2
        imul    eax, edx
        mov     DWORD PTR ds:16, eax
        xor     eax, eax
        ret

great - a single mov of a DWORD (= 4 bytes) from memory to a register. Nice and optimized. Now let's change the memcpy(&x, p1, 4) into memcpy(&x, p1, 3)? The compilation result becomes:

main:
        mov     DWORD PTR [rsp-4], 0
        movzx   eax, WORD PTR ds:16
        mov     WORD PTR [rsp-4], ax
        movzx   eax, BYTE PTR ds:18
        mov     BYTE PTR [rsp-2], al
        mov     edx, DWORD PTR [rsp-4]
        xor     eax, eax
        cmp     edx, 1
        setle   al
        add     eax, 2
        imul    eax, edx
        mov     DWORD PTR ds:16, eax
        xor     eax, eax
        ret

I'm not much of an exprt on Intel X86_64 assembly (read: I can't even read it properly when it's complicated), so - I don't quite get this. I mean, I get what's happening in the first 6 instructions and why so many of them are necessary. Why aren't two moves sufficient? A mov WORD PTR int al and a mov BYTE PTR into ah?

... so, I came here to ask. As I was writing the question I noticed GodBolt also has clang as an option. Well, clang (3.9.0 -O3) does this:

main:                                   # @main
        movzx   eax, byte ptr [18]
        shl     eax, 16
        movzx   ecx, word ptr [16]
        or      ecx, eax
        cmp     ecx, 2
        sbb     eax, eax
        and     eax, 1
        or      eax, 2
        imul    eax, ecx
        mov     dword ptr [16], eax
        xor     eax, eax
        ret

which looks more like what I expected. What explains the difference?

Notes:

It's the same behavior essentially if I don't initialize x = 0.
Other GCC versions do about the same thing as GCC 6.3, but GCC 7 is down to 5 instead of 6 mov's.
Other versions of clang (starting from 3.4) do about the same thing.

The behavior is similar if we forego memcpy'ing for the following:

#include <string.h>

typedef struct {
  unsigned char data[3];
}  uint24_t;

int main() {
  uint24_t* p = (uint24_t*) 0x30;
  int x = 0;
  *((uint24_t*) &x) = *p;
  x = x * (x > 1 ? 2 : 3);
  *p = *((uint24_t*) &x);
  return 0;
}

If you want to contrast with what happens when the relevant code is in a function, have a look at this or the uint24_t struct version (GodBolt). Then have a look at what happens for 4-byte values.

Write functions that take args and return a value so you don't have to make them super-weird to not optimize away. e.g. a function that takes two `int*` args and does a 3 byte memcpy between them. `void foo(int *p, int *q) { memcpy(p, q, 3); }`. https://godbolt.org/g/NR4Lcc. Or add a `*p = 0` before the 3-byte memcpy to get code like you're seeing. Or maybe have it then return `*p`, so it needs the result in memory and in a register. — Peter Cordes, Dec 31 '16 at 12:18
@PeterCordes: With your link, I don't get the kind of code I'm seeing. That function rightly has 2 mov's in, 2 mov's out (unless you can move directly between address, which IIRC is not possible). Have a look at this though https://godbolt.org/g/5EDSvC — einpoklum, Dec 31 '16 at 12:36
Is that last comment still relevant? I expanded on it to make an answer, and even that comment explains that to get what you're seeing, you need `*p=0` first and `return *p` after. — Peter Cordes, Dec 31 '16 at 13:07
The godbolt link in your comment has undefined behaviour: you read `x` without having initialized its final byte. Other than that, nothing new to see there. Just some 3-byte copies done as 2 + 1 bytes. — Peter Cordes, Dec 31 '16 at 13:07
How is the `x > 1` meant for 3 byte input, is the 3 byte value 24bit signed, or 24b unsigned? If it's 24b signed, the `and eax,0x00FFFFFF` is not correct. If it's unsigned, then the f(n) is: 0 -> 0, 1 -> 3, n=2..(2^24-1) -> n*2 (truncated to 24bit, if 3B are written back). — Ped7g, Jan 01 '17 at 03:40
@Ped7g: That's a completely arbitrary instruction, I don't really care what it yields. The point is that it depends on the value and may modify it. — einpoklum, Jan 01 '17 at 09:04

Peter Cordes · Answer 1 · 2016-12-31T12:47:38.520

You should get much better code from copying 4 bytes and masking off the top one, e.g. with x & 0x00ffffff. That lets the compiler know it's allowed to read 4 bytes, not just the 3 that the C source reads.

Yes, that helps a ton: it saves gcc and clang from storing a 4B zero, then copying three bytes and reloading 4. They just load 4, mask, store, and use the value that's still in the register. Part of that might be from not knowing if *p aliases *q.

int foo(int *p, int *q) {
  //*p = 0;
  //memcpy(p, q, 3);
  *p = (*q)&0x00ffffff;
  return *p;
}

    mov     eax, DWORD PTR [rsi]     # load
    and     eax, 16777215            # mask
    mov     DWORD PTR [rdi], eax     # store
    ret                              # and leave it in eax as return value

Why aren't two moves sufficient? A mov WORD PTR into al followed by a mov BYTE PTR into ah?

AL and AH are 8-bit registers. You can't put a 16-bit word into AL. This is why your last clang-output block loads two separate registers and merges with a shift+or, in the case where it knows it's allowed to mess with all 4 bytes of x.

If you were merging two separate one-byte values, you could load them to AL and AH and then use AX, but that leads to a partial-register stall on Intel pre-Haswell.

You could do a word-load into AX (or preferably movzx into eax for various reasons including correctness and avoiding a false dependency on the old value of EAX), left-shift EAX, and then a byte-load into AL.

But compilers don't tend to do that, because partial-register stuff has been very bad juju for many years, and is only efficient on recent CPUs (Haswell, and maybe IvyBridge). It would cause serious stalls on Nehalem and Core2. (See Agner Fog's microarch pdf; search for partial-register or look for it in the index. See other links in the x86 tag wiki.) Maybe in a few years, -mtune=haswell will enable partial-register tricks to save the OR instruction that clang uses to merge.

Instead of writing such a contrived function:

Write functions that take args and return a value so you don't have to make them super-weird to not optimize away. e.g. a function that takes two int* args and does a 3 byte memcpy between them.

This on Godbolt (with gcc and clang), with colour highlighting

void copy3(int *p, int *q) { memcpy(p, q, 3); }

 clang3.9 -O3 does exactly what you expected: a byte and a word copy.
    mov     al, byte ptr [rsi + 2]
    mov     byte ptr [rdi + 2], al
    movzx   eax, word ptr [rsi]
    mov     word ptr [rdi], ax
    ret

To get the sillyness you managed to generate, zero the destination first, and then read it back after a three byte copy:

int foo(int *p, int *q) {
  *p = 0;
  memcpy(p, q, 3);
  return *p;
}

  clang3.9 -O3
    mov     dword ptr [rdi], 0       # *p = 0
    mov     al, byte ptr [rsi + 2]
    mov     byte ptr [rdi + 2], al   # byte copy
    movzx   eax, word ptr [rsi]
    mov     word ptr [rdi], ax       # word copy
    mov     eax, dword ptr [rdi]     # read the whole thing, causing a store-forwarding stall
    ret

gcc doesn't do any better (except on CPUs that don't rename partial regs, since it avoids a false dependency on the old value of EAX by using movzx for the byte copy as well).

Ok, that helps. Yes, I got the register sizes mixed up. Thanks for that and for the simplified trigger. — einpoklum, Dec 31 '16 at 12:42

Margaret Bloom · Accepted Answer · 2016-12-31T11:10:20.957

4

The size three is an ugly size and compilers are not perfect.

The compiler cannot generate an access to a memory location you haven't requested, so it needs two moves.

While it seems trivial for you, remember that you asked for memcpy(&x, p, 4); which is a copy from memory to memory.
Evidently GCC and the older versions of Clang are not smart enough to figure it out there is no reason to pass for a temporary in memory.

What GCC is doing with the first six instructions is basically constructing a DWORD at [rsp-4] with the three bytes, as you requested

mov     DWORD PTR [rsp-4], 0              ;DWORD is 0

movzx   eax, WORD PTR ds:16               ;EAX = byte 0 and byte 1
mov     WORD PTR [rsp-4], ax              ;DWORD has byte 0 and byte 1

movzx   eax, BYTE PTR ds:18               ;EAX = byte 2
mov     BYTE PTR [rsp-2], al              ;DWORD has byte 0, byte 1 and byte 2

mov     edx, DWORD PTR [rsp-4]            ;As previous from henceon

It is using movzx eax, ... to prevent a partial register stall.

The compilers did a great job already by eliding the call to memcpy and as you said the example is "a bit contrived" to follow, even for a human. The memcpy optimisations must work for any size, including those that cannot fit a register. It's not easy to get it right every time.

Considering that L1 access latencies have been lowered considerably in the recent architectures and that [rsp-4] is very likely to be in the cache, I'm not sure it's worth messing with the optimisation code in the GCC source.
It is surely worth filing a bug for a missed optimisation and see what the developers has to say.

edited Dec 31 '16 at 11:10

answered Dec 31 '16 at 11:04

Margaret Bloom

41,768
5
78
124

+1 for making me notice x is being constructed on the stack and for the bug filing link. However, 3 is not that ugly a size. I guess 1024 + 3 is, or 1000 + 3. Lots of things have size 3. An RGB color value has size 3. You might very well need to copy (not necessarily contiguous) 3-byte sequences a large number of times. And the thing is, you can't do a simple assignment from a 3-byte integer into a 4-byte one or vice-versa. – einpoklum Dec 31 '16 at 11:36
2

@einpoklum: it's faster to read 4 bytes and AND with `0x00FFFFFF` if you know it's safe to read that 4th byte. Being common doesn't make it less ugly. Welcome to assembly language. – Peter Cordes Dec 31 '16 at 12:40
1

@einpoklum size 3 is ugly because it is always *misaligned*, that is it doesn't align along boundaries of the native CPU word size (`int` as in C, not to be confused with types like `WORD` and `DWORD` which stem from 16bit architecture legacy). Not all CPU architectures can do misaligned reads or writes (even though x86 can). – user268396 Dec 31 '16 at 12:45
@PeterCordes: This may not be faster if the address of the 3-byte sequence is something like `0x1FFFFFD` and crosses cache lines and memory page boundaries... but if you know that you're safe to read the "convenient" candidate for a 4th byte then yeah. – einpoklum Dec 31 '16 at 12:48
@user268396: I think you could have guessed from my question that I realize that already :-( – einpoklum Dec 31 '16 at 12:49
@MargaretBloom: The GCC bugzilla says "User account creation filtered due to spam". – einpoklum Dec 31 '16 at 12:51
1

@user268396: It's not quite true that it's *always* misaligned. A 3B object can be aligned to whatever boundary you want. This lets you load it as part of a 4B load without any risk of reading into an unmapped page or crossing a cache-line boundary. A packed array of 3B elements only has every 4th element aligned on a 4B-boundary, though, which is what I think you were getting at. It's certainly inconvenient to deal with, especially with SIMD. – Peter Cordes Dec 31 '16 at 12:51
1

@einpoklum: That's true. On some microarchitectures, it might be worth branching to choose between a read that gets the 3B you want as the last 3 of a 4B load or as the first 3. (Then shift or AND to isolate the 3B you want). You can always avoid crossing a page boundary unless the object itself does (and might need to do so for correctness). See also http://stackoverflow.com/questions/37800739/is-it-safe-to-read-past-the-end-of-a-buffer-within-the-same-page-on-x86-and-x64 – Peter Cordes Dec 31 '16 at 12:58
1

@einpoklum - send a short email to the gcc overseers at overseers@gcc.gnu.org asking for an account to be created and they'll set it up usually within a day. – BeeOnRope Dec 31 '16 at 18:20
4

As per @MargaretBloom's suggestion, I have filed [GCC Bug 78963: Missed optimization opportunity in copying small, non-aligned-size data](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78963). Thank you for the helpful answers and suggestions. – einpoklum Jan 01 '17 at 10:30

score 0 · Answer 3 · answered Jan 01 '17 at 06:41

(not a real answer, as I can't add anything to what others already answered, so just example of how I would do such code by hand ... probably mostly for my own curiosity)

If the functions is:

f(24b unsigned n):

f(0) → 0
f(1) → 3
f(n) → n*2, n > 1

(looks to me to be this from your question).

Then I would in hand written assembly (nasm syntax) do this:

    mov     eax,[16]    ; reads 4 bytes from address 16

    ; f(n) starts here, n = low 24b of eax, modifies edx
    xor     edx,edx
    and     eax,0x00FFFFFF
    dec     eax
    setz    dl
    lea     eax,[edx+2*eax+2]
    ; output = low 24b of eax, b24..b31 undefined

    ; writes 3 bytes back to address 16
    mov     [16],ax
    shr     eax,16
    mov     [18],al

clang vs gcc for copying 3 bytes on x86_64 - number of mov's

3 Answers3

Instead of writing such a contrived function: