
I'm using GCC 12.2 on Linux with -nostdlib, and the build complained that memcpy and memmove were missing. So I implemented a bad memcpy in assembly and made memmove call abort, since I always want to use memcpy.

I was wondering whether I could stop the compiler from asking for memcpy (and memmove) by implementing my own copy in C. The optimizer seems to recognize what the loop really is and calls the library function anyway. Since both functions were implemented (and I was using #define memcpy mymemcpy), the program linked and ran, and then it aborted: GCC had called my memmove implementation instead of my assembly memcpy. Why is GCC calling memmove instead of memcpy?

Clang calls memcpy, but GCC optimizes my code better, so I use GCC for optimized builds.

__attribute__ ((access(write_only, 1))) __attribute__((nonnull(1, 2)))
inline void mymemcpy(void *__restrict__ dest, const void *__restrict__ src, int size)
{
    const unsigned char *s = (const unsigned char*)src;
    unsigned char *d = (unsigned char*)dest;
    while(size--) *d++ = *s++;
}

Reproducible example

//dummy.cpp

extern "C" {
void*malloc() { return 0; }
int read() { return 0; }
int write() { return 0; }
int memcpy() { return 0; }
int memmove() { return 0; }
}

//main.cpp
#include <unistd.h>
#include <cstdlib>
struct MyVector {
    void*p;
    long long position, length;
};

__attribute__ ((access(write_only, 1))) __attribute__((nonnull(1, 2)))
void mymemcpy(void *__restrict__ dest, const void *__restrict__ src, int size)
{
    const unsigned char *s = (const unsigned char*)src;
    unsigned char *d = (unsigned char*)dest;
    while(size--) *d++ = *s++;
}

//__attribute__ ((noinline))
int func(const char*file_from_disk, MyVector*v)
{
    if (v->position + 5 <= v->length ) {
        mymemcpy(v->p, file_from_disk, 5);
    }
    return 0;
}

char buf[4096];
extern "C"
int _start() {
    MyVector v{malloc(1024),0,1024};
    v.position += read(0, v.p, 1024-5);
    int len = read(0, buf, 4096);
    func(buf, &v);
    write(1, v.p, v.position);
}

g++ -march=native -nostdlib -static -fno-exceptions -fno-rtti -O2 main.cpp dummy.cpp

Check using objdump -D a.out | grep call

401040: e8 db 00 00 00          call   401120 <memmove>
40108d: e8 4e 00 00 00          call   4010e0 <malloc>
4010a3: e8 48 00 00 00          call   4010f0 <read>
4010ba: e8 31 00 00 00          call   4010f0 <read>
4010c5: e8 56 ff ff ff          call   401020 <_Z4funcPKcP8MyVector>
4010d5: e8 26 00 00 00          call   401100 <write>
402023: ff 11                   call   *(%rcx)
Peter Cordes
Stan
  • If it wants to call `memmove` for any reason, and you provide an implementation of it that aborts, why are you surprised it aborted?? – ShadowRanger Dec 13 '22 at 23:48
  • You missed a step. I'm surprised it called memmove, unlike clang which calls memcpy – Stan Dec 13 '22 at 23:49
  • At least historically, [some platforms implemented `memcpy` using the `memmove` implementation (because too many programs used the wrong one)](https://stackoverflow.com/a/1961237/364696). Beyond that, we have no idea what code is using one or the other; it's wholly possible one optimizer could verify `memcpy` was safe in a place that used `memmove` and used `memcpy`. We'd need a [MCVE] to have the slightest basis to base an answer on. You say you're surprised it called one or the other, but we have *no* idea where anything is calling it, period. Could be the C runtime initialization... – ShadowRanger Dec 13 '22 at 23:55
  • As far as I remember from reading the spec, memmove is the only function that has defined behaviour for arguments that are overlapping in memory (e.g. `memmove(buf, buf+3, 10)`). For all the other functions, including memcpy, such an operation would be undefined behaviour. So maybe there is something that triggers gcc to use memmove over memcpy for that reason? – hlovdal Dec 13 '22 at 23:58
  • You may wanna try compiling with `-ffreestanding` too. But also, please show the code in a [mre], and also see if you can reproduce it on godbolt.org for example. – Marco Bonelli Dec 14 '22 at 00:27
  • Are you saying that GCC and Clang generate `memmove` or `memcpy` calls for the code shown? You should state that explicitly in the question. `-nostdlib` does not turn that off. You should use `-ffreestanding` or `-fno-builtin`. (I would think `-fno-builtin-memcpy -fno-builtin-memmove` would work too, but [testing shows otherwise](https://godbolt.org/z/n9afrhzWa) for GCC. Clang seems to obey it. With `-O3`, Clang generates a ton of code; it must be doing some performance optimization. `-Os` gives smaller code.) – Eric Postpischil Dec 14 '22 at 00:32
  • @EricPostpischil: The correct flag is `-fno-tree-loop-distribute-patterns`. – Dietrich Epp Dec 14 '22 at 01:02
  • @hlovdal I hope I can override that. I edited in a reproducible – Stan Dec 14 '22 at 01:15
  • @MarcoBonelli Will freestanding disable certain optimizations? Looking at mymemcpy, it no longer calls memmove, but it's the literal implementation copying one byte at a time. It doesn't seem optimized. I guess I could write a good implementation, but I was hoping gcc would write in good implementations. Using objdump after freestanding shows me it was inlined with the terrible implementation (copied 5 bytes one at a time) – Stan Dec 14 '22 at 01:23
  • The `_start` function should not return. In some sense, it is not a function at all, but just the entry point for your program. Normally you would put an `exit()` at the end so it doesn't return. – Dietrich Epp Dec 14 '22 at 01:23
  • @EricPostpischil repo is up. Dietrich yes I know. Don't look at dummy.cpp, you'll be horrified – Stan Dec 14 '22 at 01:25
  • @ShadowRanger repo is up – Stan Dec 14 '22 at 01:26
  • From memory, `#define`ing names from the standard library - such as in `#define memcpy mymemcpy` - gives undefined behaviour. One take on that is that the compiler is free to interpret a usage of `memcpy(dest,src, size)` as an actual usage of the standard `memcpy()` (so can inline it, just as it might with an intentional call of `memcpy()`). – Peter Dec 14 '22 at 02:28
  • @hlovdal: `memmove` works *as if* it copied to a temp buffer and back, so an overlapping source is fully read. (In practice this is handled by looping backwards on overlap, not actually a tmp buffer). Without `__restrict`, the OP's function would have well-defined behaviour for overlap that differed from `memmove`, since it copies a byte at a time. e.g. `my_memcpy(buf+1, buf, 10)` would repeat the first byte 10 times. So maybe it has something to do with UB-on-overlap, but not in any obvious way! – Peter Cordes Dec 15 '22 at 02:22
  • Indeed, you really want to avoid `-ffreestanding`, unless you use `-fbuiltin-memcpy` or similar to re-enable specific functions, especially if you ever write code that relies on small fixed-size `memcpy` being inlined (like for type punning). That might be one reason to use `-fno-strict-aliasing` if you have to use `-ffreestanding`, so you can type-pun with pointer-casting. Or if you're careful with `typedef uint64_t aliasing_u64 __attribute__((aligned(1),may_alias))`, you can use pointers of that type. – Peter Cordes Dec 15 '22 at 02:25
  • @PeterCordes is there a way I can find out if I'm breaking aliasing rules? I know I had in one place until I fixed it recently but I'm not sure if that was the only spot. I could try clang warn everything and hope I see something but I'm not sure if that'll warn me of all cases – Stan Dec 15 '22 at 02:32
  • I'm not sure; most violations are obvious when you're pointer-casting to anything except a compatible type, or `char` or a may_alias type like `__m128i`. GCC/clang `-fsanitize=undefined` don't detect it at runtime for an unsafe type-pun, but GCC warns at compile time: https://gcc.godbolt.org/z/16MrWWq9a . – Peter Cordes Dec 15 '22 at 02:38

2 Answers

5

An exact answer requires diving into the code transformations that GCC performs and looking at how your code is transformed by GCC. That's beyond what I can do in a reasonable amount of time, but I can show you what's going on in more general terms, without diving into GCC internals.

Here's the crazy part: when the function is compiled on its own, without being inlined into a caller, you get memcpy. When the same function is inlined into a caller, you get memmove. I'll show the results on Godbolt and then talk about how compilers work to explain it.

The Code

Here's some test code I put on Godbolt.

__attribute__ ((access(write_only, 1))) __attribute__((nonnull(1, 2)))
extern inline void mymemcpy(void *__restrict__ dest, const void *__restrict__ src, int size)
{
    const unsigned char *s = (const unsigned char*)src;
    unsigned char *d = (unsigned char*)dest;
    while(size--) *d++ = *s++;
}

void test(void *dest, const void *src, int size)
{
    mymemcpy(dest, src, size);
}

Here's the resulting assembly

mymemcpy:
        test    edx, edx
        je      .L1
        mov     edx, edx
        jmp     memcpy
.L1:
        ret
test:
        test    edx, edx
        je      .L4
        mov     edx, edx
        jmp     memmove
.L4:
        ret

You can see that the same loop is converted to either memcpy or memmove. These are not two pieces of similar code; it is one function, transformed differently depending on whether or not it is inlined. Why?

How Optimization Passes Work

You might think of a C compiler as doing something like this:

  1. Preprocess + tokenize source files,

  2. Parse to create AST,

  3. Type check,

  4. Optimize,

  5. Emit code.

In reality, that "optimization" item is many different passes through the code, and each of those passes modifies the code in a different way. These passes happen at different times during compilation, and some optimization passes may happen multiple times.

The order in which specific optimization passes occur affects the results. If you perform optimization X and then optimization Y, you get a different result from doing Y and then X. Maybe one transformation propagates information from one part of the program to another, and then a different transformation acts on that information.

Why is this relevant here?

You can see here that src and dest are both restrict pointers. Since they are restrict, GCC "should" be able to know that memcpy is acceptable and that memmove is not necessary.

However, that means that the information that src and dest are restrict pointers must be propagated to the loop which is ultimately transformed into memmove or memcpy, and that information must be propagated before the transformation takes place. You could easily first transform the loop into memmove and then, later, figure out that the arguments are restrict, but it's too late!
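For reference, here is a minimal hosted sketch (not part of the original test case) of the overlap behaviour that restrict promises will never happen. The helper names forward_copy, overlap_with_loop, overlap_with_memmove and check_overlap_demo are made up for illustration, and the loop deliberately omits __restrict__ so that the overlapping call is well defined.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Forward byte-at-a-time copy, deliberately WITHOUT __restrict__, so that
// overlapping arguments have well-defined (if surprising) behaviour.
static void forward_copy(void *dest, const void *src, std::size_t size)
{
    const unsigned char *s = static_cast<const unsigned char *>(src);
    unsigned char *d = static_cast<unsigned char *>(dest);
    while (size--) *d++ = *s++;
}

// Copy 10 bytes from buf to buf + 1 using the forward loop.
std::string overlap_with_loop()
{
    char buf[] = "abcdefghijk";
    forward_copy(buf + 1, buf, 10);
    return buf;
}

// Same call shape, but through the library memmove.
std::string overlap_with_memmove()
{
    char buf[] = "abcdefghijk";
    std::memmove(buf + 1, buf, 10);
    return buf;
}

// Hypothetical self-check: the two results differ on overlapping buffers.
void check_overlap_demo()
{
    assert(overlap_with_loop() == "aaaaaaaaaaa");     // first byte repeated
    assert(overlap_with_memmove() == "aabcdefghij");  // source preserved
}
```

The forward loop smears the first byte across the buffer, while memmove preserves the source bytes, so the two are only interchangeable when the buffers cannot overlap.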

It looks like, somehow, the information that src and dest are restrict is getting lost when the function is inlined. This gives us a few theories for why this might happen:

  • Maybe the propagation of restrict is somehow broken after inlining, due to a bug.

  • Maybe GCC infers restrict from the calling function after inlining, under the assumption that the calling function has more context than the function being inlined.

  • Maybe the optimization passes don't happen in the right order here for the restrict to propagate to the loop. Maybe that information propagates, and then inlining is performed afterwards, and then the loop optimization happens after that.

Optimization passes (code transformation passes) are sensitive to reordering, after all. This is an extremely complicated area of compiler design.
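The ordering theory can be illustrated with a toy model. This is only a sketch of the idea that a pass decides using whatever facts exist when it runs; it does not mirror GCC's real passes or data structures, and every name in it (LoopState, propagate_restrict, recognize_copy_loop, run_pipeline, check_pass_order) is hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Toy stand-in for a compiler's view of one copy loop.
struct LoopState {
    bool declared_restrict = true;  // what the source code says
    bool no_alias_known = false;    // what the optimizer has learned so far
    std::string lowered_to;         // empty until the loop becomes a call
};

// "Pass" that makes the restrict fact visible to later passes.
void propagate_restrict(LoopState &loop)
{
    if (loop.declared_restrict) loop.no_alias_known = true;
}

// "Pass" that replaces the loop with a library call, using only the facts
// available at the moment it runs. It never revisits its decision.
void recognize_copy_loop(LoopState &loop)
{
    if (!loop.lowered_to.empty()) return;
    loop.lowered_to = loop.no_alias_known ? "memcpy" : "memmove";
}

using Pass = std::function<void(LoopState &)>;

// Run the passes in the given order and report which call was chosen.
std::string run_pipeline(const std::vector<Pass> &passes)
{
    LoopState loop;
    for (const Pass &pass : passes) pass(loop);
    return loop.lowered_to;
}

// Same two passes, two orders, two different results.
void check_pass_order()
{
    assert(run_pipeline({propagate_restrict, recognize_copy_loop}) == "memcpy");
    assert(run_pipeline({recognize_copy_loop, propagate_restrict}) == "memmove");
}
```

In the second order the fact arrives after the decision has been made, which is the "too late" situation described above. Whether anything like this actually happens inside GCC for this test case is not established here.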

Disabling The Optimization

Use -fno-tree-loop-distribute-patterns, or use a pragma:

#pragma GCC optimize ("no-tree-loop-distribute-patterns")
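If changing the flags for a whole file is too broad, GCC also has a per-function optimize attribute that takes the same option string. The following is a sketch under that assumption; the macro NO_LOOP_TO_LIBCALL and the names copy_bytes and copy_bytes_roundtrip are made up for illustration. Clang does not implement this attribute, so the macro expands to nothing there.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// GCC-only: per-function form of -fno-tree-loop-distribute-patterns.
#if defined(__GNUC__) && !defined(__clang__)
#define NO_LOOP_TO_LIBCALL \
    __attribute__((optimize("no-tree-loop-distribute-patterns")))
#else
#define NO_LOOP_TO_LIBCALL
#endif

// Hypothetical helper; the body is the same byte loop as in the question.
NO_LOOP_TO_LIBCALL
void copy_bytes(void *__restrict__ dest, const void *__restrict__ src,
                std::size_t size)
{
    const unsigned char *s = static_cast<const unsigned char *>(src);
    unsigned char *d = static_cast<unsigned char *>(dest);
    while (size--) *d++ = *s++;
}

// Small functional check: the bytes arrive unchanged.
bool copy_bytes_roundtrip()
{
    char out[6] = {};
    copy_bytes(out, "hello", sizeof out);  // 5 letters plus the terminator
    return std::strcmp(out, "hello") == 0;
}
```

This only checks that the copy is correct, not which instructions were emitted; confirm with objdump that no memcpy or memmove call appears, especially once the function is inlined into a caller. GCC's documentation describes the optimize attribute as intended for debugging rather than production use, so the command-line flag may be the safer choice.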
Dietrich Epp
  • Nice find. Maybe I'll handwrite a good implementation and try my luck – Stan Dec 14 '22 at 01:37
  • That's what I decided to do--implement my own `memcpy` variant in assembly, rather than fight the compiler. I'm doing this because I'm making a Game Boy Advance game. I'm using `-nodefaultlibs`, and I just made the core `memcmp`, `memset`, `memcpy`, and `memmove` functions in assembler. – Dietrich Epp Dec 14 '22 at 01:42
  • `memmove` works *as if* it copied to a temp buffer and back, so an overlapping source isn't overwritten early. (In practice this is handled by looping backwards on overlap, not actually a tmp buffer). Without `__restrict`, the OP's function would have well-defined behaviour for overlap that differed from `memmove`, since it copies a byte at a time. e.g. `my_memcpy(buf+1, buf, 10)` would repeat the first byte 10 times. Removing `__restrict` makes GCC copy a byte at a time, correctly avoiding recognizing it as a memcpy or memmove (https://gcc.godbolt.org/z/M8jMWcPxs) – Peter Cordes Dec 15 '22 at 02:28
  • @PeterCordes: Exactly right! `memmove` would not be a valid optimization without restrict. I wonder what's going on here. – Dietrich Epp Dec 15 '22 at 07:10
0

Simply use the -fno-builtin command-line option.

https://godbolt.org/z/3Ys1s9jPr

0___________
  • Note that you may prefer `-fno-tree-loop-distribute-patterns`, which still allows GCC to recognize the use of built-in functions. – Dietrich Epp Dec 14 '22 at 01:24