
I thought I'd first share this here to get your opinions before doing anything else. While designing an algorithm, I found that the performance of gcc-compiled code for some simple code was catastrophic compared to clang's.

How to reproduce

Create a test.c file containing this code:

#include <sys/stat.h>
#include <sys/types.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>

int main(int argc, char *argv[]) {
    const uint64_t size = 1000000000;
    const size_t alloc_mem = size * sizeof(uint8_t);
    uint8_t *mem = (uint8_t*)malloc(alloc_mem);
    if (mem == NULL)
        return EXIT_FAILURE;

    /* Fill the buffer with a slowly increasing byte pattern so the result is deterministic. */
    for (uint_fast64_t i = 0; i < size; i++)
        mem[i] = (uint8_t) (i >> 7);

    uint8_t block = 0;
    uint_fast64_t counter = 0;
    uint64_t total = 0x123456789abcdefllu;
    uint64_t receiver = 0;

    /* For each block size from 1 to 8 bytes, walk the whole buffer block by block. */
    for (block = 1; block <= 8; block++) {
        printf("%u ...\n", block);
        counter = 0;
        while (counter < size - 8) {
            /* Copy `block` bytes into receiver... */
            __builtin_memcpy(&receiver, &mem[counter], block);
            /* ...and mask off everything above the low `block` bytes. */
            receiver &= (0xffffffffffffffffllu >> (64 - ((block) << 3)));
            total += ((receiver * 0x321654987cbafedllu) >> 48);
            counter += block;
        }
    }

    printf("=> %llu\n", total);
    free(mem);
    return EXIT_SUCCESS;
}

gcc

Compile and run:

gcc-7 -O3 test.c
time ./a.out
1 ...
2 ...
3 ...
4 ...
5 ...
6 ...
7 ...
8 ...
=> 82075168519762377

real    0m23.367s
user    0m22.634s
sys 0m0.495s

Info:

gcc-7 -v
Using built-in specs.
COLLECT_GCC=gcc-7
COLLECT_LTO_WRAPPER=/usr/local/Cellar/gcc/7.3.0/libexec/gcc/x86_64-apple-darwin17.4.0/7.3.0/lto-wrapper
Target: x86_64-apple-darwin17.4.0
Configured with: ../configure --build=x86_64-apple-darwin17.4.0 --prefix=/usr/local/Cellar/gcc/7.3.0 --libdir=/usr/local/Cellar/gcc/7.3.0/lib/gcc/7 --enable-languages=c,c++,objc,obj-c++,fortran --program-suffix=-7 --with-gmp=/usr/local/opt/gmp --with-mpfr=/usr/local/opt/mpfr --with-mpc=/usr/local/opt/libmpc --with-isl=/usr/local/opt/isl --with-system-zlib --enable-checking=release --with-pkgversion='Homebrew GCC 7.3.0' --with-bugurl=https://github.com/Homebrew/homebrew-core/issues --disable-nls
Thread model: posix
gcc version 7.3.0 (Homebrew GCC 7.3.0)

So we get about 23s of user time. Now let's do the same with cc (clang on macOS):

clang

cc -O3 test.c
time ./a.out
1 ...
2 ...
3 ...
4 ...
5 ...
6 ...
7 ...
8 ...
=> 82075168519762377

real    0m9.832s
user    0m9.310s
sys 0m0.442s

Info:

Apple LLVM version 9.0.0 (clang-900.0.39.2)
Target: x86_64-apple-darwin17.4.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

That's more than 2.5x faster! Any thoughts?

To test things out, I replaced the __builtin_memcpy call with plain memcpy; this time the compiled code runs in about 34s on both sides, which is consistent, and slower as expected.
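For reference, that variant only changes the copy line inside the inner loop (string.h is already included in the snippet above):

            /* library call instead of the builtin; the size is still a runtime value */
            memcpy(&receiver, &mem[counter], block);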

It would appear that the combination of __builtin_memcpy and bit masking is handled very differently by the two compilers. I had a look at the assembly code but, not being an asm expert, I couldn't see anything standing out that would explain such a drop in performance.
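For anyone who wants to look as well, the assembly for both compilers can be dumped with something like this (the output file names are just examples):

gcc-7 -O3 -S -o test_gcc.s test.c
cc -O3 -S -o test_clang.s test.c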


Edit 03-05-2018: Posted this bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719.


gpnuma
  • Maybe `__builtin_memcpy` isn't the same as `memcpy` and clang's is faster? What about a simple `memcpy` bench (without the masks)? – Jean-François Fabre Mar 04 '18 at 17:57
  • Yes, that's definitely the case and it is expected, but the described performance drop isn't! – gpnuma Mar 04 '18 at 17:58
  • related: https://stackoverflow.com/questions/11747891/when-builtin-memcpy-is-replaced-with-libcs-memcpy – Jean-François Fabre Mar 04 '18 at 17:59
  • Very good first question BTW, it's rare. And a _stellar_ first Sunday evening question. – Jean-François Fabre Mar 04 '18 at 17:59
  • Sorry to put some salt on it: it is indeed an interesting question. Nevertheless, I think it is too broad and information is missing, e.g. the machine code (and even if it were given, it would still be too broad). It is no news that different compilers generate different code, sometimes differing by orders of magnitude in size and run-time. Things become worse if libraries are involved. As general advice: check the machine code and the implementation/machine code of `__builtin_memcpy` for both compilers' libraries. – too honest for this site Mar 04 '18 at 18:23
  • Maybe related to loop expansion. clang's loop expansion is significantly deeper than gcc's. – llllllllll Mar 04 '18 at 18:27
  • Sidenote: reducing the types used and using the correct types (e.g. `malloc` takes a `size_t`, not a `uint64_t` or larger) would make the code easier to comprehend. I'm not sure there is no UB from the shifts, which would make any comparison moot. – too honest for this site Mar 04 '18 at 18:50
  • Olaf, thanks for your answer! Yes, the code is not ideal, but it is a testing snippet, so please don't mind. Do you want the asm code or the actual machine code generated? If it's the asm, I created a gist here: https://gist.github.com/gpnuma/0d67d1a19bbd337ac1fe1945985cb264. – gpnuma Mar 04 '18 at 21:41
  • Have you tried running it through Godbolt? – Stephen Newell Mar 04 '18 at 21:58
  • Are they both 64-bit binaries? What's the output from `file` on both executables? – Andrew Henle Mar 04 '18 at 21:58
  • Stephen, yes, I just edited the code slightly so it would compile without any errors by default in Compiler Explorer; it's easier to track the generated asm. Andrew, yes, they are both 64-bit binaries; the output of `file` is `a.out: Mach-O 64-bit executable x86_64` for both. – gpnuma Mar 04 '18 at 22:27
  • Try replacing 'while' with 'for'. I've seen situations where C++ compilers were able to optimize 'for' loops better than the 'while' equivalent. – Bulat Mar 04 '18 at 23:34
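If I understand that last suggestion, the inner loop of the testcase would be rewritten along these lines (same body, just a for instead of a while, a sketch I have not benchmarked):

        for (counter = 0; counter < size - 8; counter += block) {
            __builtin_memcpy(&receiver, &mem[counter], block);
            receiver &= (0xffffffffffffffffllu >> (64 - ((block) << 3)));
            total += ((receiver * 0x321654987cbafedllu) >> 48);
        }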

1 Answer


I find it suspicious that you get different code for memcpy vs __builtin_memcpy. I don't think that's supposed to happen, and indeed I cannot reproduce it on my (linux) system.

If you add #pragma GCC unroll 16 (implemented in gcc-8+) before the for loop, gcc gets the same performance as clang (making block a constant is essential to optimizing the code), so essentially llvm's unrolling is more aggressive than gcc's, which can be good or bad depending on the case. Still, feel free to report it to gcc; maybe they'll tweak the unrolling heuristics some day, and an extra testcase could help.
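For example, a minimal sketch of the placement in the testcase above (this needs gcc-8 or later; the idea is that once the 8-iteration loop is unrolled, block is a compile-time constant in each copy, so the __builtin_memcpy size is constant too):

    #pragma GCC unroll 16                 /* gcc-8+: ask gcc to unroll the next loop */
    for (block = 1; block <= 8; block++) {
        printf("%u ...\n", block);
        counter = 0;
        while (counter < size - 8) {
            __builtin_memcpy(&receiver, &mem[counter], block);
            receiver &= (0xffffffffffffffffllu >> (64 - ((block) << 3)));
            total += ((receiver * 0x321654987cbafedllu) >> 48);
            counter += block;
        }
    }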

Once unrolling is taken care of, gcc does OK for some values (block equal to 4 or 8 in particular), but much worse for some others, in particular 3. But that's better analyzed with a smaller testcase without the loop on block. gcc seems to have trouble with memcpy(,,3); it works much better if you always read 8 bytes (the next line already takes care of the extra bytes, IIUC). Another thing that could be reported to gcc.
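Concretely, a rough sketch of that change, reusing the names from the testcase above (the loop bound counter < size - 8 already keeps a fixed 8-byte read inside the buffer, and the mask discards the bytes beyond block):

        while (counter < size - 8) {
            __builtin_memcpy(&receiver, &mem[counter], 8);    /* always copy 8 bytes (constant size) */
            receiver &= (0xffffffffffffffffllu >> (64 - ((block) << 3)));  /* keep only the low `block` bytes */
            total += ((receiver * 0x321654987cbafedllu) >> 48);
            counter += block;                                 /* still advance by `block` bytes */
        }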

Marc Glisse
  • Hey Marc, thanks. I tried your solution and it doesn't change anything here if I add `#pragma GCC unroll 16` before the for loop. Could you please confirm that no other flag is required? – gpnuma Mar 04 '18 at 22:24
  • You're quite right that making block a constant should improve optimization; I had thought about that and used preprocessor macros to make block constant (generating 8 different loops), but even then I still see a big performance gap between gcc 7.3 and clang. So I'm not completely convinced that loop unrolling is the whole problem here. – gpnuma Mar 04 '18 at 22:45
  • I actually just retested it. If I do this: https://gist.github.com/gpnuma/879dc522675a75dafb69c0b117742f3f (which is horrible coding anyway), both print => 82075168519762377, with gcc at real 0m9.413s, user 0m8.931s, sys 0m0.382s and clang at real 0m3.096s, user 0m2.599s, sys 0m0.395s, so clang is more than **3x faster** now. – gpnuma Mar 04 '18 at 22:50
  • And just to make sure `block` is interpreted as a constant by the compiler, I also tested this: https://gist.github.com/gpnuma/f8b175ecee57f1cfb8af90f4618bd2eb and still get the same result of clang's code being 3x faster... – gpnuma Mar 04 '18 at 22:55
  • Marc, you're absolutely right: the workaround I found for this was to actually read 8 bytes **all the time**, and that made the code compiled by gcc on par with clang's, but it still doesn't "explain" why this is failing in the first place. – gpnuma Mar 04 '18 at 23:21
  • Because gcc doesn't cleverly handle memcpy when the size doesn't match an actual type? At some point the answer can't really be more precise than "because no one implemented this in gcc"... – Marc Glisse Mar 04 '18 at 23:28