
I wrote a simple test (code at the bottom) to benchmark the performance of memcpy on my 64-bit Debian system. When compiled as a 64-bit binary it gives a consistent 38-40 GB/s across all block sizes. However, when built as a 32-bit binary on the same system, the copy performance is abysmal.

I wrote my own memcpy implementation in assembler that leverages SIMD, and it is able to match the 64-bit performance. I am honestly shocked that my own memcpy is so much faster than the native one; surely something must be wrong with the 32-bit libc build.

32-bit memcpy test results

0x00100000 B, 0.034215 ms, 29227.06 MB/s (16384 iterations)
0x00200000 B, 0.033453 ms, 29892.56 MB/s ( 8192 iterations)
0x00300000 B, 0.048710 ms, 20529.48 MB/s ( 5461 iterations)
0x00400000 B, 0.049187 ms, 20330.54 MB/s ( 4096 iterations)
0x00500000 B, 0.058945 ms, 16965.01 MB/s ( 3276 iterations)
0x00600000 B, 0.060735 ms, 16465.01 MB/s ( 2730 iterations)
0x00700000 B, 0.068973 ms, 14498.34 MB/s ( 2340 iterations)
0x00800000 B, 0.078325 ms, 12767.34 MB/s ( 2048 iterations)
0x00900000 B, 0.099801 ms, 10019.92 MB/s ( 1820 iterations)
0x00a00000 B, 0.111160 ms,  8996.04 MB/s ( 1638 iterations)
0x00b00000 B, 0.120044 ms,  8330.31 MB/s ( 1489 iterations)
0x00c00000 B, 0.116506 ms,  8583.26 MB/s ( 1365 iterations)
0x00d00000 B, 0.120322 ms,  8311.06 MB/s ( 1260 iterations)
0x00e00000 B, 0.114424 ms,  8739.40 MB/s ( 1170 iterations)
0x00f00000 B, 0.128843 ms,  7761.37 MB/s ( 1092 iterations)
0x01000000 B, 0.118122 ms,  8465.85 MB/s ( 1024 iterations)
0x08000000 B, 0.140218 ms,  7131.76 MB/s (  128 iterations)
0x10000000 B, 0.115596 ms,  8650.85 MB/s (   64 iterations)
0x20000000 B, 0.115325 ms,  8671.16 MB/s (   32 iterations)

64-bit memcpy test results

0x00100000 B, 0.022237 ms, 44970.48 MB/s (16384 iterations)
0x00200000 B, 0.022293 ms, 44856.77 MB/s ( 8192 iterations)
0x00300000 B, 0.021729 ms, 46022.49 MB/s ( 5461 iterations)
0x00400000 B, 0.028348 ms, 35275.28 MB/s ( 4096 iterations)
0x00500000 B, 0.026118 ms, 38288.08 MB/s ( 3276 iterations)
0x00600000 B, 0.026161 ms, 38225.15 MB/s ( 2730 iterations)
0x00700000 B, 0.026199 ms, 38169.68 MB/s ( 2340 iterations)
0x00800000 B, 0.026236 ms, 38116.22 MB/s ( 2048 iterations)
0x00900000 B, 0.026090 ms, 38329.50 MB/s ( 1820 iterations)
0x00a00000 B, 0.026085 ms, 38336.39 MB/s ( 1638 iterations)
0x00b00000 B, 0.026079 ms, 38345.59 MB/s ( 1489 iterations)
0x00c00000 B, 0.026147 ms, 38245.75 MB/s ( 1365 iterations)
0x00d00000 B, 0.026033 ms, 38412.69 MB/s ( 1260 iterations)
0x00e00000 B, 0.026037 ms, 38407.40 MB/s ( 1170 iterations)
0x00f00000 B, 0.026019 ms, 38433.80 MB/s ( 1092 iterations)
0x01000000 B, 0.026041 ms, 38401.61 MB/s ( 1024 iterations)
0x08000000 B, 0.026123 ms, 38280.89 MB/s (  128 iterations)
0x10000000 B, 0.026083 ms, 38338.70 MB/s (   64 iterations)
0x20000000 B, 0.026116 ms, 38290.93 MB/s (   32 iterations)

custom 32-bit memcpy

0x00100000 B, 0.026807 ms, 37303.21 MB/s (16384 iterations)
0x00200000 B, 0.026500 ms, 37735.59 MB/s ( 8192 iterations)
0x00300000 B, 0.026810 ms, 37300.04 MB/s ( 5461 iterations)
0x00400000 B, 0.026214 ms, 38148.05 MB/s ( 4096 iterations)
0x00500000 B, 0.026738 ms, 37399.74 MB/s ( 3276 iterations)
0x00600000 B, 0.026035 ms, 38409.15 MB/s ( 2730 iterations)
0x00700000 B, 0.026262 ms, 38077.29 MB/s ( 2340 iterations)
0x00800000 B, 0.026190 ms, 38183.00 MB/s ( 2048 iterations)
0x00900000 B, 0.026287 ms, 38042.18 MB/s ( 1820 iterations)
0x00a00000 B, 0.026263 ms, 38076.66 MB/s ( 1638 iterations)
0x00b00000 B, 0.026162 ms, 38223.48 MB/s ( 1489 iterations)
0x00c00000 B, 0.026189 ms, 38183.45 MB/s ( 1365 iterations)
0x00d00000 B, 0.026012 ms, 38444.52 MB/s ( 1260 iterations)
0x00e00000 B, 0.026089 ms, 38330.05 MB/s ( 1170 iterations)
0x00f00000 B, 0.026373 ms, 37917.10 MB/s ( 1092 iterations)
0x01000000 B, 0.026304 ms, 38016.85 MB/s ( 1024 iterations)
0x08000000 B, 0.025958 ms, 38523.59 MB/s (  128 iterations)
0x10000000 B, 0.025992 ms, 38473.84 MB/s (   64 iterations)
0x20000000 B, 0.026020 ms, 38431.96 MB/s (   32 iterations)

Test Program

(compile with: gcc -m32 -march=native -O3)

#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <stdint.h>
#include <malloc.h>

static inline uint64_t nanotime()
{
  struct timespec time;
  clock_gettime(CLOCK_MONOTONIC_RAW, &time);
  return ((uint64_t)time.tv_sec * 1e9) + time.tv_nsec;
}

void test(const int size)
{
  char * buffer1 = memalign(128, size);
  char * buffer2 = memalign(128, size);

  for(int i = 0; i < size; ++i)
    buffer2[i] = i;

  uint64_t t           = nanotime();
  const uint64_t loops = (16384LL * 1048576LL) / size;
  for(uint64_t i = 0; i < loops; ++i)
    memcpy(buffer1, buffer2, size);
  // average time per MiB copied, in milliseconds; 1000.0 / ms is then MB/s
  double ms = (((float)(nanotime() - t) / loops) / 1000000.0f) / (size / 1024 / 1024);
  printf("0x%08x B, %8.6f ms, %8.2f MB/s (%5llu iterations)\n", size, ms, 1000.0 / ms, loops);

  // prevent the compiler from trying to optimize out the copy
  if (buffer1[0] == 0x0)
    return;

  free(buffer1);
  free(buffer2);
}

int main(int argc, char * argv[])
{
  for(int i = 0; i < 16; ++i)
    test((i+1) * 1024 * 1024);

  test(128 * 1024 * 1024);
  test(256 * 1024 * 1024);
  test(512 * 1024 * 1024);
  return 0;
}

Edit

  • Tested on a Ryzen 7 and a Threadripper 1950X
  • glibc: 2.27
  • gcc: 7.3.0

perf results:

  99.68%  x32.n.bin  x32.n.bin          [.] test
   0.28%  x32.n.bin  [kernel.kallsyms]  [k] clear_page_rep
   0.01%  x32.n.bin  [kernel.kallsyms]  [k] get_page_from_freelist
   0.01%  x32.n.bin  [kernel.kallsyms]  [k] __mod_node_page_state
   0.01%  x32.n.bin  [kernel.kallsyms]  [k] page_fault
   0.00%  x32.n.bin  [kernel.kallsyms]  [k] default_send_IPI_single
   0.00%  perf_4.17  [kernel.kallsyms]  [k] __x86_indirect_thunk_r14

custom SSE implementation

inline static void memcpySSE(void *dst, const void * src, size_t length)
{
#if (defined(__x86_64__) || defined(__i386__))
  if (length == 0 || dst == src)
    return;

#ifdef __x86_64__
  // `end` marks the region handled by the 256-byte main loop; `off` is the
  // byte offset to jump to inside BlockTable below so that only the remaining
  // 16-byte blocks are copied. The seven table entries that need 32-bit
  // displacements (offsets 0x80-0xE0, xmm8-xmm14) encode to 16 bytes of code
  // each; the remaining entries encode to 10 bytes.
  const void * end = dst + (length & ~0xFF);
  size_t off = (15 - ((length & 0xFF) >> 4));
  off = (off < 8) ? off * 16 : 7 * 16 + (off - 7) * 10;
#else
  // 128-byte main loop; every BlockTable entry encodes to 10 bytes on i386
  const void * end = dst + (length & ~0x7F);
  const size_t off = (7 - ((length & 0x7F) >> 4)) * 10;
#endif

#ifdef __x86_64__
  #define REG "rax"
#else
  #define REG "eax"
#endif

  __asm__ __volatile__ (
   "cmp         %[dst],%[end] \n\t"
   "je          Remain_%= \n\t"

   // perform SIMD block copy
   "loop_%=: \n\t"
   "vmovaps     0x00(%[src]),%%xmm0  \n\t"
   "vmovaps     0x10(%[src]),%%xmm1  \n\t"
   "vmovaps     0x20(%[src]),%%xmm2  \n\t"
   "vmovaps     0x30(%[src]),%%xmm3  \n\t"
   "vmovaps     0x40(%[src]),%%xmm4  \n\t"
   "vmovaps     0x50(%[src]),%%xmm5  \n\t"
   "vmovaps     0x60(%[src]),%%xmm6  \n\t"
   "vmovaps     0x70(%[src]),%%xmm7  \n\t"
#ifdef __x86_64__
   "vmovaps     0x80(%[src]),%%xmm8  \n\t"
   "vmovaps     0x90(%[src]),%%xmm9  \n\t"
   "vmovaps     0xA0(%[src]),%%xmm10 \n\t"
   "vmovaps     0xB0(%[src]),%%xmm11 \n\t"
   "vmovaps     0xC0(%[src]),%%xmm12 \n\t"
   "vmovaps     0xD0(%[src]),%%xmm13 \n\t"
   "vmovaps     0xE0(%[src]),%%xmm14 \n\t"
   "vmovaps     0xF0(%[src]),%%xmm15 \n\t"
#endif
   "vmovntdq    %%xmm0 ,0x00(%[dst]) \n\t"
   "vmovntdq    %%xmm1 ,0x10(%[dst]) \n\t"
   "vmovntdq    %%xmm2 ,0x20(%[dst]) \n\t"
   "vmovntdq    %%xmm3 ,0x30(%[dst]) \n\t"
   "vmovntdq    %%xmm4 ,0x40(%[dst]) \n\t"
   "vmovntdq    %%xmm5 ,0x50(%[dst]) \n\t"
   "vmovntdq    %%xmm6 ,0x60(%[dst]) \n\t"
   "vmovntdq    %%xmm7 ,0x70(%[dst]) \n\t"
#ifdef __x86_64__
   "vmovntdq    %%xmm8 ,0x80(%[dst]) \n\t"
   "vmovntdq    %%xmm9 ,0x90(%[dst]) \n\t"
   "vmovntdq    %%xmm10,0xA0(%[dst]) \n\t"
   "vmovntdq    %%xmm11,0xB0(%[dst]) \n\t"
   "vmovntdq    %%xmm12,0xC0(%[dst]) \n\t"
   "vmovntdq    %%xmm13,0xD0(%[dst]) \n\t"
   "vmovntdq    %%xmm14,0xE0(%[dst]) \n\t"
   "vmovntdq    %%xmm15,0xF0(%[dst]) \n\t"

   "add         $0x100,%[dst] \n\t"
   "add         $0x100,%[src] \n\t"
#else
   "add         $0x80,%[dst] \n\t"
   "add         $0x80,%[src] \n\t"
#endif
   "cmp         %[dst],%[end] \n\t"
   "jne         loop_%= \n\t"

   "Remain_%=: \n\t"

   // copy any remaining 16 byte blocks by computing a jump into BlockTable;
   // first load the current instruction pointer into REG
#ifdef __x86_64__
   "leaq        (%%rip), %%rax\n\t"
#else
   // i386 has no RIP-relative addressing: GetPC reads the return address off the stack
   "call        GetPC_%=\n\t"
#endif
   "Offset_%=:\n\t"
   "add         $(BlockTable_%= - Offset_%=), %%" REG "\n\t"
   "add         %[off],%%" REG " \n\t"
   "jmp         *%%" REG " \n\t"

#ifdef __i386__
  "GetPC_%=:\n\t"
  "mov (%%esp), %%eax \n\t"
  "ret \n\t"
#endif

   "BlockTable_%=:\n\t"
#ifdef __x86_64__
   "vmovaps     0xE0(%[src]),%%xmm14 \n\t"
   "vmovntdq    %%xmm14,0xE0(%[dst]) \n\t"
   "vmovaps     0xD0(%[src]),%%xmm13 \n\t"
   "vmovntdq    %%xmm13,0xD0(%[dst]) \n\t"
   "vmovaps     0xC0(%[src]),%%xmm12 \n\t"
   "vmovntdq    %%xmm12,0xC0(%[dst]) \n\t"
   "vmovaps     0xB0(%[src]),%%xmm11 \n\t"
   "vmovntdq    %%xmm11,0xB0(%[dst]) \n\t"
   "vmovaps     0xA0(%[src]),%%xmm10 \n\t"
   "vmovntdq    %%xmm10,0xA0(%[dst]) \n\t"
   "vmovaps     0x90(%[src]),%%xmm9  \n\t"
   "vmovntdq    %%xmm9 ,0x90(%[dst]) \n\t"
   "vmovaps     0x80(%[src]),%%xmm8  \n\t"
   "vmovntdq    %%xmm8 ,0x80(%[dst]) \n\t"
   "vmovaps     0x70(%[src]),%%xmm7  \n\t"
   "vmovntdq    %%xmm7 ,0x70(%[dst]) \n\t"
#endif
   "vmovaps     0x60(%[src]),%%xmm6  \n\t"
   "vmovntdq    %%xmm6 ,0x60(%[dst]) \n\t"
   "vmovaps     0x50(%[src]),%%xmm5  \n\t"
   "vmovntdq    %%xmm5 ,0x50(%[dst]) \n\t"
   "vmovaps     0x40(%[src]),%%xmm4  \n\t"
   "vmovntdq    %%xmm4 ,0x40(%[dst]) \n\t"
   "vmovaps     0x30(%[src]),%%xmm3  \n\t"
   "vmovntdq    %%xmm3 ,0x30(%[dst]) \n\t"
   "vmovaps     0x20(%[src]),%%xmm2  \n\t"
   "vmovntdq    %%xmm2 ,0x20(%[dst]) \n\t"
   "vmovaps     0x10(%[src]),%%xmm1  \n\t"
   "vmovntdq    %%xmm1 ,0x10(%[dst]) \n\t"
   "vmovaps     0x00(%[src]),%%xmm0  \n\t"
   "vmovntdq    %%xmm0 ,0x00(%[dst]) \n\t"
   "nop\n\t"
   "nop\n\t"

   : [dst]"+r" (dst),
     [src]"+r" (src)
   : [off]"r"  (off),
     [end]"r"  (end)
   : REG,
     "xmm0",
     "xmm1",
     "xmm2",
     "xmm3",
     "xmm4",
     "xmm5",
     "xmm6",
     "xmm7",
#ifdef __x86_64__
     "xmm8",
     "xmm9",
     "xmm10",
     "xmm11",
     "xmm12",
     "xmm13",
     "xmm14",
     "xmm15",
#endif
     "memory"
  );

#undef REG

  //copy any remaining bytes
  for(size_t i = (length & 0xF); i; --i)
    ((uint8_t *)dst)[length - i] =
      ((uint8_t *)src)[length - i];
#else
  memcpy(dst, src, length);
#endif
}
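
One caveat raised in the comments below: the vmovntdq stores are weakly ordered, so if the destination buffer is handed to another thread (or the thread migrates to another core) before the data is consumed, a store fence is needed. A minimal sketch of a fenced wrapper, assuming the SSE intrinsics header is available (the wrapper name is illustrative, not part of the original code):

#include <xmmintrin.h> // _mm_sfence

// Hypothetical wrapper: perform the copy, then make the weakly-ordered
// non-temporal stores globally visible before returning.
inline static void memcpySSE_fenced(void *dst, const void *src, size_t length)
{
  memcpySSE(dst, src, length);
  _mm_sfence();
}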

native memcpy with -O3 -m32 -march=znver1

  cmp ebx, 4
  jb .L56
  mov ecx, DWORD PTR [ebp+0]
  lea edi, [eax+4]
  mov esi, ebp
  and edi, -4
  mov DWORD PTR [eax], ecx
  mov ecx, DWORD PTR [ebp-4+ebx]
  mov DWORD PTR [eax-4+ebx], ecx
  mov ecx, eax
  sub ecx, edi
  sub esi, ecx
  add ecx, ebx
  shr ecx, 2
  rep movsd
  jmp .L14
Geoffrey
  • What hardware (Skylake? Ryzen?), and what glibc build? Use `perf record ./bench` and `perf report -Mintel` to find out *which* `memcpy` implementation glibc used on your system (using the dynamic linker to select an SSE2 vs. AVX vs. [ERMSB](https://stackoverflow.com/questions/43343231/enhanced-rep-movsb-for-memcpy) version). Then post the inner loop of it vs. your custom version. It's interesting that the 1MiB size only went slightly faster than the 512MiB version; I guess your single-core memory bandwidth is mostly limited by uncore latency and max concurrency, not DRAM bandwidth! – Peter Cordes May 19 '18 at 06:58
  • @PeterCordes - details added. – Geoffrey May 19 '18 at 07:12
  • So `99.68%` of the time was spent in your `test` function itself, none in `__memcpy_sse2_unaligned` or anything like that. So apparently gcc is inlining memcpy. What asm is it using? `rep movsb`? What gcc version, so we can put it on http://gcc.godbolt.org/ and see the asm output with `-O3 -m32 -march=znver1`? – Peter Cordes May 19 '18 at 07:15
  • And did all your runs end up with the same use of 2M hugepages, to minimize TLB misses? – Peter Cordes May 19 '18 at 07:16
  • gcc 7.3.0. Yes, all runs are with transparent huge pages enabled and all tests are powers of two. It is using SSE2, I will update the question with the objdump output of the loop. – Geoffrey May 19 '18 at 07:18
  • But did you check that it actually *did* use hugepages? That doesn't always happen if you don't `madvise(MADV_HUGEPAGE)` or set `/sys/kernel/mm/transparent_hugepage/defrag` to `always`, because there might not be enough contiguous physical pages if it doesn't try to defrag. (Also, `echo always >/sys/kernel/mm/transparent_hugepage/enabled` is needed if it's not set by default. Just noticed that my Arch system had it set to `madvise`; apparently the default changed from always to madvise recently.) https://www.kernel.org/doc/Documentation/vm/transhuge.txt – Peter Cordes May 19 '18 at 07:23
  • `$ cat /sys/kernel/mm/transparent_hugepage/enabled [always] madvise never `. Also `/proc/meminfo` shows hugepages are in use everywhere. Setting `defrag` to always doesn't make a difference. – Geoffrey May 19 '18 at 07:25
  • Ok. BTW, you can control-z your program while it's running and look at `/proc/PID/smaps` and check the `AnonHugePages:` on a per-mapping basis to see what actually happened. – Peter Cordes May 19 '18 at 07:33
  • Using https://gcc.godbolt.org/ (excellent tool, first time I have seen it) it is apparent that the inline memcpy is not using SSE at all. Could it be that the debian libc-i386 is not compiled with SSE support?... Confirmed, objdump shows no SSE used in the memcpy inlined. Please post this as an answer so I can give you the well deserved points for this. – Geoffrey May 19 '18 at 07:36
  • @PeterCordes you are dead on with that suggestion, `-fno-builtin-memcpy` fixes the problem. Please post it as your answer! – Geoffrey May 19 '18 at 07:44
  • Don't you need a `sfence` after all the SSE NT stores? – rustyx May 19 '18 at 08:06
  • @rustyx not in the target application as it continues to use SIMD directly after this is called. – Geoffrey May 19 '18 at 08:30
  • But if the thread is preempted and later continues on another core, you can end up reading stale cache lines. – rustyx May 19 '18 at 09:07
  • Ah, good call thanks, I will add this. – Geoffrey May 19 '18 at 09:46
  • FWIW, since we're on the topic of how to determine if huge-pages are used, I'll plug a [small utility](https://github.com/travisdowns/page-info) I wrote to determine that programmatically and exactly for any memory region. It's useful for this type of benchmark where you want to bail out or log if you didn't get (or, less likely, unexpectedly got) hugepages since that can pretty much invalidate results for the purposes of comparison. It requires root. – BeeOnRope May 20 '18 at 17:27
  • @BeeOnRope If that's what you're trying to achieve why not just call `mmap` with `MAP_HUGETLB`? – Geoffrey May 20 '18 at 22:09
  • @Geoffrey - MAP_HUGETLB never allocates transparent huge pages. It simply fails on 99% of systems that don't have boot-time huge pages specially configured with hugetlbfs. I'm after transparent huge pages which are always "best effort" even if you ask for them via madvise(). – BeeOnRope May 20 '18 at 22:51

2 Answers


Could it be that the debian libc-i386 is not compiled with SSE support?... Confirmed, objdump shows no SSE used in the memcpy inlined.

GCC treats memcpy as a built-in unless you use -fno-builtin-memcpy; as you saw from perf, no asm implementation in libc.so is even being called. (And gcc can't inline code out of a shared library. glibc headers only have a prototype, not an inline-asm implementation.)

Inlining memcpy as rep movs was purely GCC's idea, with gcc -O3 -m32 -march=znver1. (And the OP reports that -fno-builtin-memcpy sped up this microbenchmark, so apparently glibc's hand-written asm implementation is fine. That's expected; it's probably about the same as 64-bit, and doesn't benefit from more than 8 XMM or YMM registers.)

I would highly recommend against using -fno-builtin-memcpy in general, though, because you definitely want gcc to inline memcpy for stuff like float foo; int32_t bar; memcpy(&foo, &bar, sizeof(foo));. Or other small fixed-size cases where it can inline as a single vector load/store. You definitely want gcc to understand the memcpy just copies memory, and not treat it as an opaque function.
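
As a hedged illustration (mine, not part of the original answer): with the builtin enabled, gcc at -O2 compiles this kind of type-punning memcpy down to a single move with no library call, which is exactly what -fno-builtin-memcpy would typically give up:

#include <string.h>
#include <stdint.h>

// With the memcpy builtin enabled, gcc -O2 reduces this to a single move;
// with -fno-builtin-memcpy it would typically become a real call to memcpy.
static inline float bits_to_float(uint32_t bits)
{
  float f;
  memcpy(&f, &bits, sizeof f); // well-defined type punning
  return f;
}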

The long-term solution is for gcc to not inline memcpy as rep movs on Zen; apparently that's not a good tuning decision when copies can be large. IDK if it's good for small copies; Intel has significant startup overhead.

The short-term solution is to manually call your custom memcpy (or somehow call non-builtin glibc memcpy) for copies you know are usually large, but let gcc use its builtin for other cases. The super-ugly way would be to use -fno-builtin-memcpy and then use __builtin_memcpy instead of memcpy for small copies.
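
A hedged sketch of the "super-ugly" variant: build with -fno-builtin-memcpy so that memcpy is an opaque call into glibc's asm implementation, and spell out __builtin_memcpy where inlining of small fixed-size copies is still wanted (function names are illustrative only):

// Build with: gcc -m32 -O3 -march=znver1 -fno-builtin-memcpy
#include <string.h>
#include <stdint.h>

static void copy_block(void *dst, const void *src, size_t len)
{
  memcpy(dst, src, len); // now a real call into glibc, good for large copies
}

static uint32_t load_unaligned_u32(const void *p)
{
  uint32_t v;
  __builtin_memcpy(&v, p, sizeof v); // unaffected by -fno-builtin-memcpy, still inlined
  return v;
}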


It looks like for large buffers, rep movs isn't great on Ryzen compared to NT stores. On Intel, I think rep movs is supposed to use a no-RFO protocol similar to NT stores, but maybe AMD is different.

Enhanced REP MOVSB for memcpy only mentions Intel, but it does have some details about bandwidth being limited by memory / L3 latency and max concurrency, rather than actual DRAM controller bandwidth limits.


BTW, does your custom version even check a size threshold before choosing to use NT stores? NT stores suck for small to medium buffers if the data is going to be reloaded again right away; it will have to come from DRAM instead of being an L1d hit.
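
For reference, a minimal sketch of such a threshold check, with an invented cutoff (a real value would need measuring; memcpySSE is the question's custom routine):

#include <string.h>

// Invented threshold: below this, a cached copy (plain memcpy) wins because
// the data is likely to be reloaded soon; above it, the NT-store path avoids
// polluting the cache.
#define NT_COPY_THRESHOLD (2u * 1024 * 1024)

static void copy_dispatch(void *dst, const void *src, size_t len)
{
  if (len >= NT_COPY_THRESHOLD)
    memcpySSE(dst, src, len); // custom NT-store copy from the question
  else
    memcpy(dst, src, len);    // small/medium copies stay in cache
}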

Peter Cordes
  • Re NT stores, no, because the smallest copy size I am dealing with in my application is 4MB and as such it doesn't warrant the extra checks. – Geoffrey May 19 '18 at 07:53
  • @Geoffrey: Ok, so you weren't intending to write a general-purpose memcpy. That's fine, but maybe name it `large_copy` or `memcpy_large` then, because it sucks for small copies. – Peter Cordes May 19 '18 at 07:55

I guess it could be some CPU cache issue. Remember that access to data in the L1 cache is more than a hundred times faster than access to data in your DRAM modules.

The first time any memcpy (yours or the system one) is called, it brings that memory zone into the cache (probably even into the L1 cache). And a block copy has maximal locality.

You should change your code to call the same memcpy several times on the same memory zone, and measure the highest, lowest, and average times of these calls. You'll be surprised.
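
A minimal sketch of that measurement, reusing the nanotime() helper from the question's test program (the repetition count of 64 is arbitrary):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

// nanotime() as defined in the question's test program.
void measure(char *dst, const char *src, size_t size)
{
  uint64_t lo = UINT64_MAX, hi = 0, sum = 0;
  for (int i = 0; i < 64; ++i)
  {
    const uint64_t t0 = nanotime();
    memcpy(dst, src, size);
    const uint64_t dt = nanotime() - t0;
    if (dt < lo) lo = dt;
    if (dt > hi) hi = dt;
    sum += dt;
  }
  printf("min %llu ns, max %llu ns, avg %llu ns\n",
         (unsigned long long)lo, (unsigned long long)hi,
         (unsigned long long)(sum / 64));
}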

Otherwise, memcpy could be some __builtin_memcpy magically known to the GCC compiler, or some function provided by your libc. Both your compiler and your GNU libc are free software, so you could study their source code. You could also try some other libc, e.g. musl libc, and some other compiler such as Clang/LLVM. And you can also study the assembler code produced by your compiler (with gcc -S -O3 -fverbose-asm).

Lastly, 44 GB/s vs 29 GB/s is not, IMHO, an abysmal difference.

Basile Starynkevitch