
I was introduced to the SIMD instruction set only recently, and as one of my pet projects I thought I would use it to implement memcpy and see whether it performs better than the standard memcpy. What I observe is that the standard memcpy always performs better than my SIMD-based custom memcpy. I expected SIMD to have some advantage here. My code and compilation instructions are posted below:

Compilation command:

g++ --std=c++11 memcpy_test.cpp  -mavx2 -O3

Code:

#include <iostream>
#include <cstdint>
#include <immintrin.h>
#include <chrono>
#include <cstring>
#include <stdlib.h>

using namespace std;

void mymemcpy(char* dst, const char* src, size_t size)
{
    if (dst != src) {
        auto isAligned = [&](uint64_t address) { return (address & 0x1F) == 0; };
        if (isAligned((uint64_t)src) && isAligned((uint64_t)dst)) {
            // std::cout << "Aligned and starting copy" << std::endl;
            const __m256i *s = reinterpret_cast<const __m256i *>(src);
            __m256i *dest = reinterpret_cast<__m256i *>(dst);
            int64_t vectors = size / sizeof(*s);
            int64_t residual = size % sizeof(*s);
            uint64_t vectors_copied = 0;
            for (; vectors > 0; vectors--, s++, dest++) {
                const __m256i loaded = _mm256_stream_load_si256(s);
                _mm256_stream_si256(dest, loaded);
                vectors_copied++;
            }

            // if there are residual bytes, fall back to the usual memcpy
            // cout << "residual : " << residual << endl;
            if (residual != 0) {
                uint64_t offset = vectors_copied * sizeof(*s);
                memcpy(dst + offset, src + offset, size - offset);
            }

            _mm_sfence();
        } else {
            cout << "NOT ALIGNED" << (void *)src << (void *)dst << endl; 
            memcpy(dst, src, size);
        }
    }
}

#define DATA_MB (1 * 1024 * 1024) // 1 MiB

int main()
{
    using namespace std::chrono;
     
    char *source1 = reinterpret_cast<char *>(aligned_alloc(0x20, DATA_MB*sizeof(char))); // 32-byte aligned, 1 MiB buffer
    memset(source1, 0xF, DATA_MB*sizeof(char));
    char *destination1 = reinterpret_cast<char *>(aligned_alloc(0x20, DATA_MB*sizeof(char))); // 32-byte aligned, 1 MiB buffer
    memset(destination1, 0x00, DATA_MB*sizeof(char));
    cout << "Standard memcpy" << endl;
    auto start1 = high_resolution_clock::now();
    memcpy(destination1, source1, (DATA_MB*sizeof(char)));
    auto stop1 = high_resolution_clock::now();
    auto duration_std = duration_cast<nanoseconds>(stop1 - start1);
    cout << duration_std.count() << endl;
    free(source1);
    free(destination1);

    /* New buffers so the second test does not benefit from data already being in cache */

    char *source = reinterpret_cast<char *>(aligned_alloc(0x20, DATA_MB*sizeof(char))); // 32-byte aligned, 1 MiB buffer
    memset(source, 0xF, DATA_MB*sizeof(char));
    char *destination = reinterpret_cast<char *>(aligned_alloc(0x20, DATA_MB*sizeof(char))); // 32-byte aligned, 1 MiB buffer
    memset(destination, 0x0, DATA_MB*sizeof(char));
    cout << "Custom memcpy" << endl;
    auto start = high_resolution_clock::now();
    mymemcpy(destination, source, (DATA_MB*sizeof(char)));
    auto stop = high_resolution_clock::now();
    auto duration = duration_cast<nanoseconds>(stop - start);
    cout << duration.count() << endl;
    free(source);
    free(destination);

    cout << (duration_std.count() < duration.count()?"standard ":"custom ") << "performed better by " << abs(duration_std.count() - duration.count()) << "ns" << endl;
}

Test machine:

model name      : AMD EPYC 7282 16-Core Processor
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca

What am I doing wrong here? What could possibly cause the standard memcpy to perform better than a SIMD-based custom memcpy? I am very new to SIMD instructions and the features they provide, so please feel free to enlighten me even with the obvious.

Comments:

  • Chances are, `memcpy` already uses all the tricks you can come up with. – Aykhan Hagverdili Aug 27 '21 at 10:57
  • The compiler-provided `memcpy` isn't usually just one function. There may be many different `memcpy` implementations, including SIMD-based ones, and the compiler can generate calls to different ones depending on how it's used in the code. The functions have also been extensively optimized over many years by experts, and it's going to be very hard to create more optimal versions. – Some programmer dude Aug 27 '21 at 10:58
  • I don't quite see how memory copying could be improved by SIMD. The bottleneck is RAM access, which SIMD has nothing to do with. – ALX23z Aug 27 '21 at 11:16
  • memcpy is probably already optimized; just be sure to do release builds with optimizations switched on, and choose the correct target CPU version (one that supports the most modern vectorization) – Pepijn Kramer Aug 27 '21 at 11:23
  • If you want to free your CPU from doing memory copies, there is also DMA you can look into – Pepijn Kramer Aug 27 '21 at 11:25
  • If you try to disassemble all of this on godbolt.org, you'll see that the compiler uses a lot of other optimizations such as loop unrolling. So it optimizes the code on a case-by-case basis. – Lundin Aug 27 '21 at 11:26
  • *What could possibly cause standard memcpy to perform better than SIMD based custom memcpy?* My guess is standard memcpy is already SIMD based, and highly refined and finely tuned. – Eljay Aug 27 '21 at 11:41
  • @ALX23z memcpy is indeed limited by RAM bandwidth but you can get closer to that limit by using SIMD registers rather than copying by 1, 4 or 8 bytes at a time – Alan Birtles Aug 27 '21 at 11:53
  • Compilers are already implementing memcpy with SIMD. If you want to beat them, try ERMSB = enhanced `rep movsb` instead. Compilers are lagging behind because ERMSB was introduced in Ivy Bridge (2012), and they don’t want to cause performance degradation on old CPUs. – Soonts Aug 27 '21 at 12:13 [an inline-asm `rep movsb` sketch is included after the comments]
  • @ALX23z: Small but not tiny memcpy (like 4 to 8 KiB) can run as fast as L1d cache, if the source and dst buffers are hot in that innermost cache. It's not rare in practice for memcpy to hit in L2 or at least L3 cache, and the store buffer can help insulate some from the effects of the destination not being hot in cache. (And larger stores mean much more data in the same amount of SB entries, better insulating from stalls / cache misses when reading, similarly fewer ROB entries for loads). Also, even for RAM, scalar can barely keep up with DRAM on a desktop with only one core active. – Peter Cordes Aug 27 '21 at 13:47
  • @yashC - have a look at glibc's memcpy for x86-64: it's hand-written in asm, with dynamic dispatch based on CPU features done at dynamic link time. (Since dynamic linking involves a function pointer anyway, there's no extra overhead once the resolver function sets the GOT / PLT.GOT pointer to `memmove-avx-unaligned-erms` - https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S.html / https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S.html. Note the `__x86_shared_non_temporal_threshold` and ERMSB threshold. – Peter Cordes Aug 27 '21 at 13:50 [a user-space analogue of this dispatch is sketched after the comments]
  • (If you're finding the glibc source code hard to follow, or want to see which version of memcpy actually runs on your machine, single-step into a `memcpy` call in GDB, in a program that you compiled with `-fno-plt` for early binding. Or step into the 2nd call so you're not getting the lazy dynamic linking code. Use `layout asm` and `stepi` so you see the real instructions without needing source.) – Peter Cordes Aug 27 '21 at 13:53 [a GDB command sketch is included after the comments]
  • @Soonts: glibc does memcpy dispatching based on CPU features, and CPUID reports presence of ERMSB. So that's not the reason. The real reason is that ERMSB doesn't make it fast for small copies (fast-short-rep is a separate feature new in Ice Lake), and that it's possible to do better with SIMD loops. (With enough code bloat via unrolling; it might be better overall to just use `rep movsb` above 4k or something, even if not locally optimal, to reduce I-cache / uop-cache pollution.) See [Enhanced REP MOVSB for memcpy](https://stackoverflow.com/q/43343231) re: up and downsides. – Peter Cordes Aug 27 '21 at 13:59
  • @PeterCordes in the post they test memory copying of 2GB. I didn't refer to a `memcpy` that happens to be in cache. – ALX23z Aug 27 '21 at 15:33
  • @ALX23z: Ah yes, if we're just talking about this specific benchmark; I didn't read through the code, and there was no table of results for different sizes. Other people often forget about the cached case, and use slow RAM as an excuse to not optimize loops in general. Unless your function is called `memcpy_large`, it had better not suck for the small copies, because actual C `memcpy` has to work well for small, medium, and large copies. So I wanted to make that point in general, especially relevant here for a memcpy that unconditionally uses NT stores (disaster if you're about to reload.) – Peter Cordes Aug 27 '21 at 16:11
  • Oh, also just noticed it uses `_mm256_stream_load_si256`. That only does anything special on WC memory (e.g. copying from video RAM back to main memory), otherwise `vmovntdqa` loads are just slower version of `vmovdqa`, including on normal malloced memory (WB = write-back cacheable). At least on Intel; I don't *think* AMD does any cache pollution minimization for NT loads, probably like Intel only for NT prefetches. Unlike stores, it doesn't bypass cache entirely or even override the memory-ordering semantics of the current memory region. [related](//stackoverflow.com/q/32103968/224132) – Peter Cordes Aug 27 '21 at 16:14
  • @Someprogrammerdude: I'm not aware of a compiler that statically picks a memcpy implementation based on caller context (e.g. large fixed-size, small unaligned or whatever). For small fixed-size cases, gcc or clang will fully *inline* some mov / movdqu or whatever. The multiple versions thing in glibc is done by dispatching based on CPU features at dynamic-link time, and handling small vs. large and destination misalignment is done by runtime branching inside the hand-written asm memcpy / memmove. What you describe is possible, though; I'm curious if any compilers really do it. – Peter Cordes Aug 27 '21 at 16:18 [a tiny example of such inlining is shown after the comments]
  • @PeterCordes AFAIK Intel's compiler used to do it, but only if you compiled with `-mcpu` to tell it the immediate target. But in most cases it would either be an AVX `memcpy` or `rep stosb`, sometimes a mix of both, depending on the size of the copy. – Mgetz Aug 27 '21 at 16:40
  • @PeterCordes I read through the links and understand that if the CPU supports AVX, then at dynamic-link time a memcpy with the appropriate support is chosen. I tried the GDB exercise and see that on my system (which supports AVX, AVX2 and up to SSE4a, with glibc 2.27 and g++ 9.4), the memcpy that gets called is the `ssse3.S` one and not the AVX one as I expected. Any idea why that might be happening? – yashC Sep 02 '21 at 16:37
  • @yashC: glibc picks at *dynamic* link time (i.e. every time you run the program), unless you're using `-static` or something to not do dynamic linking at all. On my Arch Linux system if I single-step into a call to memcpy it's the AVX version. (See also [perf report shows this function "\_\_memset\_avx2\_unaligned\_erms" has overhead. does this mean memory is unaligned?](https://stackoverflow.com/q/51614543) for an example of it choosing the AVX2 version of memset) – Peter Cordes Sep 02 '21 at 18:12
  • @PeterCordes I was able to get the AVX unaligned version to work, but that was on a different system (Intel). Not sure why the dynamic linker on the AMD machine goes with SSSE3... both systems have all the (SSE and AVX) flags. Worth noting that the Intel machine also has the erms flag, though I don't think that is a hard requirement for an AVX-based memcpy. – yashC Sep 03 '21 at 07:20
  • Right, ERMSB means that `rep movsb` might be worth using; on CPUs where it's not, glibc hopefully just sets the ERMSB size threshold variable such that it won't ever be used. Perhaps your glibc version's dispatching choices haven't been updated for Zen 2; before that, AMD split 256-bit loads into two 128-bit halves, so dispatching to latest non-AVX version made sense. (32-byte vectors won't help bandwidth on Zen 1, and can make it worse if unaligned, and/or for small-size copies.) But YMM vectors will help on your Zen2. – Peter Cordes Sep 03 '21 at 07:36
  • So yes, you could work around that with manually-vectorized YMM loops, at least if you avoid [Why doesn't gcc resolve \_mm256\_loadu\_pd as single vmovupd?](https://stackoverflow.com/q/52626726) GCC's bad tune=generic defaults for modern CPUs. But if you're going to use NT stores, make sure you don't use this for small copies. (And prefer unrolling into at least whole-cache-line blocks of loads and stores). Also, for very large copies bound by DRAM bandwidth not cache, 16-byte vectors are barely any slower. – Peter Cordes Sep 03 '21 at 07:41 [a loop sketch incorporating these suggestions follows the comments]
  • Also of course don't use NT loads like I mentioned earlier; on normal (WB) memory regions they just fall back to a slower version of a normal load. – Peter Cordes Sep 03 '21 at 07:43
  • @PeterCordes let me process that information. I have been doing another experiment: why not write a loop that copies byte by byte, compile with the -march=native flag, and let g++ do the optimization? Results: on Intel it performs better for small sizes (<100 KB), with a significant gain (30% to 1000%), and the smaller the size the better the performance; after that it is the same as memcpy. But on AMD there is little gain, and after some point performance degrades. Interesting. I don't think it is very smart to take this approach, though. – yashC Sep 03 '21 at 08:18 [the byte-copy loop in question is sketched after the comments]
  • *Why not write a loop to copy byte by byte* - GCC will recognize that as a memcpy. For small compile-time-constant sizes, GCC will inline memcpy (hopefully with SIMD instructions), otherwise emit a call to the function. Check the generated asm; it's possible GCC's inline memcpy strategy is poorly tuned for AMD. Especially if `-march=native` doesn't actually recognize your CPU specifically, in which case it will enable the features but use `tune=generic`. But that shouldn't be the case; g++9.4 recognizes `-march=znver2` https://godbolt.org/z/d3bf6bKe6 – Peter Cordes Sep 03 '21 at 08:37
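
For anyone who wants to experiment with the suggestions in the comments, a few sketches follow. They are editorial illustrations of the techniques mentioned above, not tuned or benchmarked implementations, and every function name in them is made up for the example.

First, the ERMSB route Soonts mentions: a minimal `rep movsb` wrapper using GCC/Clang inline asm on x86-64. Whether this beats a vector loop depends on copy size and CPU.

#include <cstddef>

// Copy n bytes with `rep movsb`. On CPUs with ERMSB this microcoded copy can
// approach the bandwidth of an optimized memcpy for medium and large sizes.
// Like memcpy, it assumes the buffers do not overlap. GCC/Clang on x86-64 only.
inline void repmovsb_copy(void* dst, const void* src, std::size_t n)
{
    asm volatile("rep movsb"
                 : "+D"(dst), "+S"(src), "+c"(n)
                 :
                 : "memory");
}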
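
Second, a rough user-space analogue of the dynamic-link-time dispatch Peter Cordes describes for glibc. This is not how glibc actually does it (glibc uses a hand-written ifunc resolver); it is just the same idea expressed with GCC's `__builtin_cpu_supports`, and both copy routines here are hypothetical placeholders.

#include <cstring>
#include <cstddef>

using copy_fn = void (*)(char*, const char*, std::size_t);

// Placeholder bodies: imagine a YMM loop like the one in the question behind
// copy_avx2, and a plain scalar loop behind copy_scalar.
static void copy_avx2(char* dst, const char* src, std::size_t n)   { std::memcpy(dst, src, n); }
static void copy_scalar(char* dst, const char* src, std::size_t n) { std::memcpy(dst, src, n); }

// Pick an implementation once, based on CPU features, and cache the choice in a
// function pointer, loosely analogous to glibc filling in the GOT entry for memcpy.
static copy_fn resolve_copy()
{
    __builtin_cpu_init();   // needed because this runs before main()
    return __builtin_cpu_supports("avx2") ? copy_avx2 : copy_scalar;
}

static copy_fn g_copy = resolve_copy();   // resolved once at program start-up

Calling g_copy(dst, src, n) afterwards goes through whichever implementation was selected, with no per-call feature check.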
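
Third, the GDB exercise for checking which glibc variant actually runs. Roughly the following; the exact symbol you land in varies with glibc version and CPU (names like __memmove_avx_unaligned_erms or __memcpy_ssse3_back are examples, not guarantees):

g++ --std=c++11 memcpy_test.cpp -mavx2 -O3 -g -fno-plt
gdb ./a.out
(gdb) break main
(gdb) run
(gdb) layout asm          # disassembly view, no libc source needed
(gdb) stepi               # repeat (or use nexti) until you step into the memcpy call
(gdb) info symbol $pc     # prints which memcpy variant you ended up in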
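
Fourth, a sketch of the copy loop reworked along the lines suggested above: plain aligned loads (NT loads gain nothing on normal write-back memory), non-temporal stores only for large copies, the work done in whole 64-byte cache lines, and `_mm_sfence()` after the streaming stores. The threshold constant is a placeholder to illustrate the idea, not a measured crossover point.

#include <immintrin.h>
#include <cstring>
#include <cstddef>

// Placeholder threshold: below this, NT stores mostly hurt (the destination is
// likely to be re-read soon), so defer to the library memcpy instead.
static const std::size_t kNtThreshold = 1024 * 1024;

// Assumes dst and src are 32-byte aligned and the buffers do not overlap.
void mymemcpy_nt(char* dst, const char* src, std::size_t size)
{
    if (size < kNtThreshold) {
        std::memcpy(dst, src, size);
        return;
    }
    const __m256i* s = reinterpret_cast<const __m256i*>(src);
    __m256i* d = reinterpret_cast<__m256i*>(dst);
    std::size_t lines = size / 64;                  // whole 64-byte cache lines
    for (std::size_t i = 0; i < lines; ++i, s += 2, d += 2) {
        __m256i lo = _mm256_load_si256(s);          // normal aligned load
        __m256i hi = _mm256_load_si256(s + 1);
        _mm256_stream_si256(d, lo);                 // NT store, bypasses the cache
        _mm256_stream_si256(d + 1, hi);
    }
    _mm_sfence();                                   // order the NT stores before anything after the copy
    std::size_t copied = lines * 64;
    if (copied < size)
        std::memcpy(dst + copied, src + copied, size - copied);
}

Even this is unlikely to beat glibc across the board; as noted in the comments, for DRAM-bound sizes 16-byte vectors are barely any slower, and the library version already branches on size and alignment.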
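
Fifth, a tiny example of the point about small compile-time-constant copies: the compiler does not call any memcpy function here at all, it simply inlines a couple of vector loads and stores (visible with g++ -O3 -S).

#include <cstring>

// For a small, known-at-compile-time size, GCC/Clang inline the copy
// (typically a pair of vmovdqu/movups instructions) instead of calling memcpy.
void copy32(void* dst, const void* src)
{
    std::memcpy(dst, src, 32);
}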
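
Finally, the byte-by-byte experiment from the last two comments is essentially the loop below. With optimization enabled, GCC typically recognizes it and either emits a call to memcpy or vectorizes it, which is worth confirming by looking at the generated asm (for example with g++ -O3 -march=native -S).

#include <cstddef>

// A naive byte-copy loop. At -O2/-O3, GCC/Clang usually pattern-match this
// into a memcpy call (or inline SIMD for small, known sizes).
void byte_copy(char* __restrict dst, const char* __restrict src, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}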

0 Answers