I have been looking at fast ways to copy various amounts of data when NEON vector instructions are available on an ARM device.

I've done some benchmarks, and have some interesting results. I'm trying to understand what I'm looking at.

I have four versions that copy the data:

1. Baseline

Copies element by element:

for (int i = 0; i < size; ++i)
{
    copy[i] = orig[i];
}

2. NEON

This code loads four values into a temporary register, then stores the register to the output.

Thus the number of loads is reduced by half. There may be a way to skip the temporary register and reduce the loads by one quarter, but I haven't found one (a possible variant is sketched after the loop below).

int32x4_t tmp;
for (int i = 0; i < size; i += 4)
{
    tmp = vld1q_s32(orig + i); // load 4 elements to tmp SIMD register
    vst1q_s32(&copy2[i], tmp); // copy 4 elements from tmp SIMD register
}
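
A possible variant (untested; these are AArch64 ACLE intrinsics, and compiler support for them varies) uses the multi-register forms vld1q_s32_x4/vst1q_s32_x4 to move 16 elements per iteration:

for (int i = 0; i < size; i += 16) // assumes size is a multiple of 16
{
    int32x4x4_t tmp4 = vld1q_s32_x4(orig + i); // load 16 elements into 4 Q registers
    vst1q_s32_x4(&copy2[i], tmp4);             // store all 16 elements back out
}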

3. Stepped memcpy

Uses memcpy, but copies 4 elements at a time, to compare against the NEON version.

for (int i = 0; i < size; i+=4)
{
    memcpy(orig+i, copy3+i, 4);
}

4. Normal memcpy

Uses memcpy with the full amount of data in a single call.

memcpy(orig, copy4, size);

My benchmark using 2^16 values gave some surprising results:

1. Baseline time = 3443[µs]
2. NEON time = 1682[µs]
3. memcpy (stepped) time = 1445[µs]
4. memcpy time = 81[µs]

The speedup for NEON is expected; however, the faster stepped memcpy time is surprising to me, and the time for version 4 even more so.

Why is memcpy doing so well? Does it use NEON under the hood? Or are there efficient memory-copy instructions I am not aware of?

This question discussed NEON versus memcpy(). However, I don't feel the answers sufficiently explore why the ARM memcpy implementation runs so well.

The full code listing is below:

#include <arm_neon.h>
#include <vector>
#include <cinttypes>

#include <iostream>
#include <cstdlib>
#include <chrono>
#include <cstring>

int main(int argc, char *argv[]) {

    int arr_size;
    if (argc==1)
    {
        std::cout << "Please enter an array size" << std::endl;
        exit(1);
    }

    int size =  atoi(argv[1]); // not very C++, sorry
    std::int32_t* orig = new std::int32_t[size];
    std::int32_t* copy = new std::int32_t[size];
    std::int32_t* copy2 = new std::int32_t[size];
    std::int32_t* copy3 = new std::int32_t[size];
    std::int32_t* copy4 = new std::int32_t[size];


    // Non-neon version
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    for (int i = 0; i < size; ++i)
    {
        copy[i] = orig[i];
    }
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    std::cout << "Baseline time = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "[µs]" << std::endl;

    // NEON version
    begin = std::chrono::steady_clock::now();
    int32x4_t tmp;
    for (int i = 0; i < size; i += 4)
    {
        tmp = vld1q_s32(orig + i); // load 4 elements to tmp SIMD register
        vst1q_s32(&copy2[i], tmp); // copy 4 elements from tmp SIMD register
    }
    end = std::chrono::steady_clock::now();
    std::cout << "NEON time = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "[µs]" << std::endl;


    // Stepped memcpy example
    begin = std::chrono::steady_clock::now();
    for (int i = 0; i < size; i+=4)
    {
        memcpy(orig+i, copy3+i, 4);
    }
    end = std::chrono::steady_clock::now();
    std::cout << "memcpy time = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "[µs]" << std::endl;


    // Full memcpy example
    begin = std::chrono::steady_clock::now();
    memcpy(orig, copy4, size);
    end = std::chrono::steady_clock::now();
    std::cout << "memcpy time = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "[µs]" << std::endl;

    return 0;
}

Prunus Persica
  • People love to optimize memcpy. I would expect it to be fast. – Zan Lynx Sep 07 '20 at 18:08
  • The plain `memcpy` call copies 4 times less data than the other methods: it takes the size in bytes, but you pass the number of 4-byte elements. Assuming that call even compiles - it refers to a name `i` that doesn't appear to be declared anywhere in the code shown. And you pass the parameters the wrong way round: the first one is the destination of the copy, the second is the source. – Igor Tandetnik Sep 07 '20 at 18:08
  • Also, if you are optimizing your program, it might be doing dead code elimination, since it doesn't appear to actually do anything. And if you aren't optimizing it, all the numbers are worthless anyway, since memcpy in the library is definitely optimized. – Zan Lynx Sep 07 '20 at 18:10
  • You probably want to use a benchmarking library where you can compile your test object files with `-Ofast` or `-O3`, and the benchmarking main function with `-O0`. – Zan Lynx Sep 07 '20 at 18:12
  • What you call "stepped `memcpy`" also only copies every 4th element, and also in the wrong direction. – Igor Tandetnik Sep 07 '20 at 18:12
  • Do you have multiple tests you have run, or is that the only one? Also, I notice that the 4 tests run sequentially, and the answer at the link you provided mentions being careful about warm-up. Maybe this is why memcpy is doing so well in your example. Try running memcpy first. – Matthew Sep 07 '20 at 18:15
  • You're making several basic microbenchmarking mistakes. You aren't warming up your arrays before the timed regions, so you'll get page faults when you write them. (Or in the memcpy versions, when you *read* them, because you got the memcpy args backwards, allowing copy-on-write from the same physical page of zeros, getting L1d hits plus dTLB misses, partly explaining how it can be so much faster.) [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) – Peter Cordes Sep 07 '20 at 18:24
  • I don't see any assembly code, or high-level code to read the time: what ARM, what instruction set, alignment, branch prediction, warming up the caches and branch prediction, cache disabled/enabled, processor clock speed vs. instruction/data memory speeds, etc. I don't see this having much benchmarking value yet, at least for non-SIMD vs. SIMD. – old_timer Sep 07 '20 at 18:35
  • What environment are you running this on, and what have you done to mitigate that, etc.? – old_timer Sep 07 '20 at 18:36

1 Answer

Note: this code uses memcpy in the wrong direction. It should be memcpy(dest, src, num_bytes).
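
For reference, a corrected sketch of the two memcpy tests, also fixing the size argument (memcpy takes a byte count, not an element count, as the comments point out):

for (int i = 0; i < size; i += 4)
{
    memcpy(copy3 + i, orig + i, 4 * sizeof(std::int32_t)); // destination first; 4 elements = 16 bytes
}
memcpy(copy4, orig, size * sizeof(std::int32_t)); // full copy, size converted to bytes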

Because the "normal memcpy" test happens last, the massive order of magnitude speedup vs. other tests would be explained by dead code elimination. The optimizer saw that orig is not used after the last memcpy, so it eliminated the memcpy.

A good way to write reliable benchmarks is with the Google Benchmark framework, using its benchmark::DoNotOptimize(x) function to prevent dead code elimination.
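
As an illustration, a minimal sketch of the full copy as a Google Benchmark case (the 1 << 16 element count mirrors the question; initializing the vectors up front also touches their pages, addressing the warm-up concern raised in the comments):

#include <benchmark/benchmark.h>
#include <cstdint>
#include <cstring>
#include <vector>

static void BM_Memcpy(benchmark::State& state) {
    std::vector<std::int32_t> orig(1 << 16, 1); // initialization also warms up the pages
    std::vector<std::int32_t> copy(1 << 16);
    for (auto _ : state) {
        std::memcpy(copy.data(), orig.data(), orig.size() * sizeof(std::int32_t));
        benchmark::DoNotOptimize(copy.data()); // keep the copy from being eliminated
        benchmark::ClobberMemory();            // force the stores to be considered visible
    }
}
BENCHMARK(BM_Memcpy);
BENCHMARK_MAIN();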

Pascal Getreuer