15

This is a follow-up to this question where I posted this program:

#include <algorithm>
#include <cstdlib>
#include <cstdio>
#include <cstring>
#include <ctime>
#include <iomanip>
#include <iostream>
#include <vector>
#include <chrono>

class Stopwatch
{
public:
    typedef std::chrono::high_resolution_clock Clock;

    //! Constructor starts the stopwatch
    Stopwatch() : mStart(Clock::now())
    {
    }

    //! Returns elapsed number of seconds in decimal form.
    double elapsed()
    {
        return 1.0 * (Clock::now() - mStart).count() / Clock::period::den;
    }

    Clock::time_point mStart;
};

struct test_cast
{
    int operator()(const char * data) const
    {
        return *((int*)data);
    }
};

struct test_memcpy
{
    int operator()(const char * data) const
    {
        int result;
        memcpy(&result, data, sizeof(result));
        return result;
    }
};

struct test_memmove
{
    int operator()(const char * data) const
    {
        int result;
        memmove(&result, data, sizeof(result));
        return result;
    }
};

struct test_std_copy
{
    int operator()(const char * data) const
    {
        int result;
        std::copy(data, data + sizeof(int), reinterpret_cast<char *>(&result));
        return result;
    }
};

enum
{
    iterations = 2000,
    container_size = 2000
};

//! Returns a list of integers in binary form.
std::vector<char> get_binary_data()
{
    std::vector<char> bytes(sizeof(int) * container_size);
    for (std::vector<char>::size_type i = 0; i != bytes.size(); i += sizeof(int))
    {
        memcpy(&bytes[i], &i, sizeof(i));
    }
    return bytes;
}

template<typename Function>
unsigned benchmark(const Function & function, unsigned & counter)
{
    std::vector<char> binary_data = get_binary_data();
    Stopwatch sw;
    for (unsigned iter = 0; iter != iterations; ++iter)
    {
        for (unsigned i = 0; i != binary_data.size(); i += sizeof(int))
        {
            const char * c = reinterpret_cast<const char*>(&binary_data[i]);
            counter += function(c);
        }
    }
    return unsigned(0.5 + 1000.0 * sw.elapsed());
}

int main()
{
    srand(time(0));
    unsigned counter = 0;

    std::cout << "cast:      " << benchmark(test_cast(),     counter) << " ms" << std::endl;
    std::cout << "memcpy:    " << benchmark(test_memcpy(),   counter) << " ms" << std::endl;
    std::cout << "memmove:   " << benchmark(test_memmove(),  counter) << " ms" << std::endl;
    std::cout << "std::copy: " << benchmark(test_std_copy(), counter) << " ms" << std::endl;
    std::cout << "(counter:  " << counter << ")" << std::endl << std::endl;

}

I noticed that for some reason std::copy performs much worse than memcpy. The output looks like this on my Mac using gcc 4.7.

g++ -o test -std=c++0x -O0 -Wall -Werror -Wextra -pedantic-errors main.cpp
cast:      41 ms
memcpy:    46 ms
memmove:   53 ms
std::copy: 211 ms
(counter:  3838457856)

g++ -o test -std=c++0x -O1 -Wall -Werror -Wextra -pedantic-errors main.cpp
cast:      8 ms
memcpy:    7 ms
memmove:   8 ms
std::copy: 19 ms
(counter:  3838457856)

g++ -o test -std=c++0x -O2 -Wall -Werror -Wextra -pedantic-errors main.cpp
cast:      3 ms
memcpy:    2 ms
memmove:   3 ms
std::copy: 27 ms
(counter:  3838457856)

g++ -o test -std=c++0x -O3 -Wall -Werror -Wextra -pedantic-errors main.cpp
cast:      2 ms
memcpy:    2 ms
memmove:   3 ms
std::copy: 16 ms
(counter:  3838457856)

As you can see, even with -O3 it is up to 5 times (!) slower than memcpy.

The results are similar on Linux.

Does anyone know why?

Peter Cordes
StackedCrooked
  • 1
    since the final question involves only standard c++, why not provide a portable example? – Cheers and hth. - Alf Oct 29 '12 at 19:41
  • 1
    Some disassembly dumps would be the first step in answering. – aschepler Oct 29 '12 at 19:44
  • This code is pretty damn confusing consider it is just timing to functions, I am not surprised the optimiser is having a hard time with it – 111111 Oct 29 '12 at 19:45
  • 3
    Adding in a `memmove` test might be interesting too. – aschepler Oct 29 '12 at 19:46
  • 2
    I do think that functors and templates is complete overkill for this benchmark. Just write 3 loops. – Mysticial Oct 29 '12 at 19:46
  • further more what exactly is this code supposed to do? std::copy isn't for assign ints. – 111111 Oct 29 '12 at 19:52
  • 1
    @StackedCrooked: thx, TIL `` – Cheers and hth. - Alf Oct 29 '12 at 19:53
  • 2
    Using gcc v4.6.2 with `-O3 -S -masm=intel`, the `memcpy` test boils down to a single `add edx, [esi+ecx*4]` whereas `std::copy` uses a slower `rep movsb`. I suppose that's what you get for asking it to copy chars... (Clang seems to compile both versions the same way.) – DCoder Oct 29 '12 at 20:01
  • @111111 The original use case was to compare various means of parsing a list of integers from a block of binary data. In this benchmark I wanted to test if the smelly `*((int*)ptr)` was any faster than a more correct `memcpy`-based solution. I added `std::copy` later out of curiosity and for completeness' sake. – StackedCrooked Oct 29 '12 at 20:01
  • 1
    @StackedCrooked the point is, you using a range base copy algorithms to the job of a non-converting-assignment and wondering why it is slower. – 111111 Oct 29 '12 at 20:10
  • I was under the impression that std::copy was the preferred way of copying data in C++ and memcpy was there for backwards compatibility with C. That was my wrong assumption then? – StackedCrooked Oct 29 '12 at 20:14
  • 2
    @StackedCrooked: It's the preferred way in general, because it works over any iterator types over any ranges. `memcpy` works standard-layout types and non-overlapping ranges. In principle the compiler *could* deduce that your `std::copy` call is able to be implemented as a call to `memcpy`; but compiler writers have lots of other things to worry about too, and the same person who's going to notice the performance hit is also going to be willing to accept the simple fix, so it's probably not a top priority. The C equivalent to `std::copy` is `memmove`. – GManNickG Oct 29 '12 at 20:23

6 Answers

11

I agree with @rici's comment about developing a more meaningful benchmark so I rewrote your test to benchmark copying of two vectors using memcpy(), memmove(), std::copy() and the std::vector assignment operator:

#include <algorithm>
#include <iostream>
#include <vector>
#include <chrono>
#include <functional> // std::function, used by benchmark() below
#include <random>
#include <cstring>
#include <cassert>

typedef std::vector<int> vector_type;

void test_memcpy(vector_type & destv, vector_type const & srcv)
{
    vector_type::pointer       const dest = destv.data();
    vector_type::const_pointer const src  = srcv.data();

    std::memcpy(dest, src, srcv.size() * sizeof(vector_type::value_type));
}

void test_memmove(vector_type & destv, vector_type const & srcv)
{
    vector_type::pointer       const dest = destv.data();
    vector_type::const_pointer const src  = srcv.data();

    std::memmove(dest, src, srcv.size() * sizeof(vector_type::value_type));
}

void test_std_copy(vector_type & dest, vector_type const & src)
{
    std::copy(src.begin(), src.end(), dest.begin());
}

void test_assignment(vector_type & dest, vector_type const & src)
{
    dest = src;
}

auto
benchmark(std::function<void(vector_type &, vector_type const &)> copy_func)
    ->decltype(std::chrono::milliseconds().count())
{
    std::random_device rd;
    std::mt19937 generator(rd());
    std::uniform_int_distribution<vector_type::value_type> distribution;

    static vector_type::size_type const num_elems = 2000;

    vector_type dest(num_elems);
    vector_type src(num_elems);

    // Fill the source and destination vectors with random data.
    for (vector_type::size_type i = 0; i < num_elems; ++i) {
        src[i]  = distribution(generator);  // assign in place: push_back here
        dest[i] = distribution(generator);  // would double the vector sizes
    }

    static int const iterations = 50000;

    std::chrono::time_point<std::chrono::system_clock> start, end;

    start = std::chrono::system_clock::now();

    for (int i = 0; i != iterations; ++i)
        copy_func(dest, src);

    end = std::chrono::system_clock::now();

    assert(src == dest);

    return
        std::chrono::duration_cast<std::chrono::milliseconds>(
            end - start).count();
}

int main()
{
    std::cout
        << "memcpy:     " << benchmark(test_memcpy)     << " ms" << std::endl
        << "memmove:    " << benchmark(test_memmove)    << " ms" << std::endl
        << "std::copy:  " << benchmark(test_std_copy)   << " ms" << std::endl
        << "assignment: " << benchmark(test_assignment) << " ms" << std::endl
        << std::endl;
}

I went a little overboard with C++11 just for fun.

Here are the results I get on my 64 bit Ubuntu box with g++ 4.6.3:

$ g++ -O3 -std=c++0x foo.cpp ; ./a.out 
memcpy:     33 ms
memmove:    33 ms
std::copy:  33 ms
assignment: 34 ms

The results are all quite comparable! I also get similarly close times in all test cases when I change the vector's integer type, e.g. to long long.

Unless my benchmark rewrite is broken, it looks like your own benchmark isn't performing a valid comparison. HTH!
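One variation worth noting (a sketch with illustrative names, not part of the measured code above): passing the copy function as a template parameter instead of std::function removes the type-erasure overhead from the timed loop, because the compiler can then inline the call:

```cpp
#include <chrono>
#include <vector>

typedef std::vector<int> vector_type;

// Template parameter instead of std::function: the copy call can be
// inlined into the timing loop, so only the copy itself is measured.
template <typename CopyFunc>
long long benchmark_inlined(CopyFunc copy_func)
{
    vector_type dest(2000);
    vector_type src(2000, 42);

    auto const start = std::chrono::system_clock::now();
    for (int i = 0; i != 50000; ++i)
        copy_func(dest, src);
    auto const end = std::chrono::system_clock::now();

    return std::chrono::duration_cast<std::chrono::milliseconds>(
               end - start).count();
}
```

For example, `benchmark_inlined(test_std_copy)` times the same copy without the std::function indirection.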

Void - Othman
  • Interesting! One minor point: capturing the function as a template argument, e.g. `template auto benchmark(const F & copy_func)` makes the invocation a little faster because it allows inlining (std::function has type-erasure which prevents inlining). This leads to more accurate results because the invocation cost of std::function can "dilute" the time differences of the actual copy function. I tried it and got more diverse results: memcpy: 24 ms, memmove: 22 ms, std::copy: 19 ms, assignment: 22 ms. However, they are still pretty close together and std::copy is the fastest now. – StackedCrooked Oct 30 '12 at 00:09
  • @StackedCrooked: Interesting point about the type erasure. Out of curiosity, is `std::copy` consistently the fastest with your function template argument change in place? I always get comparable results in between runs but the fastest case isn't always the same, unsurprisingly. – Void - Othman Oct 30 '12 at 00:22
  • I get [different results](http://i.imgur.com/fk4Up.png) each time. I suspect that GCC might be optimizing away some of the parts that we want to measure. I don't really know how to check that. – StackedCrooked Oct 30 '12 at 01:10
8

Looks to me like the answer is that gcc can optimize these particular calls to memmove and memcpy, but not std::copy. gcc is aware of the semantics of memmove and memcpy, and in this case can take advantage of the fact that the size is known (sizeof(int)) to turn the call into a single mov instruction.

std::copy is implemented in terms of memcpy, but apparently the gcc optimizer doesn't manage to figure out that data + sizeof(int) - data is exactly sizeof(int), so the std::copy benchmark ends up making an actual call to memcpy instead of a single mov.

I got all that by invoking gcc with -S and flipping quickly through the output; I could easily have gotten it wrong, but what I saw seems consistent with your measurements.

By the way, I think the test is more or less meaningless. A more plausible real-world test might be creating an actual vector<int> src and an int[N] dst, and then comparing memcpy(dst, src.data(), sizeof(int)*src.size()) with std::copy(src.begin(), src.end(), dst).

rici
  • The test is based on a real-life use case where network headers (which often contain integral fields) are parsed from incoming network bytes. – StackedCrooked Oct 29 '12 at 21:37
  • 1
    @StackedCrooked: and you do that without worrying about byte order? Anyway, what you're measuring is most likely the compiler's ability to optimize particular invocations. The memcpy invocation you're giving it is certainly the easiest possible one to optimize. If that's what you're going to end up doing, fine, but if the size of the integral type varies, you might want to benchmark a mix of sizes. It will depend on how you invoke memcpy, too: particularly, whether or not the length is a compile-time constant. – rici Oct 30 '12 at 03:35
3

memcpy and std::copy each have their uses. std::copy should (as pointed out by Cheers below) be about as slow as memmove, because there is no guarantee that the memory regions don't overlap. Since std::copy supports iterators, it can also copy non-contiguous regions very easily (think of sparsely allocated structures like linked lists, or custom classes/structures that implement iterators). memcpy only works on contiguous regions and as such can be heavily optimized.

Jesus Ramos
  • 7
    -1 `std::copy` is a function template and as such does not use non-pointer iterators when you provide pointer arguments. also, it can be and should be specialized for the case of copying PODs. however, when the programmer uses `memcpy` as opposed to `memmove`, the programmer is stating that he/she knows that the memory areas are not overlapping, and AFAIK there's no way to state that to `std::copy`. thus, `std::copy` should ideally be on a par with `memmove`, and with `memcpy` *slightly* faster. that it currently isn't, with at at least some popular implementations, says something. – Cheers and hth. - Alf Oct 29 '12 at 19:48
  • @Cheersandhth.-Alf From the code I checked in libstdc++ I cannot find a specialization for PODs which is why I wrote this answer, either that or the compiler is just bad at optimizing the template code possibly. Can you explain the reason for your -1 a little more please? – Jesus Ramos Oct 29 '12 at 20:05
  • JesusRamos, "std::copy is slower because of it's use of iterators." is incorrect. in the code given it uses just ordinary pointers. – Cheers and hth. - Alf Oct 29 '12 at 20:27
  • 3
    @JesusRamos: The specialization in `libstdc++` begins in its `std::copy()` implementation [here](http://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libstdc%2B%2B-v3/include/bits/stl_algobase.h;h=fe30f6ce9f5e7c2931cfb424257f2e27ca4a7650;hb=HEAD#l432). Ultimately it should get to [this specialization](http://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libstdc%2B%2B-v3/include/bits/stl_algobase.h;h=fe30f6ce9f5e7c2931cfb424257f2e27ca4a7650;hb=HEAD#l364) which contains the underlying call to `memmove()`. I'm assuming you're referring to the GNU `libstdc++` implementation. HTH! – Void - Othman Oct 29 '12 at 20:27
  • @Void Thanks, I missed out when reading that. I was looking for a memcpy optimization instead and forgot about memmove. – Jesus Ramos Oct 29 '12 at 20:29
  • @Cheersandhth.-Alf Oh okay, I was thinking for a more general purpose approach to using std::copy, I never really use it for raw pointers. Will update answer to reflect that. – Jesus Ramos Oct 29 '12 at 20:30
  • @Cheersandhth.-Alf, "AFAIK there's no way to state that to std::copy", I think `std::uninitialized_copy` could be interpreted to imply `memcpy` (among other things in other contexts) since the memory has to have been just allocated or at least different from the source memory. Although I am confused what exactly these implementations are doing (I did my own test above). – alfC Feb 20 '20 at 03:42
3

Those are not the results I get:

> g++ -O3 XX.cpp 
> ./a.out
cast:      5 ms
memcpy:    4 ms
std::copy: 3 ms
(counter:  1264720400)

Hardware: 2GHz Intel Core i7
Memory:   8G 1333 MHz DDR3
OS:       Mac OS X 10.7.5
Compiler: i686-apple-darwin11-llvm-g++-4.2 (GCC) 4.2.1

On a Linux box I get different results:

> g++ -std=c++0x -O3 XX.cpp 
> ./a.out 
cast:      3 ms
memcpy:    4 ms
std::copy: 21 ms
(counter:  731359744)


Hardware:  Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Memory:    61363780 kB
OS:        Linux ip-10-58-154-83 3.2.0-29-virtual #46-Ubuntu SMP
Compiler:  g++ (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
Martin York
1

According to assembler output of G++ 4.8.1, test_memcpy:

movl    (%r15), %r15d

test_std_copy:

movl    $4, %edx
movq    %r15, %rsi
leaq    16(%rsp), %rdi
call    memcpy

As you can see, the compiler successfully recognized that std::copy can copy the data with memcpy, but for some reason further inlining did not happen, and that is the reason for the performance difference.

By the way, Clang 3.4 produces identical code for both cases:

movl    (%r14,%rbx), %ebp
Evgeny Panasyuk
0

EDIT: I leave this answer for reference; the odd timings with gcc seem to be an artifact of "code alignment" (see comments).


I was about to say that this was an implementation glitch in gcc 4 at the time, but it might be more complicated than that. My results are (I used 20000/20000 for the iteration and container-size constants):

$ g++ -Ofast a.cpp; ./a.out
cast:      24 ms
memcpy:    47 ms
memmove:   24 ms
std::copy: 24 ms
(counter:  1787289600)

$ g++ -O3 a.cpp; ./a.out
cast:      24 ms
memcpy:    24 ms
memmove:   24 ms
std::copy: 47 ms
(counter:  1787289600)
$ g++ --version
g++ (Ubuntu 9.2.1-9ubuntu2) 9.2.1 20191008

Notice how the std::copy and memcpy results swap between -O3 and -Ofast. Also, memmove is no slower than either of them.

In clang the results are simpler:

$ clang++ -O3 a.cpp; ./a.out
cast:      26 ms
memcpy:    26 ms
memmove:   26 ms
std::copy: 26 ms
(counter:  1787289600)

$ clang++ -Ofast a.cpp; ./a.out
cast:      26 ms
memcpy:    26 ms
memmove:   26 ms
std::copy: 26 ms
(counter:  1787289600)
$ clang++ --version
clang version 9.0.0-2 (tags/RELEASE_900/final)

perf results: https://pastebin.com/BZCZiAWQ
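To rule out the constant-propagation effect, a sketch of one option (my own variation on the OP's get_binary_data, not code from the thread) is to seed the buffer through a volatile, so the optimizer cannot fold the copy loops into a compile-time sum:

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Variation of the OP's get_binary_data(): reading the multiplier
// through a volatile keeps the buffer contents opaque to the
// optimizer, so the benchmark loops cannot be constant-folded away.
std::vector<char> get_binary_data_opaque()
{
    volatile int seed = 1;  // opaque to the optimizer
    std::vector<char> bytes(sizeof(int) * 2000);
    for (std::size_t i = 0; i != bytes.size(); i += sizeof(int))
    {
        int value = seed * static_cast<int>(i);
        std::memcpy(&bytes[i], &value, sizeof(value));
    }
    return bytes;
}
```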

alfC
  • 14,261
  • 4
  • 67
  • 118
  • Is that swap repeatable? `gcc -Ofast` is just `gcc -O3 -ffast-math`, so unless optimizing `double elapsed()` is somehow relevant (indirectly for code-alignment effects on later loops?) it's more likely just measurement noise. Re-run your test multiple times, e.g. under `perf stat -r 10 ./a.out` to run the program 10 times. (And record HW perf counter averages for the *total* run time.) – Peter Cordes Feb 20 '20 at 03:54
  • @PeterCordes Yes, it is repeatable, I am not an expert on benchmarks but I stabilized my CPU frequency for the test. Running with `perf` agrees, the results gets swapped (with gcc). – alfC Feb 20 '20 at 04:03
  • What CPU are you testing on? On my i7-6700k (Skylake) at 3.9GHz, I had to bump up the repeat counts to get meaningful results. (It does scale linearly with `iterations = 200000` being 10x faster than `2000000`, with the OP's tiny `container_size = 2000` that's probably giving all L1d hits). cast and memmove at 516ms, taking ~twice as long as the other two at 260ms. So clearly there's a huge flaw somewhere, perhaps in memory alignment / 4k aliasing. Those timings are identical with both `-O3` and `-Ofast`, with g++ 9.2.0 on Arch Linux. – Peter Cordes Feb 20 '20 at 04:26
  • Oh lol, it turns out that all 4 loops got optimized to pure ALU SIMD adds, adding up `sum(0..n-1)`, and the perf difference was merely from code alignment effects (Skylake is sensitive to having 1 cycle / iteration tight loop in one uop-cache, which can only cache instructions from one 32-byte block of machine code. So loops that span a 32-byte boundary run slower. And with the microcode update for the JCC erratum, loops that end at the end of a 32-byte block are also a problem.) So there's absolutely nothing to see here; GCC did constant-propagation from the memcpy array init! – Peter Cordes Feb 20 '20 at 04:51
  • See the loops containing `padd` in https://godbolt.org/z/-KZfWW. Those are where `perf record` says all the time was spent on my desktop. – Peter Cordes Feb 20 '20 at 04:52
  • @PeterCordes Good. But what is the conclusion at the end? I posted my perf results in pastebin (see edit). – alfC Feb 20 '20 at 05:46
  • 1
    On your compiler, `-ffast-math` presumably changes code-size of something, leading to different alignment for the loops. On your CPU (whatever microarchitecture it is; you still haven't said) with that combination of alignments, you happen to get unlucky in one of four cases each time, but a different one depending on how the `double` math compiles and/or linking in the fast-math CRT startup code. Clang unrolls tiny loops by default, removing the potential front-end bottleneck. – Peter Cordes Feb 20 '20 at 05:53
  • @PeterCordes, ok, I understand, thank you, otherwise the coincidence was astonishing. My computer is `model name : Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz stepping : 10 microcode : 0xca `. I think it is better if we move these to a real benchmark platform. – alfC Feb 20 '20 at 06:04
  • Eh, why bother? This proves that modern GCC and clang fully see through all 4 versions. Although I guess it's possible that code-gen could differ if it wasn't doing constant-propagation through them. But in practice I don't expect that. Still, probably worth making something `volatile` or random in the init loop to hide the constantness from the optimizer, and see if we get a normal auto-vectorized sum of the array. – Peter Cordes Feb 20 '20 at 06:12
  • @PeterCordes, well just not to repeat the (my) mistake. Probably Google Benchmark will do some tricks to avoid these issues. I think your theory can explain many of the odd results in this SO thread. What is interesting is that in spite of the comments around the code compiles to `memcpy` and not to `memmove`, probably the compiler can "see" the non-overlapping memory ranges. – alfC Feb 20 '20 at 06:30
  • The real question is: "what do you *want* to measure?" The OP's ultimate goal was a safe and efficient way in C to express an unaligned load into an `int` from a char buffer. `memcpy` is the answer; it compiles efficiently, same as the other methods. (Also, reinterpret and std::copy are UB on misaligned `int*`, which I think is why they did this test where every access to the char vector was in fact int aligned, so they could compare things that are expected to be efficient). – Peter Cordes Feb 20 '20 at 07:01
  • Re: compiling to memcpy: you're talking about the old GCC4.6/7 missed optimization with std::copy in other answers? In https://godbolt.org/z/XSFJ2R (this code with gcc9.2) the only occurrence of "memcpy" is the string literal. https://en.cppreference.com/w/cpp/algorithm/copy says that std::copy has UB if the *first* destination element overlaps the source range, and you have to use copy_backwards manually. memcpy is UB on *any* overlap (even end of dst), but I think gcc (or libstdc++'s copy if it's in the source explicitly) knows that glibc memcpy is safe for that case. – Peter Cordes Feb 20 '20 at 07:06
  • Ok, thanks for the information. I didn't appreciate the question well. Here is a test that confirms that copy/memcpy/memmove give basically the same time. http://quick-bench.com/VR1USQOfdoXQQ-k1GoVE-AI0mP0 – alfC Feb 20 '20 at 07:46
  • Semi-related: [Why is std::fill(0) slower than std::fill(1)?](//stackoverflow.com/q/42558907) - for filling with known patterns, gcc can do better (for generic tuning running on recent CPUs) when it's a single-byte repeat, otherwise it only uses normal auto-vectorization. i.e. it doesn't know how to detect patterns that could use `wmemset`. But that's irrelevant for *copying*. When I started looking for that, I was thinking the Q&A I was remembering was more related, but it's still related enough to link anyway. – Peter Cordes Feb 20 '20 at 07:50