38

When can I get better performance using memcpy, or how do I benefit from using it? For example:

float a[3]; float b[3];

is this code:

memcpy(a, b, 3*sizeof(float));

faster than this one?

a[0] = b[0];
a[1] = b[1];
a[2] = b[2];
Pythagoras of Samos
    I guess even assignment operator for float would be implemented using memcpy. So, directly using memcpy for the entire array would be faster – Akhil Dec 28 '10 at 08:53
    I don't believe your edit. Why would the second approach be faster? memcpy() is specifically designed to copy areas of memory from one place to another, so it should be as efficient as the underlying architecture will allow. I would bet that it will use appropriate assembly where applicable to do a block memory copy. – Martin York Dec 28 '10 at 09:31

7 Answers

63

Efficiency should not be your concern.
Write clean maintainable code.

It bothers me that so many answers indicate that memcpy() is inefficient. It is designed to be the most efficient way of copying blocks of memory (for C programs).

So I wrote the following as a test:

#include <algorithm>

extern float a[3];
extern float b[3];
extern void base();

int main()
{
    base();

#if defined(M1)
    a[0] = b[0];
    a[1] = b[1];
    a[2] = b[2];
#elif defined(M2)
    memcpy(a, b, 3*sizeof(float));    
#elif defined(M3)
    std::copy(&b[0], &b[3], &a[0]);
#endif

    base();
}

Then, to compare the code each version produces:

g++ -O3 -S xr.cpp -o s0.s
g++ -O3 -S xr.cpp -o s1.s -DM1
g++ -O3 -S xr.cpp -o s2.s -DM2
g++ -O3 -S xr.cpp -o s3.s -DM3

echo "=======" >  D
diff s0.s s1.s >> D
echo "=======" >> D
diff s0.s s2.s >> D
echo "=======" >> D
diff s0.s s3.s >> D

This resulted in: (comments added by hand)

=======   // Copy by hand
10a11,18
>   movq    _a@GOTPCREL(%rip), %rcx
>   movq    _b@GOTPCREL(%rip), %rdx
>   movl    (%rdx), %eax
>   movl    %eax, (%rcx)
>   movl    4(%rdx), %eax
>   movl    %eax, 4(%rcx)
>   movl    8(%rdx), %eax
>   movl    %eax, 8(%rcx)

=======    // memcpy()
10a11,16
>   movq    _a@GOTPCREL(%rip), %rcx
>   movq    _b@GOTPCREL(%rip), %rdx
>   movq    (%rdx), %rax
>   movq    %rax, (%rcx)
>   movl    8(%rdx), %eax
>   movl    %eax, 8(%rcx)

=======    // std::copy()
10a11,14
>   movq    _a@GOTPCREL(%rip), %rsi
>   movl    $12, %edx
>   movq    _b@GOTPCREL(%rip), %rdi
>   call    _memmove

Added: timing results for running the above inside a loop of 1,000,000,000 iterations.

   g++ -c -O3 -DM1 X.cpp
   g++ -O3 X.o base.o -o m1
   g++ -c -O3 -DM2 X.cpp
   g++ -O3 X.o base.o -o m2
   g++ -c -O3 -DM3 X.cpp
   g++ -O3 X.o base.o -o m3
   time ./m1

   real 0m2.486s
   user 0m2.478s
   sys  0m0.005s
   time ./m2

   real 0m1.859s
   user 0m1.853s
   sys  0m0.004s
   time ./m3

   real 0m1.858s
   user 0m1.851s
   sys  0m0.006s
Martin York
    +1. And, since you didn't write down the obvious conclusion from this, the memcpy call looks like it's generating the most efficient code. – Jakob Borg Dec 28 '10 at 10:18
  • Huh. Why isn’t the call to `_memmove` inlined? – Konrad Rudolph Dec 28 '10 at 11:24
    BTW: @Martin: it is not reasonable to say "efficiency should not be your concern, write nice code". People use C++ as opposed to a decent language precisely because they demand performance. It matters. – Yttrill Dec 29 '10 at 20:42
    @Yttrill: And I have never seen a micro-optimization by a human that was not already being done better by the compiler. On the other hand, writing nice readable code implies you are thinking more at the algorithm level, where the human can beat the compiler at optimization because the compiler does not know the intent. – Martin York Jul 24 '15 at 17:33
  • @LokiAstari You think that 20 assign operators are easier to understand than a single memcpy on a struct? It might be more readable with memcpy too. (And in C/++ you are probably used to it anyways.) – akaltar Aug 15 '15 at 23:24
  • @akaltar: I think `std::copy` is easier to read. It is also the most efficient (equal to memcpy). See the supplied assembly and, since you obviously did not time it yourself, the timing results. – Martin York Aug 17 '15 at 17:59
  • @akaltar: Also you miss the point. You **CAN'T** use memcpy on structures in C++ because they have constructors (there is a small subclass of structures you can use memcpy on, but that is not the general case). Which is also why `std::copy` is better, as it will use the most efficient and valid technique. – Martin York Aug 17 '15 at 18:09
  • @LokiAstari memcpy is usually used on data-only structs, as usually those are used in high-performance code, but I understand your point, std::copy is indeed superior. As to reading.. I think they come out equal. – akaltar Aug 17 '15 at 21:31
    Addendum: instead of C-style arrays, using `std::array`, which *does* have an assignment operator, combines the best of both worlds: [readability and efficiency](https://goo.gl/aUfBHF). And has the extra added quality of not decaying to a pointer, among others. Besides, as of the time of writing, both GCC 5.2 and Clang 3.7 generate identical code in all cases, so performance is no longer relevant and readability should be favored. – user703016 Sep 02 '15 at 06:43
  • The C++ memmove (m3) speed looks dubious. There's no way a call to `memmove` would have no overhead compared to the optimized and inlined `memcpy` case. – user239558 May 27 '16 at 06:10
  • @user239558: You are surprised that `std::copy` produces the fastest code? Given the ability of the compiler to analyze and plant the best code it seems redundantly obvious that `std::copy` would be the fastest technique (or at least no slower than a technique you can do manually). – Martin York May 27 '16 at 14:22
    @LokiAstari the assembly was quoted in the answer above. There is no way a non-inlined call to `memmove`, which in addition to the above needs to check for pointer overlap, could ever be as fast as the inlined `memcpy`. It's bogus. – user239558 May 29 '16 at 21:29
  • @user239558: Well it's a good job the compiler as a machine makes better choices based on maths than humans do with "opinions". Not only is the code provided but also the timing results. Since this is a science why don't you try and repeat the experiment! – Martin York May 29 '16 at 22:40
  • @LokiAstari I'm just pointing out that the timings are bogus. Anyone who has programmed in assembly or has read intel instruction manuals will know this. It is common to make mistakes in benchmarking, and I'm simply pointing out an obvious one. I'm not wasting any more time on this, sorry. – user239558 May 29 '16 at 23:47
    @user239558: Words are easy (rather than just pontificating with no proof, just try it). – Martin York May 30 '16 at 04:02
18

You can use memcpy only if the objects you're copying have no explicit constructors, and neither do their members (so-called POD, "Plain Old Data"). So it is OK to call memcpy for float, but it is wrong for, e.g., std::string.

But part of the work has already been done for you: std::copy from <algorithm> is specialized for built-in types (and possibly for every other POD type, depending on the STL implementation). So writing std::copy(b, b + 3, a) is as fast (after compiler optimization) as memcpy, but is less error-prone.

stakx - no longer contributing
crazylammer
12

Compilers specifically optimize memcpy calls; at least clang and gcc do. So you should prefer it wherever you can.

ismail
  • @ismail : compilers may optimize `memcpy`, but still it is less likely to be faster than the second approach. Please read Simone's post. – Nawaz Dec 28 '10 at 09:09
    @Nawaz: I disagree. The memcpy() is likely to be faster given architecture support. Anyway this is redundant as std::copy (as described by @crazylammer) is probably the best solution. – Martin York Dec 28 '10 at 09:34
6

Use std::copy(). As the header file for g++ notes:

This inline function will boil down to a call to @c memmove whenever possible.

Probably, Visual Studio's is not much different. Go with the normal way, and optimize once you're aware of a bottleneck. In the case of a simple copy, the compiler is probably already optimizing for you.

Thanatos
5

Don't go for premature micro-optimisations such as using memcpy like this. Using assignment is clearer and less error-prone and any decent compiler will generate suitably efficient code. If, and only if, you have profiled the code and found the assignments to be a significant bottleneck then you can consider some kind of micro-optimisation, but in general you should always write clear, robust code in the first instance.

Paul R
    How is assigning N (where N > 2) different array items one-by-one clearer than a single `memcpy`? `memcpy(a, b, sizeof a)` is clearer because, if the size of `a` and `b` change, you don't need to add/remove assignments. – Chris Lutz Dec 28 '10 at 10:20
  • @Chris Lutz: you have to think about the robustness of the code throughout its lifetime, e.g. what happens if at some point someone changes the declaration of `a` so that it becomes a pointer instead of an array? Assignment wouldn't break in this case, but the memcpy would. – Paul R Dec 28 '10 at 10:33
    `memcpy` wouldn't break (the `sizeof a` trick would break, but only some people use that). Neither would `std::copy`, which is demonstrably superior to both in almost every respect. – Chris Lutz Dec 28 '10 at 10:35
  • @Chris: well I would rather see a for loop than individual assignments, and of course careful use of memcpy is not off-limits for C code (I would prefer not to see it in C++ code though). But if you work on code that has a long life-cycle or if you care about such things as portability, porting to other languages or compilers, use of code analysis tools, auto-vectorization, etc, then simplicity and clarity are always more important than brevity and low level hacks. – Paul R Dec 28 '10 at 10:46
4

The benefits of memcpy? Probably readability. Otherwise, you would have to either do a number of assignments or have a for loop for copying, neither of which is as simple and clear as just doing memcpy (of course, as long as your types are simple and don't require construction/destruction).

Also, memcpy is generally relatively optimized for specific platforms, to the point that it won't be all that much slower than simple assignment, and may even be faster.

Jamie
0

Supposedly, as Nawaz said, the assignment version should be faster on most platforms. That's because memcpy() will copy byte by byte, while the assignment version could copy 4 bytes at a time.

As is always the case, you should profile your application to be sure that what you expect to be the bottleneck matches reality.

Edit
The same applies to dynamic arrays. Since you mention C++, you should use the std::copy() algorithm in that case.

Edit
This is code output for Windows XP with GCC 4.5.0, compiled with -O3 flag:

extern "C" void cpy(float* d, float* s, size_t n)
{
    memcpy(d, s, sizeof(float)*n);
}

I wrote this function because the OP mentioned dynamic arrays too.

Output assembly is the following:

_cpy:
LFB393:
    pushl   %ebp
LCFI0:
    movl    %esp, %ebp
LCFI1:
    pushl   %edi
LCFI2:
    pushl   %esi
LCFI3:
    movl    8(%ebp), %eax
    movl    12(%ebp), %esi
    movl    16(%ebp), %ecx
    sall    $2, %ecx
    movl    %eax, %edi
    rep movsb
    popl    %esi
LCFI4:
    popl    %edi
LCFI5:
    leave
LCFI6:
    ret

Of course, I assume all of the experts here know what rep movsb means.

This is the assignment version:

extern "C" void cpy2(float* d, float* s, size_t n)
{
    while (n > 0) {
        n--;
        d[n] = s[n];
    }
}

which yields the following code:

_cpy2:
LFB394:
    pushl   %ebp
LCFI7:
    movl    %esp, %ebp
LCFI8:
    pushl   %ebx
LCFI9:
    movl    8(%ebp), %ebx
    movl    12(%ebp), %ecx
    movl    16(%ebp), %eax
    testl   %eax, %eax
    je  L2
    .p2align 2,,3
L5:
    movl    (%ecx,%eax,4), %edx
    movl    %edx, (%ebx,%eax,4)
    decl    %eax
    jne L5
L2:
    popl    %ebx
LCFI10:
    leave
LCFI11:
    ret

Which moves 4 bytes at a time.

Simone
  • @Simone : the first para makes sense to me. Now I need to verify it, because I'm not sure. :-) – Nawaz Dec 28 '10 at 09:08
    I don't think memcpy copies byte by byte. It is specifically designed to copy large chunks of memory very efficiently. – Martin York Dec 28 '10 at 09:24
  • Source, please? The only thing that POSIX mandates is [this](http://pubs.opengroup.org/onlinepubs/9699919799/functions/memcpy.html). BTW, see if [this implementation](http://www.gnu.org/software/mifluz/doc/doxydoc/memcpy2_8c-source.html) is that fast. – Simone Dec 28 '10 at 09:50
    @Simone - libc writers have spend a lot of time making sure their `memcpy` implementations are efficient, and compiler writers have spent just as much time making their compilers look for cases when assignments could be made faster by `memcpy` and vice versa. Your argument of "it can be as bad as you want it to" as well as your out-of-the-blue implementation is a red herring. Look at how GCC or other compilers/libc's implement it. That'll probably be fast enough for you. – Chris Lutz Dec 28 '10 at 10:19
    The usual rule of thumb applies: "Assume library writers aren't brain-damaged". Why would they write a `memcpy` that was only able to copy a byte at a time? – jalf Dec 28 '10 at 10:28
  • Because `memcpy` is required to be able to copy a single byte. Of course it may check if the size to copy is a multiple of 4 or 8, but with assignments you may omit the check and have faster code. – Simone Dec 28 '10 at 10:31
  • @Simone - On most modern platforms, the compiler will optimize that check. Most compilers will optimize each individual call to `memcpy` when they can. – Chris Lutz Dec 28 '10 at 10:38
  • @Chris as you can see, that's not true in every case. – Simone Dec 28 '10 at 11:10
  • @Simone : one thing: using a `while` loop to do the assignments and doing the assignments manually with constant offsets is NOT the same speed-wise. The latter is usually faster. – Nawaz Dec 28 '10 at 13:07
  • Yes, but you can't assign manually if you have a dynamically allocated array. – Simone Dec 28 '10 at 13:09