23

Using memcpy() when the source and destination overlap leads to undefined behaviour; in those cases only memmove() may be used.

But what if I know for sure that the buffers don't overlap? Is there a reason to use memcpy() specifically, or memmove() specifically? Which should I use and why?
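For illustration, here is a minimal sketch of the overlap case the question refers to (the helper name `shift_right_one` is made up for this example):

```c
#include <string.h>

/* Shift the first n bytes of buf one position to the right.
   The source region [buf, buf+n) and the destination region
   [buf+1, buf+1+n) overlap, so memmove() is required here;
   calling memcpy() instead would be undefined behaviour. */
void shift_right_one(char *buf, size_t n)
{
    memmove(buf + 1, buf, n);
}
```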

sharptooth
  • 167,383
  • 100
  • 513
  • 979
  • I wouldn't use `std::copy` if my life depended on it. – Matt Joiner Sep 13 '10 at 14:02
  • @Matt Joiner: Could you please explain why you dislike `std::copy()` so much? – sharptooth Sep 13 '10 at 14:04
  • 1
    It's sugar coating, and brings no particular performance improvement. Furthermore it's harder to parse with the eye, and occupies far more space (a std::copy can occupy 160 characters). The only benefit is the fact it wraps a loop for you, which is easy to get wrong. But chances are people who are aware of std::copy, are able to get a loop right. – Matt Joiner Sep 14 '10 at 02:52
  • 1
    Careful with that assertion about performance, @MattJoiner. http://stackoverflow.com/questions/4707012/c-memcpy-vs-stdcopy/9980859#9980859 – David Stone Apr 05 '12 at 06:25
  • 1
    @MattJoiner: Although this is an old question, I wanted to mention that the closest equivalent to `std::memmove`/`std::memcpy`, that the STL has is not `std::copy`, but `std::copy_n`, which can accept the same input arguments, might even be less typing, because you don't need `sizeof` and - in addition - works on any type. And btw: I really don't understand, how you get the 160 characters. – MikeMB Sep 27 '15 at 19:10

4 Answers

34

Assuming a sane library implementor, memcpy will always be at least as fast as memmove. However, on most platforms the difference will be minimal, and on many platforms memcpy is just an alias for memmove to support legacy code that (incorrectly) calls memcpy on overlapping buffers.

Both memcpy and memmove should be written to take advantage of the fastest loads and stores available on the platform.

To answer your question: you should use the one that is semantically correct. If you can guarantee that the buffers do not overlap, you should use memcpy. If you cannot guarantee that the buffers don't overlap, you should use memmove.
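The rule above can be sketched as a tiny dispatch helper (`copy_bytes` and its `may_overlap` flag are hypothetical, purely to make the rule concrete):

```c
#include <string.h>

/* Hypothetical helper illustrating the rule: use memcpy when the
   caller can guarantee the regions do not overlap, and fall back
   to memmove when no such guarantee exists. */
void *copy_bytes(void *dst, const void *src, size_t n, int may_overlap)
{
    return may_overlap ? memmove(dst, src, n) : memcpy(dst, src, n);
}
```

In real code you would of course call the right function directly; the point is that the choice is driven by what you can guarantee, not by micro-benchmarks.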

Stephen Canon
  • 103,815
  • 19
  • 183
  • 269
  • +1. I especially like the "assuming sane" counterpoint to my own answer :-) – paxdiablo Dec 26 '09 at 00:37
  • Nitpick: `memcpy` and `memmove` should be written to take advantage of the fastest `unaligned` loads and stores available on the platform. If you know your buffers are aligned properly, you can often get much better performance using things like MMX, which copy much larger data units at a time. – Adam Rosenfield Dec 26 '09 at 01:40
  • 1
    @Adam: Generally speaking one can arrange to use aligned loads and stores in memcopy by first copying some smaller units to achieve appropriate alignment. If the buffers do not have similar alignment, it will be necessary to apply some shift or permute before storing, but this is faster than using unaligned memory accesses on many architectures. – Stephen Canon Dec 26 '09 at 02:23
  • @StephenCanon: I wonder if there would be any problem with a future C standard picking some never-before-used identifier and saying that any program which defines a function with the name `__uint32_copy_xy193qrq91` [or other similar names for other types] must implement it to have some particular semantics, and then defining that as the name for a new standard method to copy aligned `int32` data. Doing that would make it possible to write code which would work correctly on old compilers, but could achieve faster performance than `memcpy` on newer compilers [memcpy() will often be... – supercat Dec 06 '14 at 19:09
  • ...inefficient than a simple uint32-copying loop when copying just a few words of data when the programmer--but not the compiler--knows such a loop would work. Implementations of memcpy() are often optimized for the aligned-to-aligned scenario, but the time required to test for alignment might exceed the time required to actually copy a few words of data. – supercat Dec 06 '14 at 19:13
  • @supercat: Since the semantics of `memcpy` are completely defined, compilers are already free to do exactly that. I don't see why you would need a new function name. – Stephen Canon Dec 07 '14 at 00:48
  • @StephenCanon: For cases where the programmer knows that a pointer will always be word-aligned but the compiler can't know that, there's no way a legitimate memcpy() implementation could omit the code necessary to determine alignment. On the ARM7-TDMI, a 16-byte aligned copy would be two instructions totaling 12 cycles; no legitimate memcpy() could come close. – supercat Dec 08 '14 at 05:10
  • @supercat: Right, but the compiler is already allowed to look at a call to `memcpy` (or a loop copying by `uint32_t`s), recognize that the source and destination are both suitably aligned, and emit a call to `__builtin_memcpy_4_byte_aligned` or just lower to an inline copy sequence, or whatever else it wants to do. There's no need to add a new function name to the standard. – Stephen Canon Dec 08 '14 at 16:23
  • The programmer can provide this info to the compiler if the compiler couldn't otherwise deduce it by casting the memcpy arguments to suitably aligned pointer types. – Stephen Canon Dec 08 '14 at 16:24
  • @StephenCanon: If one has e.g. a 32-bit-aligned pointer to a sequence of structures holding four `uint16_t` values each, would one have to cast the address to `uint32_t*` and then to `unsigned char*` to avoid aliasing issues? – supercat Dec 08 '14 at 17:11
30

memcpy() doesn't have any special handling for overlapping buffers, so it can skip the checks that memmove() needs; that is why it can be faster than memmove().

Also on some architectures memcpy() can benefit from using CPU instructions for moving blocks of memory - something that memmove() cannot use.
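A simplified byte-at-a-time sketch of the extra work memmove() must do (and memcpy() may skip) looks like this. Note that real implementations copy in much wider units, and that comparing pointers into unrelated objects, as libraries do internally here, is not something strictly portable C can rely on:

```c
#include <stddef.h>

/* Sketch of memmove(): choose a copy direction that is safe even
   when the regions overlap. memcpy() may omit this check entirely. */
void *my_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    if (d < s) {
        while (n--) *d++ = *s++;        /* forward copy is safe */
    } else if (d > s) {
        d += n;
        s += n;
        while (n--) *--d = *--s;        /* backward copy avoids clobbering */
    }
    return dst;                         /* d == s: nothing to do */
}
```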

qrdl
  • 34,062
  • 14
  • 56
  • 86
  • Even on a RISC architecture, there are often block-move operations from which memcpy() can benefit. PowerPC has VMX, for example. – Crashworks Dec 25 '09 at 12:56
  • Nah, decent code generators produce rep movs after checking for no overlap. MSVC does. – Hans Passant Dec 25 '09 at 13:03
  • @nobugz Not always at compile time you can determine whether buffers overlap or not. Or did you mean checking at run time? – qrdl Dec 25 '09 at 13:12
  • @Crashworks Interesting, didn't know that. Seems I had experience with more RISCy RISCs, that have just load/store instructions to access memory. – qrdl Dec 25 '09 at 13:23
  • @qrdl: RISC doesn't have rep movs, but many RISC architectures have vector registers that are wider than the scalar core registers, and have a correspondingly wider path to/from memory. – Stephen Canon Dec 25 '09 at 14:35
  • Obviously I was wrong about RISC architectures - I removed that bit – qrdl Dec 25 '09 at 14:48
  • A few of the newer optimizations for memcpy() right now on modern CPU's are cache-based. Either using a temporary cache area for reads, cache prefetching from source, cache-zeroing on destination (dcbz on PPC), etc. Some CPU's also have "DMA-like" extensions for fully asynchronous copying. A good implementation of memmove will use the same optimized code as memcpy but it does indeed require a check to see if they overlap first so it will be slightly slower if you already know no overlap exists. – Adisak Dec 26 '09 at 00:40
  • `memmove()` can literally be implemented in terms of `memcpy` for the non-overlapping buffer case. So the slowdown is effectively two conditions and a call or so (if we discount inlining and manual optimized versions, which are the norm). Something that will often by dwarfed by unaligned copies and such ... so the argument is that for the harmless case there will be a hard to measure performance hit and for the harmful case (passing overlapping buffers to `memcpy`) there will be undefined behavior. I'll take `memmove` over `memcpy` _any day_! – 0xC0000022L Aug 28 '19 at 07:46
7

If you're interested in which will perform better, you need to test it on the target platform. Nothing in the standard mandates how the functions are implemented and, while it may seem logical that a non-checking memcpy would be faster, this is by no means a certainty.

It's quite possible, though unlikely, that the person who wrote memmove for your particular compiler was a certified genius while the poor soul who got the job of writing memcpy was the village idiot :-)

Although, in reality, I find it hard to imagine the memmove could be faster than memcpy, I don't discount the possibility. Measure, don't guess.
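A rough measurement sketch in that spirit is below. The helper name `time_copies`, the buffer size, and the use of clock() are all arbitrary choices for illustration; a serious benchmark would pin the CPU, vary sizes and alignments, and use a higher-resolution timer:

```c
#include <string.h>
#include <time.h>

/* Time `iters` calls of a memcpy-compatible copy function.
   Both memcpy and memmove match this function-pointer type,
   so the same harness can compare them on the target platform. */
double time_copies(void *(*copy)(void *, const void *, size_t),
                   char *dst, const char *src, size_t n, int iters)
{
    clock_t start = clock();
    for (int i = 0; i < iters; i++)
        copy(dst, src, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```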

paxdiablo
  • 854,327
  • 234
  • 1,573
  • 1,953
  • 1
    `memcpy` has the restrict qualifier on its arguments, not `memmove`. (It codifies precisely the fact that the buffers don't overlap). – Stephen Canon Dec 25 '09 at 14:39
  • D'Oh! You're right, of course, @StephenC, I got them the wrong way around. Removed that twaddle from my answer :-) – paxdiablo Dec 26 '09 at 00:36
2

On one ARM platform I'm working on, memmove was 3 times faster than memcpy for short unaligned loads. As memcpy and memmove are the only truly portable type-punning mechanisms, you would have thought there would be some check by the compiler before it tries to use NEON to do it.
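The type-punning use mentioned above looks like this; reading the bits of a float through memcpy avoids the aliasing and alignment problems of a pointer cast, and compilers typically lower such a small fixed-size copy to a single register move (the expected bit pattern assumes IEEE 754 single precision):

```c
#include <stdint.h>
#include <string.h>

/* Portable type punning: copy the object representation of a float
   into a uint32_t instead of casting pointers between the two types. */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```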

NadJack
  • 21
  • 2