
Are there faster alternatives to memcpy() in C++?

Björn Lindqvist
  • If there was a faster way, why wouldn't they use it in the `memcpy` implementation? – Mehrdad Afshari Jul 30 '09 at 21:38
  • @MehrdadAfshari: The `memcpy` function can be invoked with pointers of arbitrary alignment, to things of arbitrary PODS type, and may arbitrarily alias any PODS objects whose address has been exposed to outside code. Given `struct fnord a,b; void * volatile p=&a,*volatile q=&b;` I would expect `*((struct fnord*)p)=*((struct fnord*)q);` to perform much better than `memcpy(p,q, sizeof (struct fnord));` since in the former case a compiler could legitimately assume p and q will be aligned for a `struct fnord` and won't alias anything else, but in the latter case it cannot. – supercat Apr 13 '16 at 19:30

8 Answers


Unlikely. Your compiler/standard library will almost certainly have a very efficient, tailored implementation of memcpy, and memcpy is essentially the lowest-level API there is for copying one region of memory to another.

If you want further speedups, find a way to not need any memory copying.

nos
  • actually, there is at least one alternative that'll be faster in *some* cases at least, and should never be slower. See my answer. :) – jalf Jul 30 '09 at 21:56
  • -1: it's well known that GCC builtin functions suck (see Agner Fog's benchmarks). Well, maybe it has been finally fixed, but it illustrates the point that libraries are *not* necessarily optimized. – Bastien Léonard Jul 30 '09 at 22:58
  • @Bastien - could you provide a pointer to the Agner Fog benchmarks? I see there's a lot of information on his site about optimization, but I couldn't find any clear-cut benchmarks (except one table that compared some memcpy() & strlen() routines, and as far as I can tell the intrinsic support for the routines was turned off). – Michael Burr Jul 30 '09 at 23:36
  • @Michael: see the discussion that Agner created on GCC's mailing list: http://gcc.gnu.org/ml/gcc/2008-07/msg00410.html. – Bastien Léonard Jul 31 '09 at 15:44
  • Thanks for the pointer - I wonder if Fog's testing of intrinsic memcpy/memset code generation was targeted/tuned to generic/i386, or whether -march and/or -mtune were used? There might be some experiments on my machine in the near future... – Michael Burr Jul 31 '09 at 18:38

First, a word of advice. Assume that the people who wrote your standard library are not stupid. If there were a faster way to implement a general-purpose memcpy, they would have used it.

Second, yes, there are better alternatives.

  • In C++, use the std::copy function. It does the same thing, but it is 1) safer, and 2) potentially faster in some cases. It is a template, meaning that it can be specialized for specific types, making it potentially faster than the general C memcpy.
  • Or, you can use your superior knowledge of your specific situation. The implementers of memcpy had to write it so it performed well in every case. If you have specific information about the situation where you need it, you might be able to write a faster version. For example, how much memory do you need to copy? How is it aligned? That might allow you to write a more efficient memcpy for this specific case. But it won't be as good in most other cases (if it works at all).
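As a minimal illustration of the first point, here is a sketch of std::copy on a trivially copyable element type (the helper name `copy_buffer` is mine, not from the answer):

```cpp
#include <algorithm>
#include <vector>

// std::copy is type-safe; for trivially copyable element types most
// standard libraries lower it to the same memmove/memcpy the C library uses.
std::vector<int> copy_buffer(const std::vector<int>& src) {
    std::vector<int> dst(src.size());
    std::copy(src.begin(), src.end(), dst.begin());
    return dst;
}
```

Unlike memcpy, std::copy also works correctly for non-trivial types, falling back to element-wise assignment.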
jalf
  • It's unlikely that the compiler actually calls a memcpy function. I know that in gcc it doesn't, but actually replaces memcpy with a single instruction on i386. – Paul Biggar Jul 30 '09 at 22:17
  • @PaulBiggar: For POD types, GCC's std::copy will call `memmove`. If you provide aliasing hints with `__restrict` then it will call `memcpy`. – Peter Alexander Oct 03 '12 at 21:25

Optimization expert Agner Fog has published optimized memory functions: http://agner.org/optimize/#asmlib. It's under the GPL, though.

Some time ago Agner said that these functions should replace GCC builtins because they're a lot faster. I don't know if it's been done since then.

Bastien Léonard

This answer to a very similar question (about memset()) applies here, too.

It basically says that compilers generate very well-optimized code for memcpy()/memset() - and different code depending on the nature of the objects (size, alignment, etc.).

And remember, you may only memcpy() PODs in C++.
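That restriction can be enforced at compile time; a small sketch (the wrapper name `raw_copy` is hypothetical), using std::is_trivially_copyable, the C++11 formalization of the "memcpy-safe" requirement:

```cpp
#include <cstring>
#include <type_traits>

// memcpy wrapper that rejects non-trivially-copyable types at compile time.
// Calling it with e.g. std::string would trip the static_assert.
template <typename T>
void raw_copy(T* dst, const T* src, std::size_t count) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "memcpy is only well-defined for trivially copyable types");
    std::memcpy(dst, src, count * sizeof(T));
}
```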

Michael Burr

In order to find or write a fast memory copy routine, we should understand how processors work.

Processors since the Intel Pentium Pro do "out-of-order execution". They may execute many instructions in parallel if the instructions don't have dependencies. But this is only the case when the instructions operate on registers only. If they operate on memory, additional CPU units are used, called "load units" (to read data from memory) and "store units" (to write data to memory).

Most CPUs have two load units and one store unit, i.e. they can execute in parallel two instructions that read from memory and one instruction that writes to memory (again, if they don't affect each other). The size of these units is usually the same as the maximum register size: if the CPU has XMM registers (SSE), it's 16 bytes; if it has YMM registers (AVX), it is 32 bytes; and so on.

All instructions that read or write memory are translated to micro-operations (micro-ops) that go to the common pool of micro-ops and wait there for the load and store units to serve them. A single load or store unit can only serve one micro-op at a time, regardless of the data size it needs to load or store, be it 1 byte or 32 bytes.

So the fastest memory copy is to move data to and from registers of the maximum size. For AVX-enabled processors (but without AVX-512), the fastest way to copy memory is to repeat the following sequence, loop-unrolled:

vmovdqa     ymm0,ymmword ptr [rcx]
vmovdqa     ymm1,ymmword ptr [rcx+20h]
vmovdqa     ymmword ptr [rdx],ymm0
vmovdqa     ymmword ptr [rdx+20h],ymm1

The Google code posted earlier by hplbsh is not very good, because it uses all 8 xmm registers to hold the data before beginning to write it back, which is not needed: since we only have two load units and one store unit, just two registers give the best results. Using that many registers in no way improves performance.
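A portable sketch of the same idea (two loads, then two stores per iteration; the function name is mine). With optimization and AVX enabled, a compiler will typically turn a loop like this into wide vector moves similar to the sequence above:

```cpp
#include <cstdint>
#include <cstring>

// Unrolled copy mirroring the two-load-unit / one-store-unit pattern:
// two loads, then two stores per iteration, using only two temporaries.
void unrolled_copy(void* dst, const void* src, std::size_t n) {
    auto* d = static_cast<unsigned char*>(dst);
    auto* s = static_cast<const unsigned char*>(src);
    while (n >= 16) {
        std::uint64_t a, b;
        std::memcpy(&a, s, 8);      // load 1
        std::memcpy(&b, s + 8, 8);  // load 2
        std::memcpy(d, &a, 8);      // store 1
        std::memcpy(d + 8, &b, 8);  // store 2
        s += 16;
        d += 16;
        n -= 16;
    }
    while (n--) *d++ = *s++;        // byte tail
}
```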

A memory copy routine may also use some “advanced” techniques like “prefetch” to instruct the processor to load memory into cache in advance and “non-temporal writes” (if you are copying very large memory chunks and don’t need the data from the output buffer to be immediately read), aligned vs unaligned writes, etc.

Modern processors, released since 2013, if they have the ERMS bit in CPUID, have so-called "enhanced rep movsb", so for large memory copies, "rep movsb" may be used: the copy will be very fast, even faster than with the ymm registers, and it will work with the cache properly. However, the startup cost of this instruction is very high, about 35 cycles, so it pays off only on large memory blocks (however, this may change in future processors). See the "Explanation on relative performance" section in https://stackoverflow.com/a/43845229/6910868 and also see https://stackoverflow.com/a/43837564/6910868 for more information on "rep movsb".
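For completeness, a hedged sketch of a rep-movsb copy (GNU-style inline assembly, x86-64 only; the function name is mine, and on other targets it simply falls back to memcpy):

```cpp
#include <cstddef>
#include <cstring>

// "rep movsb" copies RCX bytes from [RSI] to [RDI]; on ERMS-capable CPUs
// this is competitive with vector loops for large blocks, despite the
// high startup cost mentioned above.
void rep_movsb_copy(void* dst, const void* src, std::size_t n) {
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    std::memcpy(dst, src, n);  // portable fallback
#endif
}
```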

I hope it is now easier for you to choose or write the best memory copy routine for your case.

You can even keep the standard memcpy/memmove, but add your own special largememcpy() for your needs.

Maxim Masiutin

I'm not sure that using the default memcpy is always the best option. Most memcpy implementations I've looked at try to align the data at the start, and then do aligned copies. If the data is already aligned, or is quite small, then this is wasted time.

Sometimes it's beneficial to have specialized word-copy, half-word-copy, and byte-copy versions of memcpy, as long as they don't have too negative an effect on the caches.
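One common way to express those specializations is a size dispatch; a sketch (the function name is mine), relying on the fact that compilers inline fixed-size memcpy calls down to one or two load/store instructions:

```cpp
#include <cstddef>
#include <cstring>

// Dispatch on size so the common small cases skip any alignment preamble.
// A fixed-size memcpy is inlined by the compiler to simple moves.
void small_copy(void* dst, const void* src, std::size_t n) {
    switch (n) {
        case 1:  std::memcpy(dst, src, 1); break;  // byte copy
        case 2:  std::memcpy(dst, src, 2); break;  // half-word copy
        case 4:  std::memcpy(dst, src, 4); break;  // word copy
        case 8:  std::memcpy(dst, src, 8); break;  // double-word copy
        default: std::memcpy(dst, src, n); break;  // general path
    }
}
```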

Also, you may want finer control over the actual allocation algorithm. In the games industry it's exceptionally common for people to write their own memory allocation routines, irrespective of how much effort the toolchain developers spent on it in the first place. The games I've seen almost always tend to use Doug Lea's Malloc.

Generally speaking though, you'd be wasting time trying to optimize memcpy as there'll no doubt be lots of easier bits of code in your application to speed up.

Greg Hewgill
DaveS

Depending on what you're trying to do... if it's a big enough memcpy, and you will only be writing to the copy sparsely, an mmap with MAP_PRIVATE to create a copy-on-write mapping could conceivably be faster.
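A Linux-only sketch of that idea (requires memfd_create, available with glibc 2.27+; the demo function name is mine). The MAP_PRIVATE view is an instant "copy": pages are duplicated only when written:

```cpp
#include <cstddef>
#include <cstring>
#include <sys/mman.h>
#include <unistd.h>

// Create a buffer backed by an in-memory file, then take a MAP_PRIVATE
// (copy-on-write) view of it. Writing through the view copies only the
// touched page; the original stays intact.
bool cow_demo() {
#if defined(__linux__)
    const std::size_t len = 4096;
    int fd = memfd_create("cow_demo", 0);
    if (fd < 0) return true;  // kernel lacks memfd_create; skip the demo
    if (ftruncate(fd, static_cast<off_t>(len)) != 0) { close(fd); return false; }

    // "Original" shared view, filled with data.
    char* orig = static_cast<char*>(mmap(nullptr, len, PROT_READ | PROT_WRITE,
                                         MAP_SHARED, fd, 0));
    std::memset(orig, 'A', len);

    // Copy-on-write "copy": constant time, regardless of len.
    char* copy = static_cast<char*>(mmap(nullptr, len, PROT_READ | PROT_WRITE,
                                         MAP_PRIVATE, fd, 0));
    copy[0] = 'B';  // only this one page is actually duplicated

    bool ok = copy[0] == 'B' && copy[1] == 'A' && orig[0] == 'A';
    munmap(copy, len);
    munmap(orig, len);
    close(fd);
    return ok;
#else
    return true;  // non-Linux: nothing to demonstrate
#endif
}
```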

smcameron
  • And the copy-on-write stuff will only work if the address space is in a different process (came back to say that). Actually, I don't think you have to write it to a file if you use the MAP_ANONYMOUS flag. – smcameron Jul 30 '09 at 21:51
  • 3
    no, memory mapping can be used between two memory locations as well – jalf Jul 30 '09 at 21:52
  • 1
    It hinges on the "depending what you're trying to do." If say, he's got 1Gb of memory that he's going to copy, and then maybe he's only going to modify a few kbytes of it, but doens't know which ahead of time, then doing the mmap involves only creating new virtual mapping to the same memory, which, in principle, could be faster than copying 1Gb. then if they're copy-on-write, only the pages touched by the few kbytes modifications would actually get copied by the virtual memory system. So, kind of a long shot that it would be faster, and depends on what he's doing. – smcameron Aug 03 '09 at 15:12
  • Creating such an mmap will be fast, but it just hides the memcpy and does it a bit later, when the mmapped memory is written. And this copying is initiated as a software interrupt, which is very slow (compared to memcpy). – osgx Mar 03 '10 at 08:34

Depending on your platform, there may be faster alternatives for specific use cases, like if you know the source and destination are aligned to a cache line and the size is an integer multiple of the cache line size. In general though, most compilers will produce fairly optimal code for memcpy.
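A sketch of that special case (the function name is mine): promise the compiler 64-byte alignment so it can emit full-width aligned vector moves with no alignment preamble; n is assumed to be a multiple of 64:

```cpp
#include <cstddef>
#include <cstring>

// Both pointers are promised to be cache-line (64-byte) aligned and n is
// assumed to be a multiple of 64, so the compiler may use aligned vector
// moves directly, skipping the runtime alignment checks a general memcpy
// would need.
void cacheline_copy(void* dst, const void* src, std::size_t n) {
#if defined(__GNUC__)
    dst = __builtin_assume_aligned(dst, 64);
    src = __builtin_assume_aligned(src, 64);
#endif
    std::memcpy(dst, src, n);
}
```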

mattnewport