
Assume that one wants to make a copy of an array declared as

DATA_TYPE src[N];

Is memcpy always at least as fast as the following code snippet, regardless of DATA_TYPE and the number of elements in the array?

DATA_TYPE dest[N];

for (int i=0; i<N; i++)
    dest[i] = src[i];

For a small type like char and large N we can be sure that memcpy is faster (unless the compiler replaces the loop with a call to memcpy). But what if the type is larger, like double, and/or the number of array elements is small?

This question came to my mind when copying many arrays of doubles each with 3 elements.

I didn't find an answer to my question in the answer to the other question mentioned by wohlstad in the comments. The accepted answer in that question essentially says "leave it for the compiler to decide." That's not the sort of answer I'm looking for. The fact that a compiler can optimize memory copying by choosing one alternative is not an answer. Why and when is one alternative faster? Maybe compilers know the answer, but developers, including compiler developers, don't know!

apadana
    Does this answer your question? [memcpy vs assignment in C](https://stackoverflow.com/questions/324011/memcpy-vs-assignment-in-c) – wohlstad Apr 23 '22 at 07:36
  • @wohlstad, thanks. I'm reading it. – apadana Apr 23 '22 at 07:38
  • Use http://godbolt.org to examine the alternative codes. Chances are, the optimizer will produce almost identical assembly for both. – hyde Apr 23 '22 at 07:38
  • @wohlstad, The accepted answer in that question essentially says "leave it for the compiler to decide." That's not the sort of answer I'm looking for. – apadana Apr 23 '22 at 07:48
  • @wohlstad, the fact that a compiler can optimize memory copying by choosing one alternative is not an answer. Why and when is one alternative faster? Maybe compilers know the answer, but developers, including compiler developers, don't know! – apadana Apr 23 '22 at 07:54
  • Many implementations of `memcpy` have optimizations to copy word-sized elements on word boundaries. For example, if your `DATA_TYPE` was a byte, then your manual loop, assuming no optimizations, will do N iterations. On a 64-bit machine, memcpy may do the entire job in N/8 iterations - copying 64 bits at a time. – selbie Apr 23 '22 at 08:00
  • What sort of answer *are* you looking for? There is no silver bullet to which is "better". The variables involved are too diverse and situationally dependent. The only way you're going to know for *certain* is to produce your optimized asm, count the clock cycles and pipeline stalls, including potential branch prediction failures, cache locality, and above all of that, *measure*. – WhozCraig Apr 23 '22 at 08:02
  • @WhozCraig, Correct, but usually there are general rules. – apadana Apr 23 '22 at 08:19
  • "for a small type ... we can be sure that `memcpy` is faster" - No, you cannot. – Cheatah Apr 23 '22 at 08:22
  • The general rule is: unless you have a specific need for optimization, write the code so that it is as easy to read and maintain as possible. When there is a need for optimization, the solution is highly dependent on the target system. – nielsen Apr 23 '22 at 08:39
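The word-at-a-time technique mentioned in the comments can be sketched as follows. This is a deliberate simplification: a real memcpy also handles misaligned pointers, and copying through uint64_t like this in user code technically runs afoul of alignment and aliasing rules that the library's own memcpy is allowed to sidestep.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified word-at-a-time copy: move n bytes 8 at a time, then
   finish the remaining n % 8 bytes one by one. Assumes both pointers
   are suitably aligned for uint64_t access. */
static void copy_words(void *dst, const void *src, size_t n)
{
    uint64_t *d = dst;
    const uint64_t *s = src;
    size_t words = n / 8;

    for (size_t i = 0; i < words; i++)
        d[i] = s[i];

    /* tail: the last n % 8 bytes, copied byte by byte */
    unsigned char *db = (unsigned char *)(d + words);
    const unsigned char *sb = (const unsigned char *)(s + words);
    for (size_t i = 0; i < n % 8; i++)
        db[i] = sb[i];
}
```

For a byte-sized DATA_TYPE this does roughly N/8 loop iterations instead of N, which is the effect selbie describes.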

1 Answer

Since memcpy is a library function, how efficient it actually is depends entirely on the library implementation, so no definitive answer is possible.

That said, any provided standard library is likely to be highly optimised and may even use hardware-specific features such as DMA transfer. Your loop's performance, on the other hand, will vary with the optimisation settings, and is likely to be much worse in unoptimised debug builds.

Another consideration is that the performance of memcpy() is independent of the data type and generally deterministic, whereas the loop's performance is likely to vary with DATA_TYPE, or even the value of N.

Generally, I would expect memcpy() to be as fast as or faster than an assignment loop, and certainly more consistent and deterministic, being independent of specific compiler settings and even of the compiler used.

In the end, the only way to tell is to measure it for your specific platform, toolchain, library and build options, and for the data types involved. Since you would have to measure every usage combination to know which is faster, I suggest it is generally a waste of time and of academic interest only. Use the library - not only for performance and consistency, but also for clarity and maintainability.

Clifford
  • Would you yourself use `memcpy` for copying an array of 3 `double`s? – apadana Apr 23 '22 at 08:30
  • @apadana Possibly not because for that the function call overhead would be significant, but I would probably not use a loop either for the same reason. While an optimiser might unroll such a loop, it is by no means a given and unrolling it yourself would lead to better performance in unoptimised builds. – Clifford Apr 23 '22 at 08:37
  • Good answer. Now that you mention DMA, it is a fun fact that DMA transfer may - depending on the platform - be slower than e.g. `memcpy()`, but it will allow the processor to do other things in the meantime. In such a case, "faster" depends on whether or not the application is able to make use of this offloading. – nielsen Apr 23 '22 at 08:44
  • @apadana : generally though, I would not sweat the small stuff. Such micro optimisations are seldom productive or necessary, and difficult to generalise. The only time I recall needing to worry about this sort of thing is in DSP code on a microcontroller-based hard real-time system. And there I tested and profiled various implementations to determine the best solution for that specific application on that specific platform and toolchain. – Clifford Apr 23 '22 at 08:49
  • @nielsen : yes, in practice I doubt an implementation would use DMA as it is less deterministic and not portable. Also for small copies, the setup time would be prohibitive. Instead specialised DMA mem copy functions might be provided. – Clifford Apr 23 '22 at 08:54