Practically, if performance matters, measure it. `std::copy` and `memcpy` are usually highly optimized, using sophisticated performance tricks. Your compiler may or may not be clever enough, or have the right configuration options, to achieve that performance from a raw loop.
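For reference, here is a minimal sketch of the three variants under discussion; the buffer size and element type are arbitrary placeholders, and you would build with optimizations (e.g. `-O3`) to give the raw loop a fair chance:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

int main() {
    constexpr std::size_t n = 1 << 20;        // placeholder size
    std::vector<double> src(n, 1.0), dst(n);

    // Raw loop: trivially correct, but vectorization depends on the
    // compiler and its flags.
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i];

    // Idiomatic C++: typically lowers to an optimized memmove/memcpy.
    std::copy(src.begin(), src.end(), dst.begin());

    // C-style: valid for trivially copyable element types only.
    std::memcpy(dst.data(), src.data(), n * sizeof(double));
}
```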
That said, theoretically, parallelizing the copy can provide a benefit. On modern systems you must use multiple threads to fully utilize both your memory and cache bandwidth. Take a look at these benchmark results, where the first two rows compare parallel versus single-threaded cache bandwidth, and the last two rows compare parallel versus single-threaded main-memory bandwidth. On a desktop system like yours the gap is not very large, but on a high-performance-oriented system, especially one with multiple sockets, multiple threads are essential to exploit the available bandwidth.
For an optimal solution, you have to consider things like not writing to the same cache line from multiple threads (false sharing). Also, if your compiler doesn't produce perfect code for the raw loop, you may have to run `std::copy` on multiple threads/chunks yourself, as in the sketch below. In my tests, the raw loop performed much worse, because it doesn't use AVX. Only the Intel compiler managed to actually replace parts of the OpenMP loop with an `avx_rep_memcpy`; interestingly, it did not perform this optimization with a non-OpenMP loop. The optimal number of threads for memory bandwidth is also usually not the number of cores, but fewer.
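To make the chunked idea concrete, here is a sketch that runs `std::copy` on contiguous per-thread chunks with OpenMP. `parallel_copy` and the chunking scheme are my own illustration, not a library facility, and the thread count would need tuning for your hardware:

```cpp
#include <algorithm>
#include <cstddef>
#include <omp.h>

// Sketch: each thread copies one contiguous chunk, so threads never
// write to the same cache line except possibly at chunk boundaries.
void parallel_copy(const double* src, double* dst, std::size_t n) {
    #pragma omp parallel
    {
        const std::size_t nthreads = omp_get_num_threads();
        const std::size_t tid      = omp_get_thread_num();
        // Evenly sized chunks; std::min clips the final (short) chunk.
        const std::size_t chunk = (n + nthreads - 1) / nthreads;
        const std::size_t begin = std::min(n, tid * chunk);
        const std::size_t end   = std::min(n, begin + chunk);
        std::copy(src + begin, src + end, dst + begin);
    }
}
```

Since fewer threads than cores often already saturate the bandwidth, you would pin the thread count explicitly (e.g. via `OMP_NUM_THREADS` or `omp_set_num_threads`) and measure.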
The general recommendation is: start with a simple implementation, in this case the idiomatic `std::copy`, and later profile your application to understand where the bottleneck actually is. Do not invest in complex, hard-to-maintain, system-specific optimizations that may only affect a tiny fraction of your code's overall runtime. If it turns out this is a bottleneck for your application, and your hardware resources are not utilized well, then you need to understand the performance characteristics of your underlying hardware (local/shared caches, NUMA, prefetchers) and tune your code accordingly.