Practically, if performance matters, measure it. `std::copy` and `memcpy` are usually highly optimized, using sophisticated performance tricks. Your compiler may or may not be clever enough, or have the right configuration options, to achieve that performance from a raw loop.
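For reference, here is a minimal sketch of the three variants under discussion; the buffer size and element type are arbitrary placeholders, and you would build with optimizations (e.g. `-O3`) to give the raw loop a fair chance:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

int main() {
    constexpr std::size_t n = 1 << 20;        // placeholder size
    std::vector<double> src(n, 1.0), dst(n);

    // Raw loop: trivially correct, but vectorization depends on the
    // compiler and its flags.
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i];

    // Idiomatic C++: typically lowers to an optimized memmove/memcpy.
    std::copy(src.begin(), src.end(), dst.begin());

    // C-style: valid for trivially copyable element types only.
    std::memcpy(dst.data(), src.data(), n * sizeof(double));
}
```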
That said, theoretically, parallelizing the copy can provide a benefit. On modern systems you must use multiple threads to fully utilize both your memory and cache bandwidth. Take a look at these benchmark results, where the first two rows compare parallel versus single-threaded cache bandwidth, and the last two rows compare parallel versus single-threaded main-memory bandwidth. On a desktop system like yours the gap is not very large, but on a high-performance-oriented system, especially one with multiple sockets, multiple threads are essential to exploit the available bandwidth.
For an optimal solution, you have to consider things like not writing to the same cache line from multiple threads (false sharing). Also, if your compiler doesn't produce perfect code for the raw loop, you may have to run `std::copy` on multiple threads/chunks yourself, as in the sketch below. In my tests, the raw loop performed much worse, because it doesn't use AVX. Only the Intel compiler managed to actually replace parts of the OpenMP loop with an `avx_rep_memcpy`; interestingly, it did not perform this optimization with a non-OpenMP loop. The optimal number of threads for memory bandwidth is also usually not the number of cores, but fewer.
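To make the chunked idea concrete, here is a sketch that runs `std::copy` on contiguous per-thread chunks with OpenMP. `parallel_copy` and the chunking scheme are my own illustration, not a library facility, and the thread count would need tuning for your hardware:

```cpp
#include <algorithm>
#include <cstddef>
#include <omp.h>

// Sketch: each thread copies one contiguous chunk, so threads never
// write to the same cache line except possibly at chunk boundaries.
void parallel_copy(const double* src, double* dst, std::size_t n) {
    #pragma omp parallel
    {
        const std::size_t nthreads = omp_get_num_threads();
        const std::size_t tid      = omp_get_thread_num();
        // Evenly sized chunks; std::min clips the final (short) chunk.
        const std::size_t chunk = (n + nthreads - 1) / nthreads;
        const std::size_t begin = std::min(n, tid * chunk);
        const std::size_t end   = std::min(n, begin + chunk);
        std::copy(src + begin, src + end, dst + begin);
    }
}
```

Since fewer threads than cores often already saturate the bandwidth, you would pin the thread count explicitly (e.g. via `OMP_NUM_THREADS` or `omp_set_num_threads`) and measure.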
The general recommendation is: start with a simple implementation, in this case the idiomatic `std::copy`, and later profile your application to understand where the bottleneck actually is. Do not invest in complex, hard-to-maintain, system-specific optimizations that may only affect a tiny fraction of your code's overall runtime. If it turns out this is a bottleneck for your application, and your hardware resources are not utilized well, then you need to understand the performance characteristics of your underlying hardware (local/shared caches, NUMA, prefetchers) and tune your code accordingly.