
Is multi-threaded memory access faster than single-threaded memory access?

Assume we are working in C. A simple example: I have a gigantic array A and I want to copy it into an array B of the same size. Is a multi-threaded memory copy faster than a single-threaded one? How many threads are suitable for this kind of memory operation?

EDIT: Let me make the question narrower. First of all, we are not considering the GPU case. Memory access optimization is very important and effective in GPU programming, and in my experience we always need to be careful about memory operations there; that is not always the case when we work on the CPU. In addition, let's not consider SIMD instructions such as AVX and SSE; those also run into memory performance issues when a program has many memory access operations relative to its computational operations. Assume that we work on an x86 architecture with 1-2 CPUs, where each CPU has multiple cores and a quad-channel memory interface, and the main memory is DDR4, as is common today.

My array is an array of double-precision floating-point numbers, with a size similar to the L3 cache of a CPU, roughly 50 MB. Now, I have two cases: 1) copy this array into another array of the same size, either element by element or with memcpy; 2) combine a lot of small arrays into this gigantic array. Both are real-time operations, meaning they need to be done as fast as possible. Does multi-threading give a speedup or a slowdown? What factors affect the performance of memory operations in this case?

Someone said it will mostly depend on DMA performance. I think that applies when we use memcpy. What about an element-wise copy: does the data pass through the CPU cache first?

Stargateur
user3677630
  • It depends on a lot of factors. – Jabberwocky Feb 07 '17 at 21:04
  • why the downvotes? – Keith Nicholas Feb 07 '17 at 21:04
  • do your own measurements. There are so many parameters involved: cache misses, length of the data... I'd say that since it's not a CPU-intensive operation, the bottleneck will be the access to the memory (a bit like disk access). I'd go for a single thread, but I'd compare with multithread to make sure... – Jean-François Fabre Feb 07 '17 at 21:04
  • The memory (RAM) is not "multi-core"; it is a linear array of bytes. You cannot access it at higher speed from 2, 5, or 20 cores, because it is a shared resource. If you need to read RAM, do a complex calculation, and store the result, then multicore will be faster because of the time spent on calculations (eventually). – i486 Feb 07 '17 at 21:06
  • I think the correct answer to this needs to highlight that it's not a binary yes/no question, but it's still a good question to ask, and it leads into other questions about multithreading and CPU architecture. – Keith Nicholas Feb 07 '17 at 21:07
  • I want to upvote the question in the last sentence and downvote the question in the first sentence. – Drew Dormann Feb 07 '17 at 21:08
  • A multi-threaded version would let other tasks keep executing during the copy, but it adds the overhead of creating the threads, keeping them running, and terminating them. My opinion is that a single-threaded copy would be more efficient. – Thomas Matthews Feb 07 '17 at 21:11
  • In most computer architectures, the bottleneck is the data bus. Each core and device must share the same data bus to the memory. Any techniques other than direct memory access will add delays to the process. – Thomas Matthews Feb 07 '17 at 21:15
  • Voted to close as too broad, but it looks salvageable. For the moment, both the top voted comment and top voted answer are "It depends", which would be my response as well. – MSalters Feb 07 '17 at 21:23
  • This is one of those things where the answer was clearly "no" in the past. But it's changing now. Yes, there's no doubt you will be memory-bound if all you're doing is copying memory. But that doesn't mean you can't do it faster with multiple threads. On some systems (such as multi-socket servers or HEDTs with quad-channel DDR4), it's no longer possible for a single thread to saturate all the memory bandwidth in the entire system. – Mysticial Feb 07 '17 at 21:32
  • I tried it on my machine, and a 4-threaded version with a thread pool was about 2.75 times as fast as a single-threaded one. I guess it depends on the concurrency of the CPU (cores, multithreading) and on how it's connected to the RAM. Ulrich Drepper's infamous paper probably has an excruciatingly detailed explanation for my result. Maybe someday I'll delve into it. – Petr Skocik Feb 07 '17 at 21:51
  • @PSkocik: I got similar results with pthreads. 8GB of data took 0.83 seconds to copy unthreaded with memmove, 0.93 seconds with one pthread, and 0.35 seconds with four pthreads. (Thread creation and joining included.) – Thomas Padron-McCarthy Feb 07 '17 at 22:33
  • @i486: Memory isn't "multi-core", but it's also not "single-core" or a linear array of bytes. On common systems, it's an array of DDR DIMMs behind a multi-level cache. There's no single memory bus; there are multiple buses/channels. – MSalters Feb 08 '17 at 08:35
  • @MSalters OK, but this is not the common case. The original question had no details about the type of RAM, etc. E.g. the Raspberry Pi has a multi-core CPU; will it copy RAM faster with a multi-core program? – i486 Feb 08 '17 at 09:07
  • @PSkocik: I got a similar result before I asked the question, and testing again afterwards to double-check gave the same outcome. So I'm wondering why a DMA operation would be affected by how many cores the program uses. Btw, can you tell me the paper's title? – user3677630 Feb 08 '17 at 19:33
  • @user3677630 I don't think DMA will be used at all (http://stackoverflow.com/questions/23580242/using-dma-memory-transfer-in-user-space/23582568#23582568) so it comes down to word-by-word copying through the CPU, which is naturally sped up by multiple cores. https://en.wikipedia.org/wiki/Multi-channel_memory_architecture might play a role too. – Petr Skocik Feb 08 '17 at 19:38

3 Answers


It depends on many factors. One factor is the hardware you use. On modern PC hardware, multithreading will most likely not lead to a performance improvement, because CPU time is not the limiting factor of a copy operation; the limiting factor is the memory interface. The CPU will most likely use the DMA controller to do the copying, so it will not be too busy while the data is copied.

Xaver
  • Interestingly enough, my simple test shows that the task scales perfectly and is 100% CPU bound (while essentially calling memcpy): http://coliru.stacked-crooked.com/a/a61707960de650d9 – Lol4t0 Feb 07 '17 at 22:00
  • @Anony-Mousse it does call the real memcpy, though. The assembly printout is there just so you can look it up. – Lol4t0 Feb 07 '17 at 22:13
  • How do you know that the CPU will use the DMA controller? I'm skeptical, but open minded. Do you have a source? – Jordan Melo Feb 07 '17 at 22:19

Over the years, CPU performance has increased greatly, almost exponentially, while RAM performance couldn't keep up. That has made the cache ever more important, especially since the Celeron era.

So you can see either an increase or a decrease in performance, depending heavily on:

  • the memory fetch and store units per core
  • the memory controller modules
  • the pipeline depths of the memory modules and the number of memory banks
  • the memory access patterns of each thread (software)
  • the alignment of data chunks and instruction blobs
  • the sharing of common hardware resources and their datapaths
  • the operating system preempting the threads too often

Simply optimize the code for the cache; then the quality of the CPU will decide the performance.


Example:

The FX-8150 has weaker cores than the i7-4700:

  • FX cores can scale with extra threads, but the i7 tops out with just a single thread (I mean in memory-heavy code)
  • The FX has more L3 cache, but it is slower
  • The FX can work with higher-frequency RAM, but the i7 has better inter-core data bandwidth (in case one thread sends data to another)
  • The FX pipeline is very long, too long to recover quickly after a branch misprediction

It looks like AMD shares performance among threads at a finer grain, while Intel gives the power to a single thread (council assembly vs. monarchy). Maybe that's why AMD is better at GPUs and HBM.


If I had to stop speculating, I would care only about the cache, since it cannot be altered inside the CPU, while RAM can come in many combinations on a motherboard.

huseyin tugrul buyukisik

Assuming an AMD/Intel x86-64 architecture.

One core is not capable of saturating the memory bandwidth, but that does not mean a multi-threaded copy is automatically faster: the threads must run on different cores. Launching as many threads as there are physical cores should give a speedup, since the OS will most likely assign the threads to different cores. Your threading library should also have a function for binding a thread to a specific core, and using it is best for speed. Another thing to think about is NUMA, if you have a multi-socket system. For maximum speed you should also consider using AVX instructions.

fhtuft