
This is a fairly practical question for developers who are used to using multithreading for intensive calculations.

On a machine with a typical architecture built around an Intel or AMD multi-core processor, is it efficient to use multithreading to repeat a simple calculation over a large area of memory?

For instance, imagine that I want to increment a huge array of integers (or perform some other very simple operation on them) and share the workload between different threads, each with its own sub-array.

Depending on the number of cores and whether the processor is hyperthreaded, the machine can run some number N of simultaneous threads. Can the speed of my calculation be multiplied by something close to N? Or will a bottleneck in RAM access arise much sooner?

A typical machine my company can rent has N = 40. But if the bottleneck arises at 5 threads, those machines won't be useful for our purpose.

I know that in theory RAM access can be a bottleneck, but I would like practical feedback from experience with the same kind of fast operations repeated over a large region of memory.

Peter Cordes

1 Answer


It depends on the specifics of the machine architecture and configuration. For something like incrementing a huge array of integers, though, you can usually saturate the memory bus before you run out of cores, so memory becomes the bottleneck.

You can work out the machine's theoretical memory bandwidth from its detailed specs, and in practice you can expect multithreaded code to achieve somewhere between 80% and 100% of that.

Matt Timmermans
  • Also worth mentioning, Intel "client" chips (desktop/laptop e.g. quad-core) can nearly saturate their DRAM controllers with just a single core. ([Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?](https://stackoverflow.com/q/39260020)). But if your data can be hot in caches, especially having different cores doing multiple passes over a chunk of an array that fits in L2-cache, each core can access its own private L2 cache. (So cache-blocking is a very valuable optimization, letting your bandwidth scale ~linearly with number of cores.) – Peter Cordes Apr 01 '21 at 18:57
  • https://software.intel.com/content/www/us/en/develop/articles/cache-blocking-techniques.html / [What Every Programmer Should Know About Memory?](https://stackoverflow.com/a/47714514) has an example of doing that for an NxN matmul with SIMD. Also https://en.wikipedia.org/wiki/Loop_nest_optimization describes it. – Peter Cordes Apr 01 '21 at 18:58
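The cache-blocking idea from the comments above can be sketched as follows for the multi-pass case: instead of making P full passes over the whole array (P trips to DRAM), make all P passes over one chunk before moving to the next, so the repeated accesses hit in cache. The 256 KB chunk size is an assumption standing in for a per-core L2 capacity; it is not a universal constant.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Assumed block size: small enough to stay resident in a core's private L2.
constexpr size_t kChunkInts = 256 * 1024 / sizeof(int);

void process_blocked(std::vector<int>& a, int passes) {
    for (size_t base = 0; base < a.size(); base += kChunkInts) {
        size_t end = std::min(base + kChunkInts, a.size());
        // All passes run while this chunk is hot in cache.
        for (int p = 0; p < passes; ++p)
            for (size_t i = base; i < end; ++i)
                a[i] += 1;
    }
}
```

With one thread per chunk range, each core reuses its own L2, which is how bandwidth can scale roughly linearly with core count instead of being capped by shared DRAM bandwidth. Note this only helps when the algorithm really does revisit the same data; a single pass, as in the original increment example, gains nothing from blocking.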