
I am aware of the bottlenecks of parallel programs, including the memory access speed limit that can cause multithreaded programs to run slower than expected, or even slower than their sequential versions. I was wondering whether the reverse is possible (explained below).

Specifically, I was wondering if there is a case where a multithreaded program with k threads runs more than k times faster than the sequential version of the program on the same machine. Say the sequential version takes 100 seconds to do the task, while the multithreaded version using 5 threads finishes in 10 seconds, i.e. a speedup of 10 with only 5 threads (a so-called super-linear speedup).

I assume that both programs use the same algorithm, the same data structures, the same implementation, and the same compiler with the same optimization options.

One possible cause could be hardware that performs better for multithreaded programs, but I am not aware of such hardware (the question is whether any exists). The other could be the software implementation at lower levels, for example the compiler optimizing the multithreaded version better. But does such a case exist in real compilers?

Edit: One hint in this direction is that AMD CPUs are said to perform better on multithreaded tasks than on single-threaded tasks. But how is that achieved? (Difference between intel and AMD multithreading)

Fazeleh
  • It primarily depends on the number of available cores, and how the operating system schedules the threads to them. Also, the way the code is written regarding result synchronization highly influences the overall performance. – πάντα ῥεῖ Jun 26 '16 at 08:14
  • @πάνταῥεῖ I understand that all the reasons you mentioned can cause overhead and reduce the speed of the multithreaded program. But how could they increase the performance as I described? Also, when we are using `k` threads, it really doesn't matter whether we have k^2 cores, since the program will never use more than `k` of them (in this case, obviously, the number of cores is more than `k`). – Fazeleh Jun 26 '16 at 08:18

2 Answers


It is absolutely possible. One case that I can think of is when the program is

  1. I/O-bound rather than CPU-bound and
  2. the slowness of the I/O subsystem is amplified by the random data access pattern, which confuses the prefetching heuristic.

Under such conditions, the single-threaded version of the program is frequently stalled until the necessary data is provided by the I/O subsystem (from RAM, disk, database, etc.). With multiple threads, it may happen that the random data accesses issued by different threads interleave to form a perfectly predictable pattern (e.g. ideal sequential access), which enables the prefetching heuristic to boost the throughput of the I/O subsystem by orders of magnitude and almost completely eliminate the I/O waiting time.
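
Here is a minimal sketch of that interleaving idea (the array size, the thread count and the deliberately constructed index split are only illustrative, not taken from a real workload): each thread's own accesses are strided, but together the threads touch consecutive elements at roughly the same time, producing a combined access pattern that a hardware prefetcher can follow far more easily than a single thread's random walk.

```cpp
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    constexpr std::size_t k = 5;               // number of worker threads (illustrative)
    std::vector<double> data(1 << 24, 1.0);    // large shared read-only array
    std::vector<double> partial(k, 0.0);       // one partial result per thread

    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < k; ++t) {
        workers.emplace_back([&, t] {
            double sum = 0.0;
            // Thread t touches elements t, t+k, t+2k, ... so the k threads
            // together sweep the array front to back.
            for (std::size_t i = t; i < data.size(); i += k)
                sum += data[i];
            partial[t] = sum;                  // each thread writes only its own slot
        });
    }
    for (auto& w : workers) w.join();

    double total = std::accumulate(partial.begin(), partial.end(), 0.0);
    return total > 0.0 ? 0 : 1;                // keep the result observable
}
```

(False sharing on the small `partial` vector is ignored here for brevity; the point is only the shape of the combined memory access pattern.)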

While such a strange speedup may be the result of lucky timing between different threads, it is in fact not due to the introduction of multithreading per se, but could be achieved more deterministically by optimizing the algorithm for data access efficiency. However, in some cases it is easier to introduce a helper thread that mimics the data access pattern of the real worker thread and let the helper run (as a well-informed prefetcher) slightly ahead of the worker, so that the latter doesn't have to wait for the data. In those cases, multithreading can be considered the real tool that speeds up the program through the described effect.
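
Here is a minimal sketch of such a helper prefetcher, assuming the worker's access order is known in advance; the reversed traversal order (standing in for a 'random' but precomputed pattern) and the lead distance are only illustrative choices:

```cpp
#include <atomic>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    std::vector<double> data(1 << 24);
    std::iota(data.begin(), data.end(), 0.0);

    // Access order the worker will follow; reversed here as a stand-in for a
    // data-dependent order that is known (or computable) ahead of time.
    std::vector<std::size_t> order(data.size());
    std::iota(order.rbegin(), order.rend(), std::size_t{0});

    std::atomic<std::size_t> worker_pos{0};
    constexpr std::size_t lead = 4096;         // how far ahead the helper may run

    std::thread helper([&] {
        for (std::size_t i = 0; i < order.size(); ++i) {
            // Throttle: don't run arbitrarily far ahead of the worker.
            while (i > worker_pos.load(std::memory_order_relaxed) + lead)
                std::this_thread::yield();
            // Touch the element so it is pulled into cache before the worker
            // needs it; the volatile local keeps the read from being removed.
            volatile double sink = data[order[i]];
            (void)sink;
        }
    });

    double sum = 0.0;
    for (std::size_t i = 0; i < order.size(); ++i) {
        sum += data[order[i]];                 // the real work
        worker_pos.store(i, std::memory_order_relaxed);
    }
    helper.join();
    return sum > 0.0 ? 0 : 1;
}
```

Whether this actually pays off depends heavily on the cache hierarchy, the lead distance and how the two threads are scheduled, so treat it as a sketch of the idea rather than a tuned implementation.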

Leon
  • This is a completely reasonable answer. For example, if memory access is slow and the cache memory of the CPU is not very large, then this is very likely to happen. Is that the case for AMD CPUs? – Fazeleh Jun 27 '16 at 09:57
  • @ReD This is not very likely to happen unless your multithreaded code is carefully crafted with this effect in mind. AMD CPUs are subject to this effect like any other CPU with a hierarchical memory model (i.e. a couple of levels of CPU caches, RAM, external memory (local disk, network storage)). – Leon Jun 27 '16 at 10:23

Unlikely, but marginally possible on a system where thread switches are expensive and inter-thread communication is cheap. That is of course rare; the performance of those two types of operation is generally correlated.

MSalters