5

I am currently thinking of reasons why multi-threaded applications may not scale well.

Two reasons I am aware of and that I have been fighting with are:

  1. Communication between threads is done poorly and slows everything down.
  2. The number of cores on a chip and the memory bandwidth to the CPU do not increase proportionally. This leads to lower memory bandwidth per core the more cores on a chip are heavily used.

What other problems are there?

Tim Cooper
  • 10,023
  • 5
  • 61
  • 77
Erik
  • 11,944
  • 18
  • 87
  • 126
  • `Communication between processes...` you said multi-THREADED – UmNyobe Apr 27 '12 at 08:51
  • ..probably a typo/braino. I'm sure OP meant 'between threads' :) – Martin James Apr 27 '12 at 08:52
  • 1
    okay, to be correct, we have to either stick to processes or threads. anyway, as I am more interested in the theory than in practical correctness - please forgive me :) – Erik Apr 27 '12 at 11:56
  • I’d add (3) the threads all need to access some shared resource (eg a mutex or the hard drive or etc) and they end up taking turns accessing it, with most of the threads blocked waiting for their turn a lot of the time. – Jeremy Friesner Jul 26 '21 at 14:28

5 Answers

2

For point 1), the issue is not necessarily that communication is 'not done well'; in most cases there are critical sections where processes/threads have to wait for each other, e.g. to update some shared data. This is well described by Amdahl's law.
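To make that concrete, here's a minimal sketch of Amdahl's law in Python (the 10% serial fraction is just an assumed value for illustration):

```python
# Amdahl's law: speedup with n workers when a fraction s of the work
# is inherently serial (e.g. threads waiting on critical sections).
def amdahl_speedup(n, serial_fraction):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

# With 10% serial work, 16 cores give well under 16x speedup,
# and the limit as n grows is only 1/0.1 = 10x.
for n in (1, 2, 4, 8, 16):
    print(n, round(amdahl_speedup(n, 0.1), 2))
```

However many threads you add, the serial fraction puts a hard ceiling on the speedup.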

Another point I'd like to add is the scalability of the task itself. If the task (the input) is not scalable, then increasing processing power (cores/threads) cannot improve overall throughput. For example, suppose an application handles data flows, but packets from the same flow cannot be processed in parallel (for ordering reasons); then scalability is limited by the number of flows.
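A common way to enforce that constraint is to route packets to workers by flow. Here's a sketch (the flow-id format and packet counts are made up for illustration): one flow always lands on the same worker queue, which preserves ordering but caps usable parallelism at the number of distinct flows:

```python
from collections import defaultdict

# Route packets to worker queues by hashing the flow id, so all
# packets of one flow land on the same worker and keep their order.
def partition_by_flow(packets, num_workers):
    queues = defaultdict(list)
    for flow_id, payload in packets:
        queues[hash(flow_id) % num_workers].append((flow_id, payload))
    return queues

# 16 workers but only 3 distinct flows: at most 3 queues are ever
# non-empty, so at most 3 workers can be busy at once.
packets = [(f"flow{i % 3}", i) for i in range(12)]
queues = partition_by_flow(packets, 16)
print(len(queues))
```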

In addition, the scalability of the algorithm is even more fundamental, considering the difference between O(1) and O(n) algorithms. Of course, the topic here probably focuses on the scalability of processing power rather than of data size.

Han Zhou
  • 416
  • 4
  • 3
1

I think that, in (1), you've nailed one of the most important factors that can negatively influence the performance of multithreaded apps. Especially, Google for 'false sharing'.

(2), however, only affects a certain set of multithreaded apps - those that run CPU-bound threads in parallel. If an app uses many threads that are I/O bound, (2) does not matter too much.

Looking at my box here, it has 100 processes and 1403 threads, CPU use 3%. Only 7 out of the 100 processes are single-threaded. Most of the apps, therefore, are multithreaded but I/O waiting.

My box would work reasonably well, at the moment, if it had only one core. Sure, hitting a link that winds up my browser would probably be a bit slower to bring up a complex page, but not much.

In the commonest case then, where apps are multithreaded to take advantage of the high I/O performance of preemptive multitaskers, apps scale very well indeed, even on a single-core CPU.

Try not to fall into the trap of thinking that preemptive multitasking OSes are all about 'doing CPU-bound tasks in parallel' - they actually make this difficult by forcing the need for locking, synchro, signalling etc. It's much more about high-performance I/O, something that a cooperative scheduler is spectacularly bad at.

Martin James
  • 24,453
  • 3
  • 36
  • 60
  • Thank you! Great points, especially interesting when you talk about the scheduler – Erik Apr 27 '12 at 11:58
  • The last cooperatively-scheduled OS I used was Win 3.1 :((( Every time I listen to Florence and 'Dog days are over', I think about Win 3.1, 'Shake it out'-> Win 95, 'Remain Nameless'-> Win ME. Windows boxes became useful with NT/W2K - full prioritised preemptive schedulers. – Martin James Apr 27 '12 at 12:22
1

Many multi-threaded applications are built around the "one user, one thread" concept: once a user or chore needs to be handled, a thread is allocated to the task. Every extra thread increases the load on the scheduler, leading up to the point where all processing is spent trying to determine which thread should run at this moment. Call this "scheduler saturation."
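A sketch of the alternative, using Python's standard thread pool (the task count and trivial handler are made up for illustration): instead of one thread per user, tasks are queued to a pool sized near the core count, so the scheduler never has to juggle thousands of runnable threads:

```python
import concurrent.futures
import os

# Stand-in for real per-user work.
def handle(task_id):
    return task_id * 2

# 10,000 tasks, but only ~one worker thread per core. The pool's
# internal queue absorbs the backlog instead of the OS scheduler.
tasks = range(10_000)
workers = os.cpu_count() or 4
with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
    results = list(pool.map(handle, tasks))

print(len(results))  # all 10,000 tasks handled by `workers` threads
```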

Windows (the multi-threaded engine, not 95/98/Me etc) has a mechanism called I/O Completion Ports, for which one thread per processor is recommended for best performance. IOCP-based applications are usually tremendously fast, though, as always, the bottlenecks then appear in other places, such as running out of certain types of OS memory or waiting on the communications medium.

You can search for IOCP here at SO, it has its own tag.

Olof Forshell
  • 3,169
  • 22
  • 28
0

I would add:

  1. The more threads, the smaller their share of CPU cache. A typical modern CPU might have 3 levels of cache: L1, L2 and L3. L1 might be private to each core, but L2 and L3 might be shared between cores on the die. So a single thread can use the entire L2 and L3, but with many threads you get many more cache misses, depending on the access profile of your algorithm.
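A back-of-envelope sketch of that effect (the 8 MiB L3 size is an assumed figure, not measured):

```python
# Per-thread share of a shared last-level cache.
L3_BYTES = 8 * 1024 * 1024   # assumed 8 MiB shared L3

def per_thread_share(cache_bytes, active_threads):
    return cache_bytes // active_threads

# One thread gets the whole 8 MiB; eight threads get 1 MiB each,
# so a working set that fit in cache before may now thrash.
print(per_thread_share(L3_BYTES, 1), per_thread_share(L3_BYTES, 8))
```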

See also:

many-core CPU's: Programming techniques to avoid disappointing scalability

Community
  • 1
  • 1
Tim Cooper
  • 10,023
  • 5
  • 61
  • 77
0

Scaling could also be limited by the fixed maximum bandwidth of main memory: once your program has saturated the memory bandwidth, creating more threads cannot create more available bandwidth. Whether this applies depends on your specific application - whether it is memory-bound or compute-bound; see the roofline model.
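The roofline model itself is simple enough to sketch in a few lines (the peak compute and bandwidth numbers below are assumed, not taken from any real machine):

```python
# Roofline model: attainable throughput is the lesser of peak compute
# and memory bandwidth times arithmetic intensity (FLOPs per byte).
PEAK_GFLOPS = 100.0   # assumed peak compute (GFLOP/s)
PEAK_BW_GBS = 25.0    # assumed peak memory bandwidth (GB/s)

def roofline(flops_per_byte):
    return min(PEAK_GFLOPS, PEAK_BW_GBS * flops_per_byte)

# A streaming kernel at 0.25 FLOP/byte is memory-bound: 6.25 GFLOP/s
# no matter how many threads you run. At 8 FLOP/byte it hits the
# compute roof of 100 GFLOP/s instead.
print(roofline(0.25), roofline(8.0))
```

For a kernel left of the ridge point, adding threads only spreads the same bandwidth thinner.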

Zhaoyang
  • 49
  • 8