Performance of multi-threading exceeding cores

Question

If I have a process that starts X threads, will there ever be a performance gain from having X higher than the number of CPU cores (assuming all the threads work synchronously, without async calls to storage/network)?

E.g., if I have a two-core CPU, will I just slow down the application by starting 3+ constantly working threads?

score 2 · Accepted Answer · answered May 24 '17 at 14:25

It really depends on what your code does; the question as asked is too broad.

Having more threads than cores might speed up the program, for example if some of the threads sleep or block on a lock. In this case, the OS scheduler can run a different thread while the other thread is sleeping.
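To make the blocking case concrete, here is a minimal sketch in Python. It assumes `time.sleep` is an acceptable stand-in for a thread that waits on a lock, a socket, or a disk; the names `blocking_task` and `run_with_threads` are made up for this illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_task(seconds: float) -> None:
    # Stand-in for a thread that waits on a lock, a socket, or a disk read.
    time.sleep(seconds)


def run_with_threads(num_threads: int, num_tasks: int = 16, wait: float = 0.05) -> float:
    """Run num_tasks blocking tasks on num_threads threads; return wall-clock seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # list() forces every task to finish before the timer stops.
        list(pool.map(blocking_task, [wait] * num_tasks))
    return time.perf_counter() - start


if __name__ == "__main__":
    for n in (1, 2, 4, 16):
        print(f"{n:2d} threads: {run_with_threads(n):.2f} s")
```

Because the tasks spend their time waiting rather than computing, the wall-clock time should keep dropping as the thread count grows past the core count. The exact numbers depend on the machine, so measure rather than trust the sketch.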

Having more threads than cores may also increase the program's execution time, because the OS scheduler has to do more work to switch between threads, and that context switching can be an expensive operation.

As always, benchmarking your application with different numbers of threads is the best way to find the maximum performance. There are also algorithms (like hill climbing) that can help the application tune the number of threads at runtime.
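As a rough illustration of the hill-climbing idea, the sketch below greedily moves the thread count toward whichever neighbouring value measures better and stops at a local optimum. It is a toy, not any particular runtime's implementation. The `measure` callback is hypothetical (in a real application it would run the workload for a while with `n` threads and return the observed throughput), and `synthetic_throughput` is a made-up curve used only so the example is deterministic.

```python
from typing import Callable


def hill_climb_threads(measure: Callable[[int], float],
                       start: int = 1,
                       max_threads: int = 64) -> int:
    """Return a locally best thread count found by greedy hill climbing.

    measure(n) is a hypothetical callback returning throughput
    (higher is better) observed while running with n threads.
    """
    current = max(1, min(start, max_threads))
    best = measure(current)
    while True:
        # Probe the neighbouring thread counts that are within bounds.
        neighbours = [n for n in (current - 1, current + 1) if 1 <= n <= max_threads]
        if not neighbours:
            return current
        scores = {n: measure(n) for n in neighbours}
        candidate = max(scores, key=scores.get)
        if scores[candidate] <= best:
            return current  # no neighbour is better: local optimum
        current, best = candidate, scores[candidate]


def synthetic_throughput(n: int) -> float:
    # Made-up curve: throughput rises up to 6 threads, then contention costs more.
    return n if n <= 6 else 12 - n


if __name__ == "__main__":
    print(hill_climb_threads(synthetic_throughput))
```

Real measurements are noisy, so a practical tuner would probably average several samples per thread count before deciding which way to move.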

answered May 24 '17 at 14:25

David Haim

I was thinking about the simple scenario where all threads have instructions to actually execute, but you also have a good point about locks that I did not consider. Should they have just instructions to execute, would it be then true that more means worse? – user4388177 May 24 '17 at 14:28
@user4388177 If all your threads are 100% CPU-busy then yes: the more threads you have, the more context switches the OS has to do and the more cache misses you get, which usually leads to performance degradation. – David Haim May 24 '17 at 14:31
But as they might be waiting for external resources, sleeping, or waiting for locks, it could very well be that more threads improve performance, as the ones not executing instructions can be set aside by the machine while it works on something else. Am I correct? – user4388177 May 25 '17 at 13:48

score 2 · Answer 2 · answered May 24 '17 at 14:29

It is possible. Both Intel and AMD currently implement forms of SMT (simultaneous multithreading) in their CPUs. SMT exists because, in general, a single thread of execution may not be able to exploit 100% of a core's computing resources. This happens because modern CPUs execute instructions in multiple pipelined steps, so that the clock frequency can be increased (less work gets done in every cycle, so you can do more cycles). The downside of this approach is that, if you have two consecutive instructions A and B, with the latter depending on the result of the former, you may have to wait some clock cycles without doing anything, just waiting for instruction A to complete. So they came up with SMT, which allows the CPU to interleave instructions from two different threads/processes on the same pipeline in order to fill such gaps.

Note: it is not exactly like this, CPUs don't just wait. They try to guess the result of the first operation and execute the second assuming that result. If their guess is wrong, they cancel the pending instructions and start over. Also, they have some feedback circuits that allow tighter execution of interdependent instructions. And nowadays branch predictors are surprisingly good. Things get better for the pipeline if you can just fill gaps with instructions from some other process, rather than going with a guess, but this potentially halves the amount of cache each executing thread can use.

Thanks, you all gave me good answers and each of them has some useful extra information, I wish I could choose all of them as correct. I will pick the earliest one. – user4388177 May 25 '17 at 13:54

Andriy Berestovskyy · Answer 3 · 2017-05-24T14:29:59.690

1

It makes sense to run more threads if your threads make read/write/send/recv syscalls or similar, or sleep on locks, etc.

If your threads are pure computation threads, adding more of them than you have cores will slow down the system because of context switches.

If you still need more threads by design, you might want to look into cooperative multitasking. Both Windows and Linux have APIs for that, and switching between such tasks is faster than a context switch between threads. On Windows they are called fibers:

https://msdn.microsoft.com/en-us/library/windows/desktop/ms682661(v=vs.85).aspx

On Linux it is the makecontext()/getcontext()/swapcontext() set of functions:

http://man7.org/linux/man-pages/man3/makecontext.3.html
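To show the general shape of cooperative multitasking without tying it to either OS API, here is a small sketch that uses Python generators as a stand-in for fibers or `swapcontext()`. That substitution is an assumption made for illustration, and `worker` and `run_round_robin` are hypothetical names. The point it shares with the APIs above is that tasks switch only at explicit yield points, in user space, on a single OS thread.

```python
from collections import deque
from typing import Deque, Generator, List

Task = Generator[None, None, None]


def worker(name: str, steps: int, log: List[str]) -> Task:
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # voluntarily hand control back to the scheduler


def run_round_robin(tasks: List[Task]) -> None:
    ready: Deque[Task] = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)        # resume the task until its next yield
        except StopIteration:
            continue          # task finished; drop it
        ready.append(task)    # otherwise queue it to run again


if __name__ == "__main__":
    log: List[str] = []
    run_round_robin([worker("A", 2, log), worker("B", 3, log)])
    print(log)  # ['A:0', 'B:0', 'A:1', 'B:1', 'B:2']
```

Nothing preempts a task here: a worker that never yields would starve the others, which is the usual trade-off of cooperative scheduling.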

edited May 24 '17 at 14:29

answered May 24 '17 at 14:26

Andriy Berestovskyy

Thanks, you all gave me good answers and each of them has some useful extra information, I wish I could choose all of them as correct. I will pick the earliest one. – user4388177 May 25 '17 at 13:53

score 1 · Answer 4 · answered May 24 '17 at 15:15

This question might help you: "Optimal number of threads per core".

In that thread I wrote an answer describing a scenario where having a higher number of threads than the available number of cores boosts performance.

answered May 24 '17 at 15:15

someneat

Thanks, you all gave me good answers and each of them has some useful extra information, I wish I could choose all of them as correct. I will pick the earliest one. – user4388177 May 25 '17 at 13:54
