What is the best way to determine the number of threads to fire off in a machine with n cores? (C++)

Question

Suppose I have a vector<int> with 10,000,000 (10 million) elements, and that my workstation has four cores. There is a function, called ThrFunc, that operates on an integer. Assume that the runtime of ThrFunc for each integer in the vector<int> is roughly the same.

How should I determine the optimal number of threads to fire off? Is the answer as simple as the number of elements divided by the number of cores? Or is there a more subtle computation?

Editing to provide extra information

No need for blocking; each function invocation needs only read-only access

That would be a lot of threads! I think you meant the number of cores, right? — Sergey Kalinichenko, Jan 17 '12 at 02:08
Assuming that all of the operations on the integers can happen completely concurrently, you simply divide by the # of cores. It is much harder to estimate when the work can't be done concurrently. — Hunter McMillen, Jan 17 '12 at 02:08
Are these threads doing any blocking I/O or any other blocking operation, such as network communication or database access? If not, then the optimal number of threads is likely N, the number of cores. In your case, 4. Otherwise, 2N or 3N would be worth experimenting with: while one thread is doing I/O, another thread can do work. — selbie, Jan 17 '12 at 02:09
10M divided by 4 is 2.5 million. You'll be out of memory if you try to create that many threads. But your PC will likely come to a standing halt long before then. :) — selbie, Jan 17 '12 at 02:12
See also http://stackoverflow.com/questions/481970/how-many-threads-is-too-many/481979#481979 — paxdiablo, Jan 17 '12 at 02:13
@dasblinkenlight, yes I did mean the number of cores. Sorry. — Shredderroy, Jan 17 '12 at 02:15
One simple question: do you plan to be doing anything else on your machine while it's churning ? Will there be many services running ? In this case, you might want a core available for those. — Matthieu M., Jan 17 '12 at 07:39

score 25 · Accepted Answer · answered Jan 17 '12 at 02:10

The optimal number of threads is likely to be either the number of cores in your machine or the number of cores times two.

In more abstract terms, you want the highest possible throughput. Getting the highest throughput requires the fewest contention points between the threads (since the original problem is trivially parallelizable). The number of contention points is likely to be the number of threads sharing a core or twice that, since a core can run either one or two logical threads (two with hyperthreading).

If your workload makes use of a resource of which you have fewer than four available (ALUs on Bulldozer? Hard disk access?) then the number of threads you should create will be limited by that.

The best way to find out the correct answer is, as with all hardware questions, to test and find out.
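A minimal sketch of what "test and find out" could look like. Everything here is an assumption for illustration: the `ThrFunc` body is a cheap stand-in for the real function, and `run_with_threads` and `time_ms` are hypothetical helper names.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for the question's ThrFunc: read-only and CPU-bound.
inline std::uint64_t ThrFunc(int x) {
    const std::uint64_t v = static_cast<std::uint64_t>(x);
    return v * v + 1;
}

// Splits data into n_threads contiguous chunks, applies ThrFunc to every
// element, and returns the sum of the results so the work has a visible effect.
std::uint64_t run_with_threads(const std::vector<int>& data, unsigned n_threads) {
    n_threads = std::max(1u, n_threads);
    std::vector<std::uint64_t> partial(n_threads, 0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + n_threads - 1) / n_threads;
    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&data, &partial, chunk, t] {
            const std::size_t begin = std::min(data.size(), t * chunk);
            const std::size_t end = std::min(data.size(), begin + chunk);
            std::uint64_t local = 0;  // accumulate locally, write once at the end
            for (std::size_t i = begin; i < end; ++i) local += ThrFunc(data[i]);
            partial[t] = local;
        });
    }
    for (auto& w : workers) w.join();
    std::uint64_t total = 0;
    for (auto p : partial) total += p;
    return total;
}

// Wall-clock time in milliseconds for one run with the given thread count.
double time_ms(const std::vector<int>& data, unsigned n_threads) {
    const auto start = std::chrono::steady_clock::now();
    volatile std::uint64_t sink = run_with_threads(data, n_threads);
    (void)sink;
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling `time_ms(data, n)` for n from 1 up to about twice the core count, several times each, and comparing the best time per n gives a rough picture for one particular machine and workload.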

If your calculations will be using the same data on each thread it'd probably be best to ignore hyperthreading, or even disable it completely. The data for both threads will likely be cached quite quickly, thus neither will stall, thus HT will never have time to actually do anything. — edA-qa mort-ora-y, Jan 17 '12 at 08:25

score 12 · Answer 2 · edited May 23 '17 at 12:16

Borealid's answer includes test and find out, which is impossible to beat as advice goes.

But there's perhaps more to testing this than you might think: you want your threads to avoid contention for data wherever possible. If the data is entirely read-only, then you might see best performance if your threads are accessing "similar" data -- making sure to walk through the data in small blocks at a time, so each thread is accessing data from the same pages over and over again. If the data is completely read-only, then there is no problem if each core gets its own copy of the cache lines. (Though this might not make the most use of each core's cache.)

If the data is in any way modified, then you will see significant performance enhancements if you keep the threads away from each other, by a lot. Most caches store data along cache lines, and you desperately want to keep each cache line from bouncing among CPUs for good performance. In that case, you might want to keep the different threads running on data that is actually far apart to avoid ever running into each other.

So: if you're updating the data while working on it, I'd recommend having T = N or 2*N threads of execution (for N cores), with thread m starting at index (SIZE/T)*m, for m from 0 through T-1. (0, 1000, 2000, 3000, for four threads and 4000 data objects.) This will give you the best chance of feeding different cache lines to each core and allowing updates to proceed without cache line bouncing:

+--------------+---------------+--------------+---------------+--- ...
| first thread | second thread | third thread | fourth thread | first ...
+--------------+---------------+--------------+---------------+--- ...
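The block layout in the diagram could be sketched as follows. The names `block_range` and `update_in_blocks` are hypothetical, and the `+= 1` is only a placeholder for whatever update the real code performs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Half-open index range [first, second) owned by thread m out of n_threads.
// The first (size % n_threads) threads get one extra element.
std::pair<std::size_t, std::size_t> block_range(std::size_t size,
                                                std::size_t n_threads,
                                                std::size_t m) {
    assert(n_threads > 0 && m < n_threads);
    const std::size_t base = size / n_threads;
    const std::size_t extra = size % n_threads;
    const std::size_t begin = m * base + std::min(m, extra);
    const std::size_t end = begin + base + (m < extra ? 1 : 0);
    return {begin, end};
}

// Updates every element in place; each thread writes only to its own block.
void update_in_blocks(std::vector<int>& data, std::size_t n_threads) {
    std::vector<std::thread> workers;
    for (std::size_t m = 0; m < n_threads; ++m) {
        const auto range = block_range(data.size(), n_threads, m);
        workers.emplace_back([&data, range] {
            for (std::size_t i = range.first; i < range.second; ++i) {
                data[i] += 1;  // placeholder update (assumption)
            }
        });
    }
    for (auto& w : workers) w.join();
}
```

With this layout two threads can only touch the same cache line at the boundary between adjacent blocks, which is the property the answer is after.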

If you're not updating the data while working on it, you might wish to start N or 2*N threads of execution (for N cores), starting them at 0, 1, 2, 3, etc., and moving each one forward by N or 2*N elements with each iteration. This will allow the cache system to fetch each page from memory once, populate the CPU caches with nearly identical data, and hopefully keep each core populated with fresh data.

+-----------------------------------------------------+
| 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 ... |
+-----------------------------------------------------+
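The interleaved layout for read-only data might look like this sketch. `strided_sum` is a hypothetical name, and summing the elements stands in for whatever read-only work the real function does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Thread m reads elements m, m + T, m + 2T, ... where T is the thread count.
// The data is never written, so the threads need no synchronization.
std::int64_t strided_sum(const std::vector<int>& data, std::size_t n_threads) {
    assert(n_threads > 0);
    std::vector<std::int64_t> partial(n_threads, 0);
    std::vector<std::thread> workers;
    for (std::size_t m = 0; m < n_threads; ++m) {
        workers.emplace_back([&data, &partial, n_threads, m] {
            std::int64_t local = 0;  // keep the running total off the shared vector
            for (std::size_t i = m; i < data.size(); i += n_threads) {
                local += data[i];
            }
            partial[m] = local;
        });
    }
    for (auto& w : workers) w.join();
    std::int64_t total = 0;
    for (auto p : partial) total += p;
    return total;
}
```

Whether this beats contiguous blocks for a given workload is worth measuring rather than assuming; both layouts are cheap to try.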

I also recommend using sched_setaffinity(2) directly in your code to force the different threads to their own processors. In my experience, Linux aims to keep each thread on its original processor so much it will not migrate tasks to other cores that are otherwise idle.
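A minimal, Linux-only sketch of pinning with `sched_setaffinity(2)`. The helper names `pin_current_thread_to_cpu` and `first_allowed_cpu` are hypothetical, and the code assumes glibc with `_GNU_SOURCE` defined (g++ does this by default). Each worker thread would call the pin helper once at startup with its own CPU index.

```cpp
#include <sched.h>  // Linux-specific: sched_setaffinity, cpu_set_t, CPU_* macros

// Pins the calling thread to the given CPU index. Returns true on success.
bool pin_current_thread_to_cpu(int cpu) {
    if (cpu < 0 || cpu >= CPU_SETSIZE) return false;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // A pid of 0 means "the calling thread".
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Returns the index of the first CPU the calling thread may run on, or -1.
int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) return cpu;
    }
    return -1;
}
```

Checking the allowed set first matters in containers or under `taskset`, where CPU 0 may not be available to the process. On Windows a different call is needed, for example `SetThreadAffinityMask`.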

Thanks a lot for your explanations. About the last sentence: Does it matter if I am on Windows 7 or Windows Server 2008 R2? — Shredderroy, Jan 17 '12 at 02:59
@Shredderroy: it matters in that the `sched_setaffinity(2)` is Unix (or is it Linux ?) specific, on Windows it will be a different function. — Matthieu M., Jan 17 '12 at 07:38
@Shredderroy, Matthieu is correct; Windows may do a better job balancing tasks among CPUs than Linux anyway. Test test test. :) — sarnold, Jan 18 '12 at 00:05

score 5 · Answer 3 · answered Jan 17 '12 at 02:10

Assuming ThrFunc is CPU-bound, you probably want one thread per core, with the elements divided between them.

If there's an I/O element to the function then the answer is more complicated, because you can have one or more threads per core waiting for I/O while another is executing. Do some tests and see what happens.

Andrew Cooper

Assuming you don't want to do anything else with your machine of course :-) – paxdiablo Jan 17 '12 at 02:12
@paxdiablo - Of course, although the OS will give some CPU time to other processes. – Andrew Cooper Jan 17 '12 at 02:14

score 4 · Answer 4 · edited May 23 '17 at 12:01

I agree with the previous comments. You should run tests to determine what number yields the best performance. However, this will only yield the best performance for the particular system you're optimizing for. In most scenarios, your program will be run on other people's machines, on the architecture of which you should not make too many assumptions.

A good way to numerically determine the number of threads to start would be to use

std::thread::hardware_concurrency()

This is part of C++11 and should yield the number of logical cores in the current system. The number of logical cores is either the number of physical cores, if the processor does not support hardware threads (i.e., HyperThreading), or the number of hardware threads.
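A small sketch of using it defensively. The standard permits `std::thread::hardware_concurrency()` to return 0 when the value cannot be determined, so a fallback is needed; `pick_thread_count` and its `fallback` parameter are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>

// Chooses a worker count from the reported number of logical cores.
// Uses `fallback` when the runtime reports 0 (value not computable) and
// never returns more threads than there are items, or fewer than 1.
unsigned pick_thread_count(std::size_t n_items, unsigned fallback = 1) {
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = fallback;
    if (n_items < n) n = static_cast<unsigned>(n_items);
    return std::max(1u, n);
}
```

For the question's 10 million elements on a four-core machine without hardware threads, this would be expected to return 4, but the value is only a hint and still worth checking against measurements.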

There's also a Boost-function that does the same, see Programmatically find the number of cores on a machine.

score 2 · Answer 5 · answered Jan 17 '12 at 02:12

The optimal number of threads should equal the number of cores. In that situation the computation capacity of each core will be fully utilized, provided the computation on each element is independent.

ciphor

Olof Forshell · Answer 6 · 2013-09-09T10:59:04.860

The optimal number of cores (threads) will probably be determined by when you achieve saturation of the memory system (caches and RAM). Another factor that could come into play is that of inter-core locking (locking a memory area that other cores might want to access, updating it and then unlocking it) and how efficient it is (how long the lock is in place and how often it is locked/unlocked).

A single core running generic software whose code and data are not optimized for multi-core will come close to saturating memory all by itself. Adding more cores will, in such a scenario, result in a slower application.

So unless your code economizes heavily on memory accesses I'd guess the answer to your question is one (1).

aderchox · Answer 7 · 2021-04-20T07:49:40.237

I've found a real-world example that I'll put here for those who want a less technical, more intuitive answer:

Having multiple threads per core is like having two queues in an airport for each scanner (which people in both queues eventually have to pass through).

Two people at a time can put their baggage on the conveyor belt, but only one at a time can pass through the scanner. Now at this point, obviously there's a contention point at the entrance of the scanner, but what happens in reality is that most of the time both queues function very well.

In this example, the queues represent threads and the scanner represents the main functional units of a core. As a general rule of thumb, a core running two threads does roughly the work of 1.25 cores, i.e., the second thread is not like having an entire new core. So if the task is CPU-bound, a thread count slightly over the number of available processors is probably best.

But notice that if the task is IO-Bound, where threads will be spending most of their time waiting for external resources such as database connections, file systems, or other external sources of data, then you can assign (many) more threads than the number of available processors.
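For the I/O-bound case, one commonly cited rule of thumb (not from this answer, so treat it as an assumption) is threads ≈ cores × (1 + wait time / compute time). A sketch with the hypothetical name `estimate_thread_count`:

```cpp
#include <cmath>

// Rule-of-thumb estimate: cores * (1 + wait_time / compute_time).
// wait_time and compute_time are average per-task times in the same unit;
// they have to be measured, and the result is only a starting point for testing.
unsigned estimate_thread_count(unsigned cores, double wait_time, double compute_time) {
    if (cores == 0) cores = 1;
    // With no usable measurements, fall back to one thread per core.
    if (compute_time <= 0.0 || wait_time < 0.0) return cores;
    const double estimate = cores * (1.0 + wait_time / compute_time);
    return static_cast<unsigned>(std::lround(estimate));
}
```

For a purely CPU-bound task (wait time 0) this reduces to one thread per core; a task that waits nine times as long as it computes would suggest ten threads per core.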


This seems like a comment on the other answers, rather than an answer of its own. — Sneftel, Apr 16 '21 at 06:06
