Threads configuration based on no. of CPU-cores

Question

Scenario: I have a sample application and three different system configurations:

- 2-core processor, 2 GB RAM, 60 GB HDD
- 4-core processor, 4 GB RAM, 80 GB HDD
- 8-core processor, 8 GB RAM, 120 GB HDD

In order to effectively exploit the H/W capabilities for my application, I wish to configure the no. of threads at the application level. However, I wish to do this only after a thorough understanding of system capabilities.

Is there some way (a system, method, or tool) to determine a system's capacity in terms of the maximum and minimum number of threads it can service optimally, without any loss in efficiency or performance? With that, I could configure only those values for my application that do full justice to the respective hardware configuration and achieve the best performance.

Edited1: Could anyone please suggest any reading on how to set a baseline for a particular h/w config?

Edited2: To be more direct, I wish to learn about any resource or write-up that I can read to gain some understanding of CPU management of threads at a general/holistic level.

I want to find the optimal values for Minimum no. of Threads / Maximum no. of Threads for the sample application based on the above mentioned system configuration to achieve best performance and full resource utilization. — Santosh, Dec 12 '12 at 07:19
If you don't want to go with the 'heuristic' answers, all that is left is experimental design. Try some settings, and you will certainly find local maxima/minima. — Felix Dobslaw, Dec 19 '12 at 19:05

assylias · Accepted Answer · 2013-08-30T08:57:11.430

The optimal number of threads to use depends on several factors, but mostly on the number of available processors and how CPU-intensive your tasks are. Java Concurrency in Practice proposes the following formula to estimate the optimal number of threads:

N_threads = N_cpu * U_cpu * (1 + W / C)

Where:

- N_threads is the optimal number of threads
- N_cpu is the number of processors, which you can obtain from Runtime.getRuntime().availableProcessors();
- U_cpu is the target CPU utilization (1 if you want to use the full available resources)
- W / C is the ratio of wait time to compute time (0 for a CPU-bound task, maybe 10 or 100 for slow I/O tasks)

So for example, in a CPU-bound scenario, you would have as many threads as CPUs (some advocate using that number + 1, but I have never seen it make a significant difference).

For a slow I/O process, for example a web crawler, W/C could be 10 if downloading a page is 10 times slower than processing it. The formula then gives 11 threads per processor at full utilization, so on an 8-core machine something on the order of 100 threads would be useful.
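The formula above can be sketched in Java as follows. This is a minimal illustration, not code from Java Concurrency in Practice: the class name `ThreadSizing`, the method name `optimalThreads`, and the choice to round to the nearest integer are assumptions made for the example.

```java
final class ThreadSizing {

    /**
     * Evaluates N_threads = N_cpu * U_cpu * (1 + W / C).
     * Rounding to the nearest integer (and never below 1) is an assumption of this sketch.
     */
    static int optimalThreads(int nCpu, double targetUtilization, double waitToComputeRatio) {
        if (nCpu < 1 || targetUtilization <= 0 || targetUtilization > 1 || waitToComputeRatio < 0) {
            throw new IllegalArgumentException("invalid sizing parameters");
        }
        double estimate = nCpu * targetUtilization * (1 + waitToComputeRatio);
        return Math.max(1, (int) Math.round(estimate));
    }

    public static void main(String[] args) {
        int nCpu = Runtime.getRuntime().availableProcessors();
        // W/C = 0 models a CPU-bound task, W/C = 10 a task that waits ten times longer than it computes.
        System.out.println("CPU-bound (W/C = 0):  " + optimalThreads(nCpu, 1.0, 0.0));
        System.out.println("I/O-heavy (W/C = 10): " + optimalThreads(nCpu, 1.0, 10.0));
    }
}
```

For a CPU-bound task at full utilization, the sketch simply returns the core count, so the 2-, 4- and 8-core machines in the question would get 2, 4 and 8 threads respectively.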

Note however that there is an upper bound in practice (using 10,000 threads will generally not speed things up, and you would probably get an OutOfMemoryError before you can start them all anyway with normal memory settings).

This is probably the best estimate you can get if you don't know anything about the environment in which your application runs. Profiling your application in production might enable you to fine-tune the settings.

Although not strictly related, you might also be interested in Amdahl's law, which gives the theoretical maximum speed-up you can expect from parallelising a program.
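As a rough illustration of Amdahl's law, the sketch below evaluates the standard form speedup = 1 / ((1 - p) + p / n), where p is the fraction of the program that can run in parallel and n is the number of processors. The class name `Amdahl` and the value p = 0.9 used in `main` are assumptions chosen for the example, not measurements.

```java
final class Amdahl {

    /** Theoretical speed-up for a program whose parallelisable fraction is p, run on n processors. */
    static double speedup(double parallelFraction, int processors) {
        if (parallelFraction < 0 || parallelFraction > 1 || processors < 1) {
            throw new IllegalArgumentException("need 0 <= p <= 1 and n >= 1");
        }
        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / processors);
    }

    public static void main(String[] args) {
        double p = 0.9; // assumed parallel fraction, for illustration only
        for (int cores : new int[] {2, 4, 8}) {
            System.out.printf("%d cores: at most %.2fx faster%n", cores, speedup(p, cores));
        }
    }
}
```

The serial part (1 - p) caps the achievable speed-up no matter how many threads are added, which is one reason a larger thread count does not automatically help.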

How do I get an estimate of W/C? Do I need to find the exact time I/O vs Compute is taking? — AgentX, Aug 23 '17 at 15:54
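One possible way to get a rough W/C figure, sketched here under assumptions, is to run a representative task on one thread and compare its wall-clock time with the CPU time the thread actually consumed: the difference approximates the wait time. The class name `WaitComputeRatio` and the placeholder `sampleTask` (a sleep standing in for blocking I/O, followed by a loop) are hypothetical; the measurement relies on `ThreadMXBean.getCurrentThreadCpuTime()`, which not every JVM supports.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

final class WaitComputeRatio {

    private static volatile long sink; // keeps the JIT from discarding the sample computation

    /** Runs the task on the calling thread and returns (wall time - CPU time) / CPU time. */
    static double estimate(Runnable task) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        if (!threads.isCurrentThreadCpuTimeSupported() || !threads.isThreadCpuTimeEnabled()) {
            throw new UnsupportedOperationException("per-thread CPU time is not available on this JVM");
        }
        long cpuStart = threads.getCurrentThreadCpuTime();
        long wallStart = System.nanoTime();
        task.run();
        long wall = System.nanoTime() - wallStart;
        long cpu = threads.getCurrentThreadCpuTime() - cpuStart;
        long wait = Math.max(0, wall - cpu);      // time spent blocked or descheduled
        return (double) wait / Math.max(1, cpu);  // guard against a zero CPU reading
    }

    /** Placeholder workload: the sleep stands in for blocking I/O, the loop for computation. */
    static void sampleTask() {
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        long sum = 0;
        for (int i = 0; i < 20_000_000; i++) {
            sum += i % 7;
        }
        sink = sum;
    }

    public static void main(String[] args) {
        System.out.printf("estimated W/C = %.2f%n", estimate(WaitComputeRatio::sampleTask));
    }
}
```

An exact figure is not needed: the result feeds an estimate anyway, so averaging over a handful of representative tasks is likely enough. Note that time spent descheduled on a busy machine is counted as wait time by this approach.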

score 16 · Answer 2 · answered Dec 24 '12 at 18:29

My recommendation is to provide config and command-line switches for assigning the number of threads per-machine. Use a heuristic based on Runtime.getRuntime().availableProcessors() as indicated by other answers here, in cases where the user/admin hasn't explicitly configured the application differently. I strongly recommend against exclusive heuristic-based thread-to-core guessing, for several reasons:

- Most modern hardware is moving toward increasingly ambiguous types of 'hardware threads': SMT models such as Intel's Hyperthreading and AMD's Compute Modules complicate formulas (details below), and querying this info at runtime can be difficult.
- Most modern hardware has a turbo feature that scales speed based on active cores and ambient temperatures. As turbo tech improves, the range of speed (GHz) grows. Some recent Intel and AMD chips can range from 2.6 GHz (all cores active) to 3.6 GHz (single/dual core active), which combined with SMT can mean each thread getting an effective 1.6 GHz to 2.0 GHz of throughput in the former case. There is currently no way to query this info at runtime.
- If you do not have a strong guarantee that your application will be the only process running on the target systems, then blindly consuming all CPU resources may not please the user or server admin (depending on whether the software is a user app or a server app).
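The recommendation to let explicit configuration override a core-count heuristic could look roughly like the sketch below. The switch `--threads=N`, the system property `app.threads`, and the class name `ThreadCountConfig` are hypothetical names invented for this example.

```java
final class ThreadCountConfig {

    static final String SWITCH_PREFIX = "--threads="; // hypothetical command-line switch
    static final String PROPERTY = "app.threads";     // hypothetical system property

    /** Resolution order: command-line switch, then system property, then the heuristic default. */
    static int resolve(String[] args, int availableProcessors) {
        for (String arg : args) {
            if (arg.startsWith(SWITCH_PREFIX)) {
                return parsePositive(arg.substring(SWITCH_PREFIX.length()));
            }
        }
        String property = System.getProperty(PROPERTY);
        if (property != null) {
            return parsePositive(property);
        }
        // Nothing configured: fall back to one thread per processor reported to the JVM.
        return Math.max(1, availableProcessors);
    }

    private static int parsePositive(String value) {
        int n = Integer.parseInt(value.trim());
        if (n < 1) {
            throw new IllegalArgumentException("thread count must be >= 1: " + value);
        }
        return n;
    }

    public static void main(String[] args) {
        int threads = resolve(args, Runtime.getRuntime().availableProcessors());
        System.out.println("using " + threads + " worker threads");
    }
}
```

Run with `--threads=2` or `-Dapp.threads=2` to override the default on a particular machine; with neither set, the heuristic applies.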

There is no robust way to know what's going on within the rest of the machine at run-time, short of replacing the entire operating system with your own home-rolled multitasking kernel. Your software can try to make educated guesses by querying processes and peeking at CPU loads and such, but doing so is complicated, its usefulness is limited to specific types of applications (of which yours may be one), and it usually benefits from or requires elevated or privileged access levels.

- Modern virus scanners nowadays work by setting a special priority flag provided by modern operating systems, e.g. they let the OS tell them when "the system is idle". The OS bases its decision on more than just CPU load: it also considers user input and multimedia flags that may have been set by movie players, etc. This is fine for mostly-idle tasks, but not useful for a CPU-intensive task such as yours.
- Distributed home computing apps (BOINC, Folding@Home, etc.) work by querying running processes and system CPU load periodically -- once every second or half-second perhaps. If load is detected on processes not belonging to the app for multiple queries in a row, then the app will suspend computation. Once the load goes low for some number of queries, it resumes. Multiple queries are required because the CPU load readouts are notorious for brief spikes. There are still caveats: (1) users are still encouraged to manually reconfigure BOINC to fit their machine's specs; (2) if BOINC is run without admin privileges, then it won't be aware of processes started by other users (including some service processes), so it may unfairly compete with those for CPU resources.
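The "several readings in a row" idea described for BOINC-style apps can be sketched as a small debouncing class. This is not BOINC's actual logic: the class name `LoadThrottle`, the threshold, and the sample count are assumptions, and `main` uses `OperatingSystemMXBean.getSystemLoadAverage()` as a stand-in load source, which includes this process's own load and returns a negative value on platforms where it is unavailable.

```java
import java.lang.management.ManagementFactory;

final class LoadThrottle {
    private final double threshold;
    private final int requiredSamples;
    private int busyStreak;
    private int idleStreak;
    private boolean suspended;

    LoadThrottle(double threshold, int requiredSamples) {
        this.threshold = threshold;
        this.requiredSamples = requiredSamples;
    }

    /** Feeds one load reading; returns true while background work should stay suspended. */
    boolean update(double load) {
        if (load > threshold) {
            busyStreak++;
            idleStreak = 0;
        } else {
            idleStreak++;
            busyStreak = 0;
        }
        // Only flip state after several consistent readings, so brief spikes are ignored.
        if (!suspended && busyStreak >= requiredSamples) {
            suspended = true;
        } else if (suspended && idleStreak >= requiredSamples) {
            suspended = false;
        }
        return suspended;
    }

    public static void main(String[] args) throws InterruptedException {
        LoadThrottle throttle = new LoadThrottle(0.75, 3); // assumed threshold and sample count
        int cores = Runtime.getRuntime().availableProcessors();
        for (int i = 0; i < 5; i++) {
            double loadAverage = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
            double perCore = loadAverage < 0 ? 0 : loadAverage / cores; // negative means "not available"
            System.out.println("load per core " + perCore + ", suspended = " + throttle.update(perCore));
            Thread.sleep(500);
        }
    }
}
```

Separating the decision logic from the load source keeps the debouncing testable and lets a more precise, platform-specific reading be swapped in later.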

Regarding SMT (HyperThreading, Compute Modules):

Most SMT units are reported as hardware cores or threads these days, which is usually not good because few applications perform optimally when scaled across every core on an SMT system. To make matters worse, querying whether a core is shared (SMT) or dedicated often fails to yield expected results. In some cases the OS itself simply doesn't know (Windows 7 being unaware of AMD Bulldozer's shared core design, for example). If you can get a reliable SMT count, then the rule of thumb is to count each SMT as half-a-thread for CPU-intensive tasks, and as a full thread for mostly-idle tasks. But in reality, the weight of the SMT depends on what sort of computation it is doing, and on the target architecture. Intel's and AMD's SMT implementations behave almost opposite to each other, for example -- Intel's is strong at running tasks loaded with integer and branching ops in parallel. AMD's is strong at running SIMD and memory ops in parallel.
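One reading of the half-a-thread rule of thumb is sketched below: each physical core counts as one thread, and each additional SMT sibling counts as half. This interpretation, the class name `SmtWeighting`, and the `threadsPerCore` parameter are assumptions. Since the Java runtime only reports logical processors, `threadsPerCore` would have to come from configuration or from the operator.

```java
final class SmtWeighting {

    /**
     * Thread count for CPU-intensive work: one per physical core plus half of the
     * extra SMT siblings (rounded down). threadsPerCore is supplied by the operator.
     */
    static int cpuBoundThreads(int logicalProcessors, int threadsPerCore) {
        if (logicalProcessors < 1 || threadsPerCore < 1) {
            throw new IllegalArgumentException("counts must be >= 1");
        }
        int physicalCores = Math.max(1, logicalProcessors / threadsPerCore);
        int smtSiblings = logicalProcessors - physicalCores;
        return Math.max(1, physicalCores + smtSiblings / 2);
    }

    /** Mostly-idle work: every logical processor counts as a full thread. */
    static int mostlyIdleThreads(int logicalProcessors) {
        return Math.max(1, logicalProcessors);
    }

    public static void main(String[] args) {
        int logical = Runtime.getRuntime().availableProcessors();
        int threadsPerCore = 2; // assumption: 2-way SMT; set to 1 when SMT is absent or disabled
        System.out.println("CPU-bound threads:   " + cpuBoundThreads(logical, threadsPerCore));
        System.out.println("mostly-idle threads: " + mostlyIdleThreads(logical));
    }
}
```

As the answer notes, the real weight depends on the workload and the architecture, so the result is only a starting point for measurement.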

Regarding Turbo Features:

Most CPUs these days have very effective built-in Turbo support that further lessens the value gained from scaling across all cores of the system. Worse, the turbo feature is sometimes based as much on the real temperature of the system as it is on CPU loads, so the cooling system of the tower itself affects the speed as much as the CPU specs do. On a particular AMD A10 (Bulldozer), for example, I observed it running at 3.7 GHz on two threads. It dropped to 3.5 GHz when a third thread was started, and to 3.4 GHz when a fourth was started. Since the chip has an integrated GPU as well, it dropped all the way to approx. 3.0 GHz when four threads plus the GPU were working (the A10 CPU internally gives priority to the GPU in high-load scenarios), but could still muster 3.6 GHz with 2 threads and the GPU active. Since my application used both CPU and GPU, this was a critical discovery. I was able to improve overall performance by limiting the process to two CPU-bound threads (the other two shared cores were still helpful: they served as GPU servicing threads, able to wake up and respond quickly to push new data to the GPU as needed).

... but at the same time, my application at 4x threads may have performed much better on a system with a higher-quality cooling device installed. It's all so very complicated.

Conclusion: There is no good answer, and because the field of CPU SMT/Turbo design keeps evolving, I doubt there will be a good answer anytime soon. Any decent heuristic you formulate today may very well not produce ideal results tomorrow. So my recommendation is: don't waste much time on it. Rough-guess something based on core counts that suits your local purposes well enough, allow it to be overridden by config/switch, and move on.

I like your answer, but would you change/extend anything after a decade? — Robert Gonciarz, May 26 '23 at 10:40

score 14 · Answer 3 · answered Dec 12 '12 at 07:56

You can get the number of processors available to the JVM like this:

Runtime.getRuntime().availableProcessors()

Unfortunately, calculating the optimal number of threads from the number of available processors is not trivial. It depends a lot on the characteristics of the application: for a CPU-bound application, having more threads than processors makes little sense, while for a mostly IO-bound application you might want to use more threads. You also need to take into account whether other resource-intensive processes are running on the system.

I think the best strategy would be to determine the optimal number of threads empirically for each of the hardware configurations, and then use these numbers in your application.
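An empirical sweep of this kind might be sketched as follows: run the same batch of work on fixed-size pools of increasing size and compare elapsed times. The class name `ThreadCountBenchmark` and the synthetic `unitOfWork` are placeholders; a real measurement would substitute a slice of the actual workload and repeat each run several times.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ThreadCountBenchmark {

    /** Placeholder CPU-bound unit of work; replace with a slice of the real workload. */
    static long unitOfWork(int seed) {
        long acc = seed;
        for (int i = 0; i < 200_000; i++) {
            acc = acc * 31 + i;
        }
        return acc;
    }

    /** Runs taskCount units on a fixed pool of the given size; returns elapsed nanoseconds. */
    static long timeRun(int threads, int taskCount) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (int i = 0; i < taskCount; i++) {
                final int seed = i;
                tasks.add(() -> unitOfWork(seed));
            }
            long start = System.nanoTime();
            for (Future<Long> result : pool.invokeAll(tasks)) {
                result.get(); // surfaces any task failure
            }
            return System.nanoTime() - start;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("benchmark interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("benchmark task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    /** Measures pool sizes 1..maxThreads and returns elapsed nanoseconds per pool size. */
    static Map<Integer, Long> sweep(int maxThreads, int taskCount) {
        timeRun(1, taskCount); // warm-up so JIT compilation does not skew the first measurement
        Map<Integer, Long> results = new LinkedHashMap<>();
        for (int threads = 1; threads <= maxThreads; threads++) {
            results.put(threads, timeRun(threads, taskCount));
        }
        return results;
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        sweep(2 * cores, 400).forEach((threads, nanos) ->
                System.out.printf("%2d threads: %d ms%n", threads, nanos / 1_000_000));
    }
}
```

Running this once on each target machine gives one baseline per hardware configuration; the pool size with the lowest stable time can then be stored as that machine's configured value.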

Gustav Grusell

Mine is a CPU-intensive process. Also, can I get any reading on how to set a baseline for a particular h/w config? Is there any way to find out whether a particular processor can use all its available resources, or whether some are blocked by other running software? – Santosh Dec 12 '12 at 11:45
@Santosh If it is CPU intensive, then using `availableProcessors()` number of threads should be close to optimal. – assylias Dec 12 '12 at 12:35
I usually add a small constant factor to pick up scheduling slop in case one of the threads gets blocked on IO or something... – Steven Schlansker Dec 14 '12 at 05:48
#Sharing link : Nice post on CPU-bound/IO-bound application - http://stackoverflow.com/questions/868568/cpu-bound-and-i-o-bound . – Santosh Dec 17 '12 at 07:05
As far as the question is concerned, the asker wants performance on a multicore machine. Runtime.getRuntime().availableProcessors() gives the cores available to the JVM, which is mostly equal to the number of cores, but the point is how to utilize the cores' power. That means giving as much work as is optimal to the CPUs and not letting them sit idle, which can be done if your app's thread count is equal to the number of cores assigned to the JVM. – Vaibs Dec 26 '12 at 09:26

score 4 · Answer 4 · answered Dec 25 '12 at 01:26

I agree with the other answers here that recommend a best-guess approach, and providing configuration for overriding the defaults.

In addition, if your application is particularly CPU-intensive, you may want to look into "pinning" your application to particular processors.

You don't say what your primary operating system is, or whether you're supporting multiple operating systems, but most have some way of doing this. Linux, for instance, has taskset.

A common approach is to avoid CPU 0 (which the OS typically uses for its own work), and to set your application's CPU affinity to a group of CPUs that are in the same socket.

Keeping the app's threads away from cpu 0 (and, if possible, away from other applications) often improves performance by reducing the amount of task switching.

Keeping the application on one socket can further increase performance by reducing cache invalidation as your app's threads switch among cpus.

As with everything else, this is highly dependent on the architecture of the machine that you are running on, as well as what other applications are running.

score 2 · Answer 5 · edited Dec 24 '12 at 15:23

Use the VisualVM tool to monitor threads. First, create the minimum number of threads in the program and observe its performance. Then increase the number of threads within the program and analyze its performance again. I hope this helps.

answered Dec 21 '12 at 13:00 by abishkar bhattarai · edited Dec 24 '12 at 15:23 by kuporific

score 1 · Answer 6 · answered Dec 25 '12 at 06:39

I use a Python script (PlatformWise on GitHub) to determine the number of cores (and memory, etc.) so that I can launch my Java application with optimum parameters and ergonomics.

It works like this: write a Python script that calls getNumberOfCPUCores() from the above script to get the number of cores, and getSystemMemoryInMB() to get the RAM. You can pass that information to your program via command-line arguments. Your program can then use the appropriate number of threads based on the number of cores.

score 1 · Answer 7 · answered Dec 26 '12 at 09:17

Creating threads at the application level is good: on a multicore processor, separate threads are executed on separate cores to enhance performance. So to utilize the cores' processing power, it is best practice to implement threading.

What I think:

- At any one time, only 1 thread of a program will execute on 1 core.
- The same application with 2 threads will execute in half the time on 2 cores.
- The same application with 4 threads will execute even faster on 4 cores.

So the application you are developing should have a threading level <= the number of cores.

Thread execution time is managed by the operating system and is a highly unpredictable activity. The CPU execution time a thread is given is known as a time slice or quantum. If we create more and more threads, the operating system spends a fraction of this time slice deciding which thread goes first, thus reducing the actual execution time each thread gets. In other words, each thread will do less work if a large number of threads are queued up.

Read this to learn how to actually utilize CPU cores. Fantastic content: csharp-codesamples.com/2009/03/threading-on-multi-core-cpus/

score 1 · Answer 8 · answered Sep 06 '17 at 04:25

Unfortunately, calculating the optimal number of threads from the number of available processors is not trivial. It depends a lot on the characteristics of the application: for a CPU-bound application, having more threads than processors makes little sense, while for a mostly IO-bound application you might want to use more threads. You also need to take into account whether other resource-intensive processes are running on the system.
