8

Let's say I have a task that involves processing 1 million sentences.

For each sentence, I need to do something with it, and it doesn't matter what particular order they are processed in.

In my Java program, I have a set of futures partitioned from my main chunk of work, with a Callable that defines the unit of work to be done on a chunk of sentences. I'm looking for a way to choose the optimal number of threads to work through the big block of sentences, and later to recombine the results of each thread.

What would be the maximum number of threads I could use that would give me optimal performance in terms of speed before I saw diminishing returns?

Also, why is the logic that allocating more threads means more can be done at once incorrect?

Adam Bronfin
  • 1,209
  • 3
  • 27
  • 43
  • Take a look at this earlier question http://stackoverflow.com/questions/1718465/optimal-number-of-threads-per-core – DHall Jun 10 '14 at 20:02

3 Answers

10

Theoretically, the optimal number of threads is the number of cores on your machine. If your cores are "hyper-threaded" (as Intel calls it), each core can run 2 threads, and in that case the optimal number of threads is double the number of cores. In practice, it can be difficult to find that optimal number, and it will likely vary each time you run the program.
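
For example, a minimal sketch of sizing a fixed pool to the core count (availableProcessors() reports logical cores, so it already accounts for hyper-threading):

  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;

  // Logical core count; on a hyper-threaded machine this is already
  // twice the number of physical cores.
  int cores = Runtime.getRuntime().availableProcessors();
  ExecutorService pool = Executors.newFixedThreadPool(cores);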

Also, why is the logic that allocating more threads means more can be done at once incorrect?

The assumption that allocating more threads leads to more work being done concurrently is false, because only 1 thread (or 2, if the core is hyper-threaded) can run at any single time on each core.

So assume I have a quad-core machine that is not hyper-threaded. In that case, I can run up to 4 threads concurrently, so my maximum throughput should be achieved with 4 threads. Say I try to run 8 threads on the same setup. The kernel would schedule these threads back and forth (by way of a context switch), blocking one thread in order to let another run. So, at most, the work of 4 threads can be done at any single time.

For more information on this, it is extremely helpful to look up how context switching works in the Linux kernel. That will provide you with all the information you ever wanted on this subject.

Also, note that there is a difference between "user-level threads" and "kernel-level threads". This is an important distinction if you research this topic further, but it is outside the scope of this question.

Mahdi
  • 3,188
  • 2
  • 20
  • 33
Rich E
  • 233
  • 2
  • 10
  • Why is it then that when I just performed the exact same computation using 2 multi-threaded tasks, allocating 50 threads for one task and 25 threads for the other, I got a completion time of 11 seconds, but when I chose 5 threads, about the number of cores on my machine, it ran in almost 30 seconds? – Adam Bronfin Jun 11 '14 at 19:33
  • 1
    I know you mentioned in another comment that you pull data from a network. In that case, the work these threads are doing is not 100% CPU-bound. Since it is getting data from a network, there will be some blocking going on. By that I mean: when a thread needs to pull data from the network, it has to wait for the data to arrive. While it is waiting, that thread is *blocked*. So, to get the optimal number of threads, it is best just to experiment. – Rich E Jun 11 '14 at 20:48
  • Since you already did some experimentation (you said 11 sec. vs. 30 sec.), I can explain those results. In the case that a thread is waiting on the network (and is thus blocked), the kernel will schedule another thread to do work while the first one waits. Then, once the data arrives, it will *context switch* back to that thread, and so on.. – Rich E Jun 11 '14 at 20:49
  • So then what about this type of multi-threading categorically allows me to use more threads than my # of cores? When would I only want to use the same # of threads as # of cores for optimal performance? Because here it was clear that the more threads the better (up to some point, I imagine). – Adam Bronfin Jun 11 '14 at 22:56
  • 2
    To sum it up nicely: When threads are 100% CPU-bound (meaning they only use the CPU and will not block waiting on an event), it is best to use the same number of threads as the number of cores. When threads *will* experience some blocking (in your case, due to getting data over a network), it is better to have more threads than cores (there is no "exact value"; experimentation must be used to find an optimal range), since when one thread is waiting on the network (blocked), the kernel can schedule another thread to work, and so on (see the sketch after these comments). – Rich E Jun 11 '14 at 23:05
  • 1
    Consider the situation where I have 2 cores (able to run 1 thread each) and I spawn 2 threads `t1` and `t2`. If `t1` becomes blocked, the CPU will only be performing the work of `t2` until `t1` becomes unblocked and can continue working. BUT if I had spawned 3 threads `t1`, `t2` and `t3` and `t1` becomes blocked, the kernel can schedule `t2` and `t3` to run. – Rich E Jun 11 '14 at 23:09
  • I hope I answered your questions. I would appreciate it if you could accept this answer if you believe this satisfactorily answered the question. – Rich E Jun 12 '14 at 16:37
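
A minimal sketch of the experiment described in the comments above (the workload is simulated; the exact numbers are hypothetical and will vary by machine):

  import java.util.concurrent.*;

  // Times how long nTasks tasks take to finish on a pool of nThreads.
  // Each task "blocks" for blockedMs (simulating network I/O) and then
  // burns CPU for roughly cpuMs.
  static long runExperiment(int nThreads, int nTasks, long blockedMs, long cpuMs)
      throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(nThreads);
    long start = System.nanoTime();
    for (int i = 0; i < nTasks; i++) {
      pool.submit(() -> {
        try {
          Thread.sleep(blockedMs);             // blocked: the core is free for others
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
        long spinUntil = System.nanoTime() + cpuMs * 1_000_000;
        while (System.nanoTime() < spinUntil); // busy-wait: the core is occupied
      });
    }
    pool.shutdown();
    pool.awaitTermination(Long.MAX_VALUE, TimeUnit.MILLISECONDS);
    return (System.nanoTime() - start) / 1_000_000;
  }

With blockedMs much larger than cpuMs (I/O-bound), adding threads beyond the core count keeps improving the completion time; with blockedMs of 0 (CPU-bound), the speedup flattens at the core count.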
5

Is your load I/O bound? I/O bound means the CPU spends most of its time waiting for I/O operations to complete. Adding more threads means sending more requests to the I/O subsystem or to a remote server. This may have positive effects, because requests to storage can be reordered and combined (scatter-gather), but only until you reach the maximum possible I/O bandwidth. Adding more threads may also have adverse effects, e.g. when more random I/O requests are executed on a conventional hard disk.

If your load is I/O bound, you can take various approaches to optimize the I/O operations. My first choice is to load data in bigger chunks and in a more streaming manner, if that is possible (as sketched below). The next thing is to use external index structures or databases if you have lots of point accesses, or to add more disks if you are simply missing bandwidth. Anyway, optimizing I/O is another broad topic...
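
A minimal sketch of streaming the input in one sequential pass (the file name and the process(...) method are placeholders, assuming one sentence per line):

  import java.io.IOException;
  import java.nio.file.Files;
  import java.nio.file.Paths;
  import java.util.stream.Stream;

  // Streams the file lazily in large buffered reads instead of issuing
  // many small random ones.
  try (Stream<String> sentences = Files.lines(Paths.get("sentences.txt"))) {
    sentences.forEach(s -> process(s));  // process(...) is a hypothetical placeholder
  } catch (IOException e) {
    e.printStackTrace();
  }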

Is your load CPU bound? This means the CPU power is the limiting factor for processing, not the I/O bandwidth. Optimizing your I/O subsystem makes no sense in this case; you need more or faster CPUs, and you need to distribute the load.

In your particular case, you can load all the data into memory, and then your load is solely CPU bound. For CPU-bound loads it is best to use a thread count identical to the number of CPU cores in your machine. Choosing the number of CPUs as the thread count is rather straightforward and obvious. It is also discussed in the question Optimal number of threads per core.

In practice, to execute the tasks in your Callable objects, use an ExecutorService constructed this way:

  import java.util.concurrent.*;

  // Size the pool to the number of available (logical) cores.
  int maxThreadCount = Runtime.getRuntime().availableProcessors();
  ExecutorService executor =
    new ThreadPoolExecutor(
      0, Math.max(1, maxThreadCount - 1),             // reserve one core for the caller
      1, TimeUnit.SECONDS,                            // idle threads expire after 1s
      new LinkedBlockingDeque<>(maxThreadCount * 2),  // small bounded queue
      Executors.defaultThreadFactory(),
      new ThreadPoolExecutor.CallerRunsPolicy());     // caller runs tasks on overflow

Now do the processing by adding your tasks and wait until everything is finished:

  while (moreToDo) {
    Callable c = ...;  // the unit of work for the next chunk of sentences
    executor.submit(c);
  }
  executor.shutdown();
  executor.awaitTermination(Long.MAX_VALUE, TimeUnit.MILLISECONDS);

The thread pool parameters are a little tricky. Here is a detailed explanation:

By using new ThreadPoolExecutor.CallerRunsPolicy(), the task-generating thread will stall when all threads in the pool are in use. To be more precise, the calling thread will execute a task itself when the queue limit is reached.

maxThreadCount - 1: Since the caller thread also executes tasks, the thread pool size is reduced by one (with a lower bound of 1, so single-core machines still get a valid pool).

new LinkedBlockingDeque<>(maxThreadCount * 2): A small value is chosen for the capacity of the blocking queue. The idea is that, by having some tasks in the queue, the pool threads get new jobs while the caller thread is processing a job itself. If tasks are very irregular in running time, this is not totally perfect. The ThreadPoolExecutor should have a cleaner approach for this use case. The better approach would be to use a SynchronousQueue and to make submit wait until a thread is available. However, the ThreadPoolExecutor has no "always queue" operation mode; instead, it tries to queue the task and calls the RejectionPolicy if queuing is not possible right now.
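
One way to approximate that behavior, sketched here as an assumption rather than as part of the code above: combine a SynchronousQueue with a rejection handler that blocks on put(), so a submit waits until a worker becomes free (Java 8+):

  import java.util.concurrent.*;

  int threads = Runtime.getRuntime().availableProcessors();

  // Rejection handler that blocks the submitting thread until an idle
  // worker takes the task off the synchronous queue.
  RejectedExecutionHandler blockOnFull = (task, pool) -> {
    try {
      pool.getQueue().put(task);  // rendezvous with an idle worker
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RejectedExecutionException("interrupted while submitting", e);
    }
  };

  ExecutorService executor =
    new ThreadPoolExecutor(
      threads, threads,
      1, TimeUnit.SECONDS,
      new SynchronousQueue<>(),
      Executors.defaultThreadFactory(),
      blockOnFull);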

This should do it in your scenario.

There may be loads where you don't know in advance whether they are CPU bound or I/O bound, and, to complicate things, the load may change its behavior during processing. My idea for tackling this is an adaptive algorithm, similar to the TCP congestion avoidance algorithm. Congestion avoidance in TCP is exactly the same sort of problem: "I want maximum throughput, but I don't know my resources". Has anybody worked on this?
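
Purely as a hypothetical illustration of that idea, an AIMD-style controller (the additive-increase/multiplicative-decrease scheme TCP uses) might look like this:

  import java.util.concurrent.ThreadPoolExecutor;

  // Hypothetical controller: call periodically with the throughput measured
  // over the last interval. Grows the pool while throughput improves and
  // backs off when it degrades. Illustrative only, not production code.
  static void adjustPoolSize(ThreadPoolExecutor pool,
                             double throughput, double lastThroughput) {
    int current = pool.getMaximumPoolSize();
    int next = throughput > lastThroughput
        ? current + 1                // additive increase
        : Math.max(1, current / 2);  // multiplicative decrease
    if (next > current) {
      // Grow: raise the maximum first so core size never exceeds it.
      pool.setMaximumPoolSize(next);
      pool.setCorePoolSize(next);
    } else {
      // Shrink: lower the core size first for the same reason.
      pool.setCorePoolSize(next);
      pool.setMaximumPoolSize(next);
    }
  }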

cruftex
  • 5,545
  • 2
  • 20
  • 36
0

Also, why is the logic that allocating more threads means more can be done at once incorrect?

Are you asking why a computation running with N threads on an N-core machine takes longer than T/N time to complete, where T is the time to do the same computation with just one thread?

Google "Amdahl's Law." It's rarely the case that 100% of the work can be done in parallel. Usually there is something, even if it's only startup/shutdown logic, that has to be done serially. The bits that have to be done serially have a big impact when you measure that speedup ratio.

Solomon Slow
  • 25,130
  • 5
  • 37
  • 57