
I'm trying to do a simple experiment to find out the right size for a thread pool when you have a bunch of CPU-intensive tasks.

I already know that this size should be equal to the number of cores on the machine, but I want to prove that empirically. Here is the code:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class Main {

    public static void main(String[] args) throws ExecutionException, InterruptedException {
        List<Future<?>> futures = new ArrayList<>();
        ExecutorService threadPool = Executors.newFixedThreadPool(4);

        long startTime = System.currentTimeMillis();

        // Submit 100 CPU-bound tasks to the pool.
        for (int i = 0; i < 100; i++) {
            futures.add(threadPool.submit(new CpuBoundTask()));
        }

        // Block until every task has finished.
        for (int i = 0; i < futures.size(); i++) {
            futures.get(i).get();
        }

        long endTime = System.currentTimeMillis();
        System.out.println("Time = " + (endTime - startTime));
        threadPool.shutdown();
    }

    static class CpuBoundTask implements Runnable {
        @Override
        public void run() {
            // Pure computation, no I/O: roughly 700 ms per task.
            int a = 0;
            for (int i = 0; i < 90000000; i++) {
                a = (int) (a + Math.tan(a));
            }
        }
    }
}

Each task executes in about 700 milliseconds (which I think is enough for it to be preempted by the thread scheduler at least once).

I'm running this on a MacBook Pro 2017, 3.1 GHz Intel Core i5, 2 physical cores with hyper-threading enabled, so 4 logical CPUs.

I adjusted the size of the thread pool and ran this program multiple times (averaging the timings). Here are the results:

1 thread = 57 seconds
2 threads = 29 seconds
4 threads = 18 seconds
8 threads = 18.1 seconds
16 threads = 18.2 seconds
32 threads = 17.8 seconds
64 threads = 18.2 seconds

I was expecting the execution time to be significantly higher once I added many more threads than CPU cores, because of context-switch overhead, but that doesn't seem to happen.

I used VisualVM to monitor the program, and it looks like all the threads get created and are in the running state, as expected. Also, the CPU seems to be used properly (close to 95%).

Is there something that I'm missing?

– Cosmin Ioniță

4 Answers


In this case you should use System.nanoTime() instead of System.currentTimeMillis(): nanoTime() reads a monotonic clock meant for measuring elapsed time, while currentTimeMillis() reflects wall-clock time, which the system can adjust while your program runs.
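A minimal sketch of the timing change, assuming the rest of the question's main method stays the same (TimeUnit is java.util.concurrent.TimeUnit):

long startTime = System.nanoTime();
// ... submit the tasks and wait on all the futures, as in the question ...
long elapsedNanos = System.nanoTime() - startTime;
System.out.println("Time = " + TimeUnit.NANOSECONDS.toMillis(elapsedNanos) + " ms");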

Your algorithm stops scaling at 4 threads. For simplicity's sake, let us assume that all the threads completed the same number of tasks, hence 25 per thread. Each thread took roughly 18 seconds to compute its 25 iterations.

In a very simplistic model, when you run with 64 threads you have 16 threads per logical CPU: during the first 4 iterations only 4 threads are running (1 per logical CPU) in parallel, while the other 60 threads sit idle, waiting for CPU resources to compute their iterations. So you have something like:

Iteration 0 : Thread 1 (running)
Iteration 1 : Thread 2 (running)
Iteration 2 : Thread 3 (running)
Iteration 3 : Thread 4 (running)
Iteration 4 : Thread 5 (waiting)
Iteration 5 : Thread 6 (waiting)
Iteration 6 : Thread 7 (waiting)
Iteration 7 : Thread 8 (waiting)
...
Iteration 63 : Thread 64 (waiting)

When those 4 threads finish their iterations, they each pick up another one. In the meantime, let us say that threads 5 to 8 start working on the next four iterations (again 4 threads performing work in parallel) while the other threads are blocked waiting for CPU, and so on. So you always have 4 threads running in parallel regardless, and that is why for:

8 threads = 18.1 seconds
16 threads = 18.2 seconds
32 threads = 17.8 seconds
64 threads = 18.2 seconds

you get around the same execution time: approximately the time it took 4 threads to finish 25 iterations each in parallel.

Because this is a CPU-bound algorithm free of problems such as:

  1. synchronization;
  2. load imbalance (i.e., each loop iteration takes approximately the same execution time);
  3. memory bandwidth saturation;
  4. cache invalidation;
  5. false sharing;

increasing the number of threads per core does not affect the overall execution time much.

– dreamcrash

First off, the assumption that context switch overhead increases with the number of threads is not always correct. Your sample program performs a fixed amount of work: the more threads you have, the less work each thread does and the less CPU time it receives.

Even when you have hundreds of threads, the OS will not switch between them arbitrarily often. There is usually a minimum interval (time slice) that a thread is allowed to run without preemption. With too many threads competing for a physical core, each thread receives its CPU time slice less often (i.e. starvation), but the number of context switches does not grow proportionally to the number of threads.

I measured the number of context switches in your program with Linux perf:

perf stat -e context-switches java Main

And here are the results:

 2 threads | 1,445 context-switches
 4 threads | 2,417 context-switches
 8 threads | 9,280 context-switches
16 threads | 9,257 context-switches
32 threads | 9,527 context-switches
64 threads | 9,986 context-switches

A great leap in the number of context switches happens, as expected, once there are more threads than physical CPUs, but afterwards the count does not grow much.

OK, we see roughly 10K context switches. Is that much? The latency of a context switch is commonly estimated at a few microseconds; let's take 10 µs as an upper bound. So 10K switches together take about 100 ms, or 25 ms per CPU. It is unlikely that your test will detect this overhead. Furthermore, all the threads are purely CPU-bound: they do not even access enough memory to suffer from CPU cache pollution, and they do not touch other shared resources either, so there is no indirect context switch overhead in this case.

– apangin

I was expecting the execution time to be significantly higher, once I add so many threads (more than the number of CPU cores), because of the context switch overhead, but it seems that this doesn't really happen.

It's going to be very hard to detect this, for a number of reasons. First off, modern operating systems are extremely good at optimizing for this use case. Context switching used to be a big hammer, but with modern memory architectures it is much less costly.

The main penalty for context switching is memory cache flushing. When a thread is swapped onto a CPU, the local cache may not hold any of the per-thread data it needs for its calculations, so it has to go to main memory to read the cache lines it needs, which is slower. Getting swapped out is also slower, because any dirty pages have to be written back to main memory.

For this reason, you might see a higher context-switching penalty if your task used a lot more cached memory; your current program just stores a couple of integers. For example, suppose you allocate ~10k per thread at the start of the program and fill it with random values; when each thread runs, it randomly accesses data from its own 10k chunk, pulling that data into the CPU cache. That might be a better experiment (see the sketch below). That said, you are going to have to know a lot about your architecture and tailor the experiment appropriately to fully detect the context switches.
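A sketch of that variation, as a drop-in replacement for CpuBoundTask inside Main (it uses java.util.Random; the 10,000-element buffer size and the access pattern are illustrative assumptions, not tuned values):

    // Hypothetical variation: each task owns a buffer big enough to
    // compete for CPU cache, so a context switch is likelier to evict it.
    static class MemoryTouchingTask implements Runnable {
        private final int[] data = new int[10_000];   // ~40 KB of per-task state (assumed size)
        private final java.util.Random random = new java.util.Random();

        MemoryTouchingTask() {
            // Fill the buffer with random values up front.
            for (int i = 0; i < data.length; i++) {
                data[i] = random.nextInt();
            }
        }

        @Override
        public void run() {
            long sum = 0;
            for (int i = 0; i < 90_000_000; i++) {
                // Random accesses keep this task's working set competing for cache.
                sum += data[random.nextInt(data.length)];
            }
            // Print the result so the JIT cannot dead-code-eliminate the loop.
            System.out.println(Thread.currentThread().getName() + ": " + sum);
        }
    }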

Lastly, like any Java benchmark, you should let the program run for a while first, so that JIT compilation and other JVM optimizations settle, and only then collect data over a long period. A test that takes 18 seconds is exercising the JVM more than it is your test code; you might see a measurable difference if you ran for (let's say) 1800 seconds. And, as @dreamcrash mentioned, System.nanoTime() should be used for fine-grained timing calculations like this.
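As a rough sketch, a warm-up pass before the measured run could look like this (the warm-up count is an arbitrary assumption; a benchmark harness such as JMH handles warm-up far more rigorously):

// Hypothetical warm-up: run a few tasks first so JIT compilation settles,
// then measure the real batch.
for (int i = 0; i < 5; i++) {
    new CpuBoundTask().run();   // warm-up results are discarded
}

long start = System.nanoTime();
// ... submit the real tasks to the pool and wait on the futures ...
long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
System.out.println("Time = " + elapsedMs + " ms");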

– Gray

Executors.newWorkStealingPool

If you are using Java 8 or later, use a work-stealing thread pool via Executors.newWorkStealingPool(), as it may give the best results:

ExecutorService es = Executors.newWorkStealingPool();

Creates a work-stealing thread pool using all available processors as its target parallelism level. The parallelism level corresponds to the maximum number of threads actively engaged in, or available to engage in, task processing. The actual number of threads may grow and shrink dynamically. A work-stealing pool makes no guarantees about the order in which submitted tasks are executed.
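For instance, a minimal sketch of plugging it into the question's program (reusing the question's CpuBoundTask, inside a main that declares the same checked exceptions; only the pool construction changes):

ExecutorService es = Executors.newWorkStealingPool();  // parallelism = available processors
List<Future<?>> futures = new ArrayList<>();
for (int i = 0; i < 100; i++) {
    futures.add(es.submit(new CpuBoundTask()));
}
for (Future<?> f : futures) {
    f.get();   // tasks may complete in any order with a work-stealing pool
}
es.shutdown();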

– prasune
  • Thank you for your answer! As far as I know, `newWorkStealingPool` will create a ForkJoinPool, which works best on a divide-and-conquer execution model, which is not what I'm trying to test here (I want to know how to get the most out of ThreadPoolExecutor) – Cosmin Ioniță Dec 19 '20 at 19:24
  • You are not answering the actual question here or addressing the problem. – akuzminykh Dec 19 '20 at 21:30