Multithreading not speeding up the process in Java

Question

I'm trying to run 4 threads at the same time, to hopefully speed up a process (that can be done in parallel with no problem). For now, each thread calls the same function (that always takes roughly the same amount of time), and the threads are started (almost) at the same time. Using print statements I've checked how much time was passing between each threadX.join(), and the time is exactly the same as it would be if I ran the function four times sequentially, but in the control panel I can see that the 4 threads are indeed used (at the beginning 4 of them, then once the first is joined just 3, then two, etc).

This is my code:

long start = System.currentTimeMillis();

        Foo thread01 = new Foo();
        Foo thread02 = new Foo();
        Foo thread03 = new Foo();
        Foo thread04 = new Foo();

        System.out.println("Created all thread0x");

        Thread thread1 = new Thread(thread01);
        Thread thread2 = new Thread(thread02);
        Thread thread3 = new Thread(thread03);
        Thread thread4 = new Thread(thread04);

        thread1.start();
        thread2.start();
        thread3.start();
        thread4.start();

        try {
            thread1.join();
            thread2.join();
            thread3.join();
            thread4.join();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        double value1 = thread01.getResult();
        double value2 = thread02.getResult();
        double value3 = thread03.getResult();
        double value4 = thread04.getResult();

        System.out.println("Result1: " + value1);
        System.out.println("Result2: " + value2);
        System.out.println("Result3: " + value3);
        System.out.println("Result4: " + value4);

        long end = System.currentTimeMillis();
        long elapsedTime = end - start;
        System.out.println("Elapsed time: " + elapsedTime / 1000.0);

What am I doing wrong, or what have I not understood?

Here is the code of the test class.

public class Foo implements Runnable {
    private double result;

    @Override
    public void run() {
        fakeMiniMax(0);
    }

    public double fakeMiniMax(int iterationNumber) {
        int total = 0;

        if (iterationNumber > 6) {
            this.result = 100;
            return total;
        }
        else {
            this.result = -1;
            return this.fakeMiniMax(iterationNumber + 1) + this.fakeMiniMax(iterationNumber + 1) - this.fakeMiniMax(iterationNumber + 1) + this.fakeMiniMax(iterationNumber + 1) / 20 +  this.fakeMiniMax(iterationNumber + 1) - 45 -  this.fakeMiniMax(iterationNumber + 1) - this.fakeMiniMax(iterationNumber + 1) -  this.fakeMiniMax(iterationNumber + 1) -  this.fakeMiniMax(iterationNumber + 1) -  this.fakeMiniMax(iterationNumber + 1) +  this.fakeMiniMax(iterationNumber + 1) +  this.fakeMiniMax(iterationNumber + 1) + this.fakeMiniMax(iterationNumber + 1) + this.fakeMiniMax(iterationNumber + 1) - this.fakeMiniMax(iterationNumber + 1) -  this.fakeMiniMax(iterationNumber + 1) - this.fakeMiniMax(iterationNumber + 1) -  this.fakeMiniMax(iterationNumber + 1) - this.fakeMiniMax(iterationNumber + 1) -  this.fakeMiniMax(iterationNumber + 1) - this.fakeMiniMax(iterationNumber + 1) -  this.fakeMiniMax(iterationNumber + 1) + this.fakeMiniMax(iterationNumber + 1) +  this.fakeMiniMax(iterationNumber + 1) + this.fakeMiniMax(iterationNumber + 1) +  this.fakeMiniMax(iterationNumber + 1) + this.fakeMiniMax(iterationNumber + 1) +  this.fakeMiniMax(iterationNumber + 1) + this.fakeMiniMax(iterationNumber + 1) +  this.fakeMiniMax(iterationNumber + 1) + this.fakeMiniMax(iterationNumber + 1) +  this.fakeMiniMax(iterationNumber + 1);
        }
    }

    double getResult() {
        return result;
    }
}

My hardware:

4 cores
8 threads
16GB of RAM
The task only uses the CPU continuously until the method returns its result.
The code of the task is all written above

@camickr That's interesting, I thought the point of multithreading was also to improve performance of a task that can be done in parallel. Is there any other way to speed up a program that would use the CPU a lot? And, is the result I'm having normal then? Thank you. — Enrico Cortinovis, May 08 '21 at 16:21
@BuildSlayer I added it to my question. This was a simple test class that was simulating a "MiniMax" algorithm that I'm going to replace it with (if I manage to get it to work with this simpler example). It's basically a recursive algorithm that gets called 20^6 times in total. — Enrico Cortinovis, May 08 '21 at 16:58
@camickr On the contrary, CPU-bound tasks work very well concurrently if run on hardware with multiple cores. — Basil Bourque, May 08 '21 at 18:01
As a test case your `Foo` class rather fails. Are we actually supposed to do the work to figure out how many iterations that class uses recursively? I think a simple loop would be much more readable here. — markspace, May 08 '21 at 18:03
@markspace No. It was also useful to help me understand how many "ply" deep I could have searched in a minute (6), to then see the difference with multithreading. — Enrico Cortinovis, May 08 '21 at 18:04
Were you expecting the recursion to be different? Recursion runs on a single call stack and in a single thread of execution. You won't go "deeper" because of multi-threading. You'll get four times the work done (four cores, four threads) but each thread goes no deeper. — markspace, May 08 '21 at 18:07
@markspace That's right, In my last comment I did not explain myself well. My goal was to see IF I was able to make it run faster, and by how much, to then change the way it is ran so that multithreading is useful. Right now I'm just running it four times instead of one, so practically it would be useless. What I was expecting was that it would have taken roughly the same amount of time as running it once, on one thread only. — Enrico Cortinovis, May 08 '21 at 18:11
OK, it's possible this is just your OS scheduling. Now that you have changed the `Foo` class (again), its operation looks like it's too short to run for a while, so the OS just runs each short task sequentially, because it can. Concurrency is permitted in multithreading, but not guaranteed. Remember the OS has other functions, and the cores may be in use for that. — markspace, May 08 '21 at 18:18
Another thing I see is that you're using `System.currentTimeMillis()` which is not very accurate for short time periods. `System.nanoTime()` would be much better. — markspace, May 08 '21 at 18:20
@markspace Thank you very much, I will use that instead. And about changing the `Foo` class, unfortunately someone else had changed it (making it very unreadable) and I had to change it back again (currently it's the correct version). Also, the program runs for about a minute like that, because each time I'm calling the method 20 times, so it gets called around 20^6 times, so it should indeed run enough time to let the system put the 4 tasks in parallel. And it seems that they do, since in the control panel it says that Java is using 60-70% CPU (when using a single thread it stays under 20%). — Enrico Cortinovis, May 08 '21 at 18:30

Basil Bourque · Answer 1 · 2021-05-09T06:25:27.567

tl;dr

I too see strange results where running your recursive tasks concurrently takes much more time, not less. I have no idea why.

Details

At a glance, you code looks correct, though I did not study the recursive calls. (Minor issue: My IDE says your FakeMiniMax#getResult method is never called.)

Executors framework

But… your code is obsolete. In modern Java, we rarely need to address the Thread class directly. Instead, we use the Executors framework added to Java 5. That framework relieves us of the chore of juggling threads and managing thread pools.

As commented by markspace, use System.nanoTime for more accurate benchmark timing. If you really want to get serious with benchmarking, use Java Microbenchmark Harness (JMH). Also, use Duration to track elapsed time rather than doing the math (/1_000).

`Callable`

You want to gather a result. So we use Callable rather than Runnable.

`Double` return object

Lets use an object Double as the return value from our recursive method, rather than the primitive double. We must do this to parameterize the type of our Callable.

`Future` objects

When submitting a task to an executor service, we get back a Future object. This object can be used to track the completion status. We will retrieve the result of the task from this Future object: a Double object. As we submit our multiple tasks to the executor service, we collect each Future object returned to us.

We can submit multiple tasks to an executor service, then wait for them to finish. The current way to do so is to first call ExecutorService#shutdown, followed by a call to awaitTermination with a time-out. When the submitted tasks are done or cancelled, flow-of-control moves beyond the awaitTermination call.

Example code

Here is my revised version of your FakeMiniMax class.

package work.basil.demo.threadmark;

import java.time.Instant;
import java.util.concurrent.Callable;

public class FakeMiniMax implements Callable < Double >
{
    private double result;

    // `Callable` interface
    @Override
    public Double call ( ) throws Exception
    {
        System.out.println( "INFO - The `call` method of `FakeMiniMax` is starting to run in thread ID " + Thread.currentThread().getId() + " at " + Instant.now() + "." );
        return fakeMiniMax( 0 );
    }

    public double fakeMiniMax ( int iterationNumber )
    {
        int total = 0;

        if ( iterationNumber > 6 )
        {
            this.result = 100;
            return total;
        } else
        {
            this.result = - 1;
            return this.fakeMiniMax( iterationNumber + 1 ) + this.fakeMiniMax( iterationNumber + 1 ) - this.fakeMiniMax( iterationNumber + 1 ) + this.fakeMiniMax( iterationNumber + 1 ) / 20 + this.fakeMiniMax( iterationNumber + 1 ) - 45 - this.fakeMiniMax( iterationNumber + 1 ) - this.fakeMiniMax( iterationNumber + 1 ) - this.fakeMiniMax( iterationNumber + 1 ) - this.fakeMiniMax( iterationNumber + 1 ) - this.fakeMiniMax( iterationNumber + 1 ) + this.fakeMiniMax( iterationNumber + 1 ) + this.fakeMiniMax( iterationNumber + 1 ) + this.fakeMiniMax( iterationNumber + 1 ) + this.fakeMiniMax( iterationNumber + 1 ) - this.fakeMiniMax( iterationNumber + 1 ) - this.fakeMiniMax( iterationNumber + 1 ) - this.fakeMiniMax( iterationNumber + 1 ) - this.fakeMiniMax( iterationNumber + 1 ) - this.fakeMiniMax( iterationNumber + 1 ) - this.fakeMiniMax( iterationNumber + 1 ) - this.fakeMiniMax( iterationNumber + 1 ) - this.fakeMiniMax( iterationNumber + 1 ) + this.fakeMiniMax( iterationNumber + 1 ) + this.fakeMiniMax( iterationNumber + 1 ) + this.fakeMiniMax( iterationNumber + 1 ) + this.fakeMiniMax( iterationNumber + 1 ) + this.fakeMiniMax( iterationNumber + 1 ) + this.fakeMiniMax( iterationNumber + 1 ) + this.fakeMiniMax( iterationNumber + 1 ) + this.fakeMiniMax( iterationNumber + 1 ) + this.fakeMiniMax( iterationNumber + 1 ) + this.fakeMiniMax( iterationNumber + 1 );
        }
    }

    double getResult ( )
    {
        return result;
    }
}

And here is the App class that acts as a test harness.

package work.basil.demo.threadmark;

import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.concurrent.*;

public class App
{
    public static void main ( String[] args )
    {
        System.out.println( "INFO - The `main` method of `App` is starting to run in thread ID " + Thread.currentThread().getId() + " at " + Instant.now() + "." );
        long start = System.nanoTime();

        ExecutorService executorService = Executors.newFixedThreadPool( 4 );  // Executors.newSingleThreadExecutor() or  Executors.newFixedThreadPool( 4 )

        // Populate a list of tasks to be executed on background threads.
        int countTasks = 4;
        List < FakeMiniMax > tasks = new ArrayList <>( countTasks );
        for ( int nthTask = 0 ; nthTask < countTasks ; nthTask++ )
        {
            tasks.add( new FakeMiniMax() );
        }

        // Submit list of tasks to executor service. Collect each task’s `Future` object.
        List < Future < Double > > futures = null;
        try
        {
            futures = executorService.invokeAll( tasks );
            executorService.shutdown(); // Prevents adding any more tasks.
            try { executorService.awaitTermination( Duration.ofSeconds( 10 ).toMillis() , TimeUnit.MILLISECONDS ); } catch ( InterruptedException e ) { e.printStackTrace(); }

            // Review the list of `Future` objects collected for each of all the tasks.
            for ( Future < Double > future : futures )
            {
                try { System.out.println( "future.get() = " + future.get() ); } catch ( InterruptedException e ) { e.printStackTrace(); } catch ( ExecutionException e ) { e.printStackTrace(); }
            }
        }
        catch ( InterruptedException e )
        {
            e.printStackTrace();
        }
        finally
        {
            if ( Objects.nonNull( executorService ) && ! executorService.isShutdown() )
            {
                executorService.shutdownNow(); // Prevents adding any more tasks.
            }
        }

        Duration elapsed = Duration.ofNanos( System.nanoTime() - start );
        System.out.println( "Our " + countTasks + " tasks took " + elapsed + "." );
        System.out.println( "INFO - The `main` method of `App` is ending its run in thread ID " + Thread.currentThread().getId() + " at " + Instant.now() + "." );
    }
}

Notice that we can swap out our chosen implementation of ExecutorService. We can use a single-threaded executor service to run our tasks sequentially using only a single core. Or we can switch to a multi-threaded executor service backed by a fixed pool of threads. Just change that one line of code to run this test yourself in both modes.

Test results

When I run this on a fairly busy Mac mini (way too many web browser windows & tabs open!) with an Intel chip containing six cores but no Hyper-Threading, 32 gigs RAM, and Java 16, I get the following results.

Single-threaded

With a single thread, elapsed time is about 2.5 minutes.

Notice how the timestamps show each call method running sequentially over time. The second does not start until the first is done, and so on.

Console output:

INFO - The `main` method of `App` is starting to run in thread ID 1 at 2021-05-09T06:05:39.348246Z.
INFO - The `call` method of `FakeMiniMax` is starting to run in thread ID 14 at 2021-05-09T06:05:39.417155Z.
INFO - The `call` method of `FakeMiniMax` is starting to run in thread ID 14 at 2021-05-09T06:06:30.348844Z.
INFO - The `call` method of `FakeMiniMax` is starting to run in thread ID 14 at 2021-05-09T06:07:07.385369Z.
INFO - The `call` method of `FakeMiniMax` is starting to run in thread ID 14 at 2021-05-09T06:07:43.701939Z.
future.get() = -53874.172111640604
future.get() = -53874.172111640604
future.get() = -53874.172111640604
future.get() = -53874.172111640604
Our 4 tasks took PT2M38.344593758S.
INFO - The `main` method of `App` is ending its run in thread ID 1 at 2021-05-09T06:08:17.737893Z.

Four threads

With a thread-pool of four threads, elapsed time is 7 minutes!

I have no idea why this is taking more time rather than the single-thread time, 7 minutes versus 2.5 minutes.

Notice how the timestamps show all four tasks starting at the same moment, within the same millisecond. Then time goes by as all four do their computational work simultaneously.

Console output:

INFO - The `main` method of `App` is starting to run in thread ID 1 at 2021-05-09T05:52:53.511070Z.
INFO - The `call` method of `FakeMiniMax` is starting to run in thread ID 15 at 2021-05-09T05:52:53.540791Z.
INFO - The `call` method of `FakeMiniMax` is starting to run in thread ID 14 at 2021-05-09T05:52:53.540791Z.
INFO - The `call` method of `FakeMiniMax` is starting to run in thread ID 16 at 2021-05-09T05:52:53.541806Z.
INFO - The `call` method of `FakeMiniMax` is starting to run in thread ID 17 at 2021-05-09T05:52:53.541910Z.
future.get() = -53874.172111640604
future.get() = -53874.172111640604
future.get() = -53874.172111640604
future.get() = -53874.172111640604
Our 4 tasks took PT7M6.872664491S.
INFO - The `main` method of `App` is ending its run in thread ID 1 at 2021-05-09T06:00:00.406725Z.

Conclusion

In my experiments running a revised version of your CPU-intensive code, like you I get crazy results. Rather than saving time, concurrency takes more time.

I am left befuddled by these results. I expected using four threads with six cores would cut the time by 3/4, but instead I see the time more than double. I ran this test at least a dozen times. For the single-threaded mode, I can see one of my six cores jumps to nearly 100% utilization for the whole time. In multi-threaded mode, I clearly see all 6 cores (not 4!) jump from 10-20% utilization to about 70-90% utilization. So the threads certainly seem to be running simultaneously across cores.

I hope that some one can study this code to offer an explanation.

My only uneducated wild thought: Is there only a single Floating-Point Unit (FPU) on my 3 GHz Intel Core i5 chip being shared across cores and therefore acting as a bottleneck for all your recursive double math operations? Even if that were the case, why so much slower?

Tip: Notice how we include a call to Instant.now on every call to System.out.println. Study these timestamps! When accessing System.out across threads, the output may be interleaved in unexpected ways. Never assume the sequential lines of console happened chronologically.

score -3 · Answer 2 · answered May 08 '21 at 16:22

-3

Thread.join is a blocking operation. There is a chance the threads are being executed in the same order you call join. Anyway, it's not the best option to handle threads manually in Java. There are better alternatives. In your case, I would recommend you to try Future and CompletionService. Here is an example

answered May 08 '21 at 16:22

Semyon Kirekov

1,237
8
20

Thank you for the answer. As I wrote in the question, from the control panel I can see that the program is indeed using roughly 4x the CPU power than before, so it looks like the threads are indeed running the function at the same time. But then for some reason they keep running it later, or even run it twice? This is what I don't understand. – Enrico Cortinovis May 08 '21 at 16:25
There are hundreds of threads running in the OS. You can't be sure which one will be selected to run next time – Semyon Kirekov May 08 '21 at 16:28
That doesn't have to do with my question. If I run one thread only, I see that the "OpenJDK Platform binary" process uses 20% CPU. If I run four threads, I see that that same process starts to use 60-70% of my CPU. – Enrico Cortinovis May 08 '21 at 16:29
1

**join()** blocks the current thread, not all the other threads. So calling **join()** for all threads is not going to slow the code down. Other threads are going to work in parallel. – Ilya Sazonov May 09 '21 at 06:43