
I've tested some simple scenarios:
Consider an int array of length 10,000,000. Filling it:

  1. using a single (main) thread.
  2. using two worker threads and joining them until they finish. The first fills from the beginning of the array to the middle; the second from the end to the middle.
  3. using an ExecutorService with a fixed pool of 2 threads, calling execute and waiting for termination.
  4. using a ForkJoinPool with the default number of workers (the number of available processors).

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;
import java.util.concurrent.TimeUnit;

public class PerformanceTest {
    private static final int ARRAY_LENGTH = 10_000_000;
    private static int[] array;
    private static final int ITERATIONS = 10;

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < ITERATIONS; i++) {
            array = new int[ARRAY_LENGTH];
            long millis = System.currentTimeMillis();
            singleWorkerFill();
            System.out.println("Single worker: " + (System.currentTimeMillis() - millis));

            array = new int[ARRAY_LENGTH];
            millis = System.currentTimeMillis();
            doubleWorkerFill();
            System.out.println("Double worker: " + (System.currentTimeMillis() - millis));

            array = new int[ARRAY_LENGTH];
            millis = System.currentTimeMillis();
            executorWorkersFill();
            System.out.println("Executor workers: " + (System.currentTimeMillis() - millis));

            array = new int[ARRAY_LENGTH];
            millis = System.currentTimeMillis();
            forkJoinWorkersFill();
            System.out.println("ForkJoin workers: " + (System.currentTimeMillis() - millis));

            System.out.println("---------------------------------------------");
            Thread.sleep(1000);
        }
    }

    private static void singleWorkerFill() {
        for (int i = 0, len = array.length; i < len; i++) {
            array[i] = i;
        }
    }

    private static void doubleWorkerFill() throws InterruptedException {
        Thread worker1 = new Thread(new HeadArrayFiller());
        Thread worker2 = new Thread(new TailArrayFiller());
        worker1.start();
        worker2.start();
        worker1.join();
        worker2.join();
    }

    private static void executorWorkersFill() throws InterruptedException {
        ExecutorService executorService = Executors.newFixedThreadPool(2);
        executorService.execute(new HeadArrayFiller());
        executorService.execute(new TailArrayFiller());
        executorService.shutdown();
        executorService.awaitTermination(Long.MAX_VALUE, TimeUnit.MILLISECONDS);
    }

    private static void forkJoinWorkersFill() throws InterruptedException {
        ForkJoinPool pool = new ForkJoinPool();
        // Submit both tasks before waiting: invoke() blocks until the task
        // completes, so invoking the two fillers one at a time would run the
        // halves sequentially instead of in parallel. The locals are typed as
        // RecursiveAction so the submit(ForkJoinTask) overload is chosen
        // unambiguously (the fillers also implement Runnable).
        RecursiveAction head = new HeadArrayFiller();
        RecursiveAction tail = new TailArrayFiller();
        pool.submit(head);
        pool.submit(tail);
        pool.shutdown();
        pool.awaitTermination(Long.MAX_VALUE, TimeUnit.MILLISECONDS);
    }

    private static class HeadArrayFiller extends RecursiveAction implements Runnable {
        @Override
        public void run() {
            for (int i = 0, middle = array.length / 2; i <= middle; i++) {
                array[i] = i;
            }
        }

        @Override
        protected void compute() {
            run();
        }
    }

    private static class TailArrayFiller extends RecursiveAction implements Runnable {
        @Override
        public void run() {
            // Start at the last index; array.length - 2 would leave the final element unfilled.
            for (int i = array.length - 1, middle = array.length / 2; i > middle; i--) {
                array[i] = i;
            }
        }

        @Override
        protected void compute() {
            run();
        }
    }
}

I expected the single-threaded model to have no chance against the others, but that's not the case. Here are the test results, in milliseconds:

Iteration   Single   Double   Executor   ForkJoin
    1          7       10        11          6
    2          6        4         5          4
    3          4        4         5          4
    4          5        5         5          4
    5          5        5         4          5
    6          4        4         5          4
    7          4        4         4          5
    8          4        4         4          5
    9          4        4         4          5
   10          5        4         4          4

As you can see, the single-threaded model is faster than the two-worker model at cold start. The fork-join model seems to be the best, as does the ExecutorService one. I suspect some JIT compiler optimizations kick in over the iterations; all approaches perform quite similarly by the end of the test.

Anyway, the main question is: why does the two-threaded model perform the same as the single-threaded one (and even slower on a cold start)? And how can I get close to the twofold speedup I expected?

Thanks

  • You probably need a much larger example to see a big difference between `parallel` and `serial` processing... – brso05 Nov 18 '15 at 19:29

1 Answer


Initializing 10M integers is, for a modern computer, a very fast task, and the gain from doing the work in parallel on two separate cores doesn't compensate for (or barely compensates for) the overhead of starting threads, context-switching between them, coordinating them, etc.
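To get a feel for that fixed overhead in isolation, here is a minimal sketch (mine, not from the answer) that measures only the cost of starting and joining two threads that do nothing; on typical hardware this alone can be a sizable fraction of the ~4-7 ms fill times measured above:

// Minimal sketch: measure only the start/join cost of two idle threads,
// i.e. the fixed overhead that the parallel fill has to amortize.
public class ThreadStartOverhead {
    public static void main(String[] args) throws InterruptedException {
        long start = System.nanoTime();
        Thread t1 = new Thread(() -> { });
        Thread t2 = new Thread(() -> { });
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.printf("Start + join of two idle threads: %.3f ms%n",
                (System.nanoTime() - start) / 1_000_000.0);
    }
}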

Start doing more work at each iteration (sleeping for 5 ms, for example), and the advantage of multi-threading will start to appear.
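For instance, here is a sketch (mine, not part of the original test) using the same head/tail split as the question, but burning CPU per element instead of sleeping, so the two cores are actually kept busy. With enough work per element, the two-thread version should approach the expected 2x speedup:

// Illustrative sketch: same head/tail split, but each element now costs
// real CPU time, so the parallel version should pull clearly ahead.
public class HeavyFill {
    static final int LENGTH = 1_000_000;
    static final double[] array = new double[LENGTH];

    // Artificially expensive per-element work.
    static double heavy(int i) {
        double x = i;
        for (int k = 0; k < 50; k++) {
            x = Math.sin(x) + Math.cos(x);
        }
        return x;
    }

    static void fill(int from, int to) { // fills [from, to)
        for (int i = from; i < to; i++) {
            array[i] = heavy(i);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        long t = System.currentTimeMillis();
        fill(0, LENGTH);
        System.out.println("Single: " + (System.currentTimeMillis() - t) + " ms");

        t = System.currentTimeMillis();
        Thread head = new Thread(() -> fill(0, LENGTH / 2));
        Thread tail = new Thread(() -> fill(LENGTH / 2, LENGTH));
        head.start();
        tail.start();
        head.join();
        tail.join();
        System.out.println("Double: " + (System.currentTimeMillis() - t) + " ms");
    }
}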

– JB Nizet
  • Just a few other possibilities: 1. the operation has further optimization opportunities with vectorization (SIMD instructions); 2. those optimizations may or may not be successfully applied to the "double thread" code, because its loops are not so simple; 3. depending on minute details of memory access patterns, we may get reduced performance from cache-line sharing between cores (see the sketch of point 3 after this thread); 4. the above is heavily impacted by whether the two threads run on the same hyperthreaded core or not; and on and on... – Marko Topolnik Nov 18 '15 at 19:37
  • @MarkoTopolnik this comment should be an answer. You would get my upvote. You know way more about low-level JIT and hardware stuff than I know. – JB Nizet Nov 18 '15 at 19:42
  • 1
    It was supposed to be one, but close votes won the race. Thanks for the virtual upvote, though – Marko Topolnik Nov 18 '15 at 19:45
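To illustrate point 3 in Marko Topolnik's comment above, here is a small standalone sketch (mine, not from the thread) of false sharing: two threads hammering adjacent array slots contend for the same cache line, while slots far apart don't. AtomicLongArray is used so the JIT can't collapse the increment loops; the exact numbers vary a lot by CPU.

import java.util.concurrent.atomic.AtomicLongArray;

// Illustrative sketch of cache-line sharing between cores ("false sharing").
public class FalseSharingDemo {
    private static final AtomicLongArray slots = new AtomicLongArray(128);
    private static final int INCREMENTS = 50_000_000;

    private static long time(int indexA, int indexB) throws InterruptedException {
        Thread a = new Thread(() -> {
            for (int i = 0; i < INCREMENTS; i++) slots.incrementAndGet(indexA);
        });
        Thread b = new Thread(() -> {
            for (int i = 0; i < INCREMENTS; i++) slots.incrementAndGet(indexB);
        });
        long start = System.nanoTime();
        a.start();
        b.start();
        a.join();
        b.join();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        // Adjacent longs almost certainly share a 64-byte cache line;
        // slots 64 elements (512 bytes) apart almost certainly do not.
        System.out.println("Adjacent slots: " + time(0, 1) + " ms");
        System.out.println("Distant slots:  " + time(0, 64) + " ms");
    }
}

Note that in the question's benchmark the two threads only meet at the middle of the array, so this particular effect would be confined to the crossover region.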