
I'm playing with parallel execution in Java now. Yesterday I tried to measure execution time and got some unclear results.

Task: sum an array both in parallel and sequentially. Here is my code for the parallel version:

public static int sumArrayParallel(int[] numbers) throws ExecutionException, InterruptedException {
    int cpus = Runtime.getRuntime().availableProcessors();
    ExecutorService service = Executors.newFixedThreadPool(cpus);
    List<FutureTask<Integer>> tasks = new ArrayList<>();
    int blockSize = (numbers.length + cpus - 1) / cpus;

    for (int i = 0; i < numbers.length; i++) {
        final int start = blockSize * i;
        final int end = Math.min(blockSize * ( i + 1 ), numbers.length);
        FutureTask<Integer> futureTask = new FutureTask<Integer>(new Callable<Integer>() {
            public Integer call() {
                int sub = 0;
                for (int j = start; j < end; j++)
                    sub += numbers[j];
                return sub;
            }
        });
        tasks.add(futureTask);
        service.execute(futureTask);
    }
    int sum = 0;
    for(Future<Integer> task: tasks)
        sum += task.get();
    service.shutdown();
    return sum;
}

And pretty simple for sequential:

public static int sumArraySequential(int[] arr) {
    int sum = 0;
    for (int num : arr) {
        sum += num;
    }
    return sum;
}

So, the sequential function runs 2-4 times faster than the parallel one. What am I doing wrong?

YoungHobbit
TK-95

4 Answers


Summing is, processing-wise, a really simple task: an addition takes a single CPU cycle.

Getting data out of memory is a very expensive task. Depending on your array size, it probably lives in main memory and not in any of the L1, L2, L3 caches. Getting data out of main memory takes hundreds of CPU cycles.

Now when you do the summing sequentially, on a single thread, the CPU assumes that you will need more memory from the part that you're processing and loads it preemptively in the L1/L2/L3 caches. This optimization basically completely nullifies the "hundreds of CPU cycles" to get the data from main memory, because the data is already in the cache by the time you want to sum it.

When you now try to parallelize the task, you're splitting up the array into multiple chunks. The prefetcher doesn't know which parts to load into the cache, because the chunks can be executed in any order. For the parallel tasks, you will probably not have the data in the cache already, resulting in having to wait hundreds of CPU cycles to get it from main memory.

So in the end, your task is not limited by the amount of processing your CPU can do (which is increased by parallelization) but the amount and quickness of getting the data from memory (which is easier to optimize in a single sequential program). This likely explains your "unexpected" results.

Also, depending on your input size, the initialization of the threads takes more time than processing, but I can only assume that you're using large array sizes so that doesn't matter much.
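If you want to see how much input size matters, a rough sketch like the following can help (the class and method names here are mine, not from the question; System.nanoTime timings are noisy, so a proper harness such as JMH would give more trustworthy numbers):

```java
import java.util.Arrays;

public class SumTiming {
    public static int sumSequential(int[] arr) {
        int sum = 0;
        for (int num : arr) sum += num;
        return sum;
    }

    public static void main(String[] args) {
        // Warm up the JIT a little before timing anything.
        int[] warmup = new int[100_000];
        for (int i = 0; i < 50; i++) sumSequential(warmup);

        // Time the same loop over increasingly large arrays.
        for (int size : new int[]{1_000, 1_000_000, 10_000_000}) {
            int[] data = new int[size];
            Arrays.fill(data, 1);
            long t0 = System.nanoTime();
            int sum = sumSequential(data);
            long ns = System.nanoTime() - t0;
            System.out.println(size + " elements: " + ns + " ns (sum = " + sum + ")");
        }
    }
}
```

For small sizes the fixed thread-pool setup cost dominates any parallel version, while for large sizes memory bandwidth becomes the bottleneck.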

Timo Türschmann

In the sequential version you only use primitives, which is inherently fast.

In the parallel, or concurrent, version you create a number of objects, which incur overheads both in creation and in use.

You don't say what array sizes you tested this with. I would guess the performance would be relatively better for larger values of numbers.length.

teraspora

Your code is not right.

You're creating N tasks (one per element), while you should be creating M tasks (one per block). :-)

Fix your main loop

for (int i = 0; i < numbers.length; i++) {

to iterate on the blocks, not on the elements.

PS: if you change your code a little bit, you'll clearly see what's happening:

    int sum = 0;
    for(Future<Integer> task: tasks) {
        sum += task.get();
        System.out.println(sum);
    }
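For reference, here is a sketch of the method with that one fix applied: the loop runs over the cpus blocks instead of over the elements. Everything else mirrors the question's code (the class name BlockSum is mine, and the lambda is just a shorter way to write the Callable):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BlockSum {
    public static int sumArrayParallel(int[] numbers)
            throws ExecutionException, InterruptedException {
        int cpus = Runtime.getRuntime().availableProcessors();
        ExecutorService service = Executors.newFixedThreadPool(cpus);
        List<Future<Integer>> tasks = new ArrayList<>();
        int blockSize = (numbers.length + cpus - 1) / cpus;

        // One task per block: iterate over the blocks, not the elements.
        for (int i = 0; i < cpus; i++) {
            final int start = blockSize * i;
            final int end = Math.min(blockSize * (i + 1), numbers.length);
            tasks.add(service.submit(() -> {
                int sub = 0;
                for (int j = start; j < end; j++)
                    sub += numbers[j];
                return sub;
            }));
        }

        int sum = 0;
        for (Future<Integer> task : tasks)
            sum += task.get();
        service.shutdown();
        return sum;
    }
}
```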
Leo

First off, as Leo says, you have to fix your loop so that you do not create numbers.length tasks.

Secondly, as others also state, your sequential solution may be faster because of your input size, and because you are perhaps also measuring task creation.

To get better results measuring what you want, I propose that you:

  • Make sure the computer you are using has enough cores to actually do a parallel run that could be faster.
  • Use a large input array (millions of elements at least).
  • Take a look at https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/CyclicBarrier.html. It is a class that gives you options for waiting for threads to "meet up" and even "rendezvous" later on. Say you agree to meet with your 5 friends at 9:00: you wait for your friends, and you may all be gathered at 9:00 or perhaps at 9:05; in any case, you wait. Then you might agree to do separate things and meet again at 11:00, and so on. This is useful because of where you can put the barrier:

Create a CyclicBarrier, and put barrier.await() as the first statement of your call method. Then in your main method you also call barrier.await(), and right after every thread has reached the barrier you start your benchmarking. This way you don't have to measure thread creation and start-up performance, though this may in fact be relevant to you! That depends on the semantics of your problem.
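A minimal sketch of that idea, using raw threads so the barrier is easy to see (class and method names are mine; the barrier counts the worker threads plus the main thread, so the clock starts only after every worker has been created and is waiting):

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

public class BarrierDemo {
    public static long timedParallelSum(int[] numbers, int threads) throws Exception {
        // One extra party for the main thread, which starts the clock.
        CyclicBarrier barrier = new CyclicBarrier(threads + 1);
        long[] partial = new long[threads];
        Thread[] workers = new Thread[threads];
        int blockSize = (numbers.length + threads - 1) / threads;

        for (int i = 0; i < threads; i++) {
            final int id = i;
            final int start = blockSize * i;
            final int end = Math.min(blockSize * (i + 1), numbers.length);
            workers[i] = new Thread(() -> {
                try {
                    barrier.await();  // wait until every worker is ready
                } catch (InterruptedException | BrokenBarrierException e) {
                    return;
                }
                long sub = 0;
                for (int j = start; j < end; j++)
                    sub += numbers[j];
                partial[id] = sub;
            });
            workers[i].start();
        }

        barrier.await();  // all workers created and waiting: start timing now
        long t0 = System.nanoTime();
        for (Thread w : workers) w.join();
        System.out.println("summing took " + (System.nanoTime() - t0) + " ns");

        long sum = 0;
        for (long p : partial) sum += p;
        return sum;
    }
}
```

The measured interval then covers only the summing work, not thread construction and start-up.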

PNS