
I'm playing with parallel execution in Java now. Yesterday I tried to measure execution time and got some unclear results.

Task: sum an array both in parallel and sequentially. Here is my code for the parallel version:

public static int sumArrayParallel(int[] numbers) throws ExecutionException, InterruptedException {
    int cpus = Runtime.getRuntime().availableProcessors();
    ExecutorService service = Executors.newFixedThreadPool(cpus);
    List<FutureTask<Integer>> tasks = new ArrayList<>();
    int blockSize = (numbers.length + cpus - 1) / cpus;

    for (int i = 0; i < numbers.length; i++) {
        final int start = blockSize * i;
        final int end = Math.min(blockSize * ( i + 1 ), numbers.length);
        FutureTask<Integer> futureTask = new FutureTask<Integer>(new Callable<Integer>() {
            public Integer call() {
                int sub = 0;
                for (int j = start; j < end; j++)
                    sub += numbers[j];
                return sub;
            }
        });
        tasks.add(futureTask);
        service.execute(futureTask);
    }
    int sum = 0;
    for(Future<Integer> task: tasks)
        sum += task.get();
    service.shutdown();
    return sum;
}

And pretty simple for sequential:

public static int sumArraySequential(int[] arr) {
    int sum = 0;
    for (int num : arr) {
        sum += num;
    }
    return sum;
}

So, the sequential function runs 2-4 times faster than the parallel one. What am I doing wrong?

YoungHobbit
TK-95

4 Answers


Summing is, processing-wise, a really simple task: an addition takes a single CPU cycle.

Getting data out of memory is a very expensive task. Depending on your array size, it probably lives in main memory and not in any of the L1, L2, L3 caches. Getting data out of main memory takes hundreds of CPU cycles.

Now when you do the summing sequentially, on a single thread, the CPU assumes that you will need more memory from the part that you're processing and loads it preemptively in the L1/L2/L3 caches. This optimization basically completely nullifies the "hundreds of CPU cycles" to get the data from main memory, because the data is already in the cache by the time you want to sum it.

When you now try to parallelize the task, you're splitting up the array into multiple chunks. The prefetcher doesn't know which parts to load into the cache, because the chunks can be executed in any order. For the parallel tasks, you will probably not have the data in the cache already, resulting in having to wait hundreds of CPU cycles to get it from main memory.

So in the end, your task is not limited by the amount of processing your CPU can do (which is increased by parallelization) but the amount and quickness of getting the data from memory (which is easier to optimize in a single sequential program). This likely explains your "unexpected" results.

Also, depending on your input size, the initialization of the threads takes more time than processing, but I can only assume that you're using large array sizes so that doesn't matter much.
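If you want to see how much input size matters, a rough sketch like the following can help (the class and method names here are mine, not from the question; System.nanoTime timings are noisy, so a proper harness such as JMH would give more trustworthy numbers):

```java
import java.util.Arrays;

public class SumTiming {
    public static int sumSequential(int[] arr) {
        int sum = 0;
        for (int num : arr) sum += num;
        return sum;
    }

    public static void main(String[] args) {
        // Warm up the JIT a little before timing anything.
        int[] warmup = new int[100_000];
        for (int i = 0; i < 50; i++) sumSequential(warmup);

        // Time the same loop over increasingly large arrays.
        for (int size : new int[]{1_000, 1_000_000, 10_000_000}) {
            int[] data = new int[size];
            Arrays.fill(data, 1);
            long t0 = System.nanoTime();
            int sum = sumSequential(data);
            long ns = System.nanoTime() - t0;
            System.out.println(size + " elements: " + ns + " ns (sum = " + sum + ")");
        }
    }
}
```

For small sizes the fixed thread-pool setup cost dominates any parallel version, while for large sizes memory bandwidth becomes the bottleneck.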

Timo Türschmann

In the sequential version you only use primitives, which is inherently fast.

In the parallel, or concurrent, version you create a number of objects, which incur overheads both in creation and in use.

You don't say what array sizes you tested this with. I would guess the performance would be relatively better for larger values of numbers.length.

teraspora

Your code is not right.

You're creating N tasks (one per element), while you should be creating M tasks (one per block). :-)

Fix your main loop

for (int i = 0; i < numbers.length; i++) {

to iterate on the blocks, not on the elements.

PS: if you change your code a little bit, you'll clearly see what's happening:

    int sum = 0;
    for(Future<Integer> task: tasks) {
        sum += task.get();
        System.out.println(sum);
    }
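For reference, here is a sketch of the method with that one fix applied: the loop runs over the cpus blocks instead of over the elements. Everything else mirrors the question's code (the class name BlockSum is mine, and the lambda is just a shorter way to write the Callable):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BlockSum {
    public static int sumArrayParallel(int[] numbers)
            throws ExecutionException, InterruptedException {
        int cpus = Runtime.getRuntime().availableProcessors();
        ExecutorService service = Executors.newFixedThreadPool(cpus);
        List<Future<Integer>> tasks = new ArrayList<>();
        int blockSize = (numbers.length + cpus - 1) / cpus;

        // One task per block: iterate over the blocks, not the elements.
        for (int i = 0; i < cpus; i++) {
            final int start = blockSize * i;
            final int end = Math.min(blockSize * (i + 1), numbers.length);
            tasks.add(service.submit(() -> {
                int sub = 0;
                for (int j = start; j < end; j++)
                    sub += numbers[j];
                return sub;
            }));
        }

        int sum = 0;
        for (Future<Integer> task : tasks)
            sum += task.get();
        service.shutdown();
        return sum;
    }
}
```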
Leo

First off, as Leo says, you have to fix your loop so that you do not create numbers.length tasks.

Secondly, as others also state, your sequential solution may be faster because of your input size, and because you are perhaps also measuring task creation.

To get better results measuring what you want, I propose that you:

  • Make sure the computer you are using has enough cores to actually do a parallel run that could be faster.
  • Use a large input array (millions of elements at least).
  • Take a look at https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/CyclicBarrier.html. It is a class that gives you options for waiting for threads to "meet up" and even "rendezvous" later on. Say you agree to meet with your 5 friends at 9:00: you wait for your friends, and you may all be gathered at 9:00 or perhaps at 9:05; in any case, you wait. Then you might agree to do separate things and meet again at 11:00, and so on. This is useful because of where you can put the barrier:

Create a CyclicBarrier, and put barrier.await() as the first statement of your call method. Then in your main method you also call barrier.await(), and right after every thread has reached the barrier you start your benchmarking. This way you don't have to measure thread creation and start-up performance, though this may in fact be relevant to you! That depends on the semantics of your problem.
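A minimal sketch of that idea, using raw threads so the barrier is easy to see (class and method names are mine; the barrier counts the worker threads plus the main thread, so the clock starts only after every worker has been created and is waiting):

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

public class BarrierDemo {
    public static long timedParallelSum(int[] numbers, int threads) throws Exception {
        // One extra party for the main thread, which starts the clock.
        CyclicBarrier barrier = new CyclicBarrier(threads + 1);
        long[] partial = new long[threads];
        Thread[] workers = new Thread[threads];
        int blockSize = (numbers.length + threads - 1) / threads;

        for (int i = 0; i < threads; i++) {
            final int id = i;
            final int start = blockSize * i;
            final int end = Math.min(blockSize * (i + 1), numbers.length);
            workers[i] = new Thread(() -> {
                try {
                    barrier.await();  // wait until every worker is ready
                } catch (InterruptedException | BrokenBarrierException e) {
                    return;
                }
                long sub = 0;
                for (int j = start; j < end; j++)
                    sub += numbers[j];
                partial[id] = sub;
            });
            workers[i].start();
        }

        barrier.await();  // all workers created and waiting: start timing now
        long t0 = System.nanoTime();
        for (Thread w : workers) w.join();
        System.out.println("summing took " + (System.nanoTime() - t0) + " ns");

        long sum = 0;
        for (long p : partial) sum += p;
        return sum;
    }
}
```

The measured interval then covers only the summing work, not thread construction and start-up.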

PNS