Java Multithreaded vector addition

Question

I am trying to get familiar with java multithreaded applications. I tried to think of a simple application that can be parallelized very well. I thought vector addition would be a good application to do so. However, when running on my linux server (which has 4 cores) I dont get any speed up. The time to execute on 4,2,1 threads is about the same.

Here is the code I came up with:

   public static void main(String[]args)throws InterruptedException{

    final int threads = Integer.parseInt(args[0]);
    final int length= Integer.parseInt(args[1]);
    final int balk=(length/threads);
    Thread[]th = new Thread[threads];

    final double[]result =new double[length];

    final double[]array1=getRandomArray(length);
    final double[]array2=getRandomArray(length);

    long startingTime =System.nanoTime();
    for(int i=0;i<threads;i++){
        final int current=i;
        th[i]=new Thread(()->{
           for(int k=current*balk;k<(current+1)*balk;k++){
               result[k]=array1[k]+array2[k];
           }
        });
        th[i].start();
    }
    for(int i=0;i<threads;i++){
        th[i].join();
    }
    System.out.println("Time needed: "+(System.nanoTime()-startingTime));


}

length is always a multiple of threads and getRandomArray() creates a random array of doubles between 0 and 1.

Execution Time for 1-Thread: 84579446ns
Execution Time for 2-Thread: 74211325ns
Execution Time for 4-Thread: 89215100ns
length =10000000

Here is the Code for getRandomArray():

    private static double[]getRandomArray(int length){
    Random random =new Random();
    double[]array= new double[length];
    for(int i=0;i<length;i++){
        array[i]=random.nextDouble();
    }
    return array;
}

I would appreciate any help.

I run a test with length=10000000 and cant see any speedup. See my edit below the code. — CheckersGuy, Jan 03 '17 at 16:09
That's no way to do a benchmark. See http://stackoverflow.com/q/504103/4125191 — RealSkeptic, Jan 03 '17 at 16:25

score 2 · Answer 1 · answered Jan 03 '17 at 16:35

The difference is observable for the following code. Try it.

public static void main(String[]args)throws InterruptedException{

    for(int z = 0; z < 10; z++) {

        final int threads = 1;
        final int length= 100_000_000;
        final int balk=(length/threads);
        Thread[]th = new Thread[threads];

        final boolean[]result =new boolean[length];

        final boolean[]array1=getRandomArray(length);
        final boolean[]array2=getRandomArray(length);

        long startingTime =System.nanoTime();
        for(int i=0;i<threads;i++){
            final int current=i;
            th[i]=new Thread(()->{
                for(int k=current*balk;k<(current+1)*balk;k++){
                    result[k]=array1[k] | array2[k];
                }
            });
            th[i].start();
        }
        for(int i=0;i<threads;i++){
            th[i].join();
        }

        System.out.println("Time needed: "+(System.nanoTime()-startingTime)*1.0/1000/1000);

        boolean x = false;
        for(boolean d : result) {
            x |= d;
        }
        System.out.println(x);
    }
}

First things first you need to warmup your code. This way you will measure compiled code. The first two iterations have the same(approximately) time but the next will differ. Also I changed double to boolean because my machine doesn't have much memory. This allows me to allocate a huge array and it also makes work more CPU consuming.

There is a link in comments. I suggest you to read it.

Thanks a lot. I run the test and could finally see a speedup :P I just got one more question. Is one reason we have to "warm-up" first that the JIT-Compiler optimizes as we execute code ? — CheckersGuy, Jan 03 '17 at 16:54
@CheckersGuy exactly. First you code is executed with help of interpreter. This is very slow. After some parts of your code will be executed throusands of times(in your case it's a loop where all work is done) it gets compiled. By default, compilation has a few levels(tiered compilation) so you need to call your code a lot till the last level will be applied. — Andrey Cheboksarov, Jan 03 '17 at 17:02

score 2 · Answer 2 · answered Jan 03 '17 at 18:10

Hi from my side if you are trying to see how your cores shares work you can make very simple task for all cores, but make them to work constantly on something not shared across different threads (basically to simulate for example merge sort, where threads are working on something complicated and use shared resources in a small amount of time). Using your code i did something like this. In such case you should see almost exactly 2x speed up and 4 times speed up.

public static void main(String[]args)throws InterruptedException{
        for(int a=0; a<5; a++) {
            final int threads = 2;
            final int length = 10;
            final int balk = (length / threads);
            Thread[] th = new Thread[threads];
            System.out.println(Runtime.getRuntime().availableProcessors());
            final double[] result = new double[length];

            final double[] array1 = getRandomArray(length);
            final double[] array2 = getRandomArray(length);

            long startingTime = System.nanoTime();
            for (int i = 0; i < threads; i++) {
                final int current = i;
                th[i] = new Thread(() -> {
                    Random random = new Random();
                    int meaningless = 0;
                    for (int k = current * balk; k < (current + 1) * balk; k++) {
                        result[k] = array1[k] + array2[k];
                        for (int j = 0; j < 10000000; j++) {
                            meaningless+=random.nextInt(10);
                        }
                    }
                });
                th[i].start();
            }
            for (int i = 0; i < threads; i++) {
                th[i].join();
            }
            System.out.println("Time needed: " + ((System.nanoTime() - startingTime) * 1.0) / 1000000000 + " s");

        }
    }

You see, in your code most time is consumed by building big table, and then threads are executing very fast, their work is so fast that your calculation of time is wrong because most of time is consumed by creating threads. When i invoked code which works on precalculated loop like this:

    long startingTime =System.nanoTime();
    for(int k=0; k<length; k++){
        result[k]=array1[k]|array2[k];
    }
    System.out.println("Time needed: "+(System.nanoTime()-startingTime));

It worked two times faster than your code with 2 threads. I hope that you understand what i mean in this case and will see my point when i gave my threads much more meaningless work.

Java Multithreaded vector addition

2 Answers2