Why my parallel program is slower than sequential program? I don't see process creation overhead in it

Question

My sequential java program is faster than than parallel version. Below is the code of my parallel version. Sequential version is exact same code without a Thread. I don't see that much of a process creation overhead, I am processing 2 million calculation in a thread and I have only 2 threads. But its sequential version is twice much faster than this parallel version.

In the program, I am trying to merge two sorted arrays into one large sorted array, concept from Merge Sort in parallel.

import java.util.Date;

public class MergePP {

    public static final int n = 2000000;


    public static void main(String[] args) {
        Integer[] X = new Integer[n];
        Integer[] Y = new Integer[n];
        Integer[] Z = new Integer[X.length + Y.length];

        /* input sorted lists X and Y */
        for (int i = 0; i < n; i++) {
            X[i] = i * 2;
        }

        for (int i = 0; i < n; i++) {
            Y[i] = 1 + i * 2;
        }

        Date t = new Date();

        Thread threadX = new Thread(() -> {
            int j;
            for (int i = 0; i < X.length; i++) {
                // find the position of x[i] and add to Z
                j = searchY(Y, X[i]);
                Z[i + j] = X[i];
            }
        });

        Thread threadY = new Thread(() -> {
            int m;
            for (int k = 0; k < Y.length; k++) {
                // find the position of Y[k] and add to Z
                m = searchX(X, Y[k]);
                Z[k + m] = Y[k];
            }
        });

        threadX.start();
        threadY.start();

        // wait for thread to complete
        try {
            threadX.join();
            threadY.join();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        Date s = new Date();
        System.out.print("Elapsed Time: ");
        System.out.println(s.getTime() - t.getTime());
    }

    private static Integer searchY(Integer[] arr, int k) {

        int startIndex = 0;

        int endIndex = arr.length - 1;

        while (startIndex <= endIndex) {

            int midIndex = (startIndex + endIndex) / 2;

            if (arr[midIndex] == k) {
                // treat it like a k > arr[midIndex]
                return midIndex + 1;
            } else if (k < arr[midIndex]) {
                endIndex = midIndex - 1;
            } else if (k > arr[midIndex]) {
                startIndex = midIndex + 1;
            }
        }

        return endIndex + 1;
    }

    private static Integer searchX(Integer[] arr, int k) {

        int startIndex = 0;

        int endIndex = arr.length - 1;

        while (startIndex <= endIndex) {

            int midIndex = (startIndex + endIndex) / 2;

            if (arr[midIndex] == k) {
                // treat it like a k < arr[midIndex]
                return midIndex - 1;
            } else if (k < arr[midIndex]) {
                endIndex = midIndex - 1;
            } else if (k > arr[midIndex]) {
                startIndex = midIndex + 1;
            }
        }

        return endIndex + 1;
    }

}

Note: I am using Mac Book Pro with quad core processor.

because there is overhead in context switching that is greater than the work that needs to be done as the duplicate explains — , May 16 '17 at 06:49
BTW Using `Integer` instead of `int` can use 5x as much memory and make more difference than using multiple threads. — Peter Lawrey, May 16 '17 at 06:53
@JarrodRoberson I am processing 2 million records in each thread. I am only creating 2 thread, how there is context switching overhead? — User45, May 16 '17 at 06:54
@PeterLawrey Can be, but my sequential program also has Integer and my relative speed of parallel program almost twice slower than parallel. — User45, May 16 '17 at 06:56
@JarrodRoberson In a multicore processor, I don't think we switch context anymore. — User45, May 16 '17 at 06:57
The overhead of using multiple threads often exeeds the benefit, esp if you pass lots of data between threads. In this case you have three threads, the one which created the data where everything is local and two doing the processing of objects are are much larger than they need to be. If you look at how IntStream or similar work, it tries to minimise the number of addition threads (by using the current one for some of the work) and also support using `int` instead of Integer. — Peter Lawrey, May 16 '17 at 07:00
If you do not think there is context switching anymore you do not need to comment on concurrent programming then because your comprehension of the subject is incomplete. Research what the scheduler does with moving the threads from core to core by default. — , May 16 '17 at 07:09
@JarrodRoberson So do you have an idea on how to make that parallel program faster? What is your suggestion for a solution? — User45, May 16 '17 at 07:15
@Jarrod I seriously doubt that context switching is even remotely responsible for the performance problems here. That code looks excessively limited by the IO subsystem. Particularly I'm guessing there's a ton of false sharing involved on Z which is the real performance killer. — Voo, May 16 '17 at 09:56
@Voo - there is no I/O in the program except the final printout of the elapsed time and there is no `synchronized` keywords so there would be no blocking/contention on anything that is *shared*. this runs faster when sequential because of the lack of context switching and cache invalidations caused by the threads getting moved from one core to the other needlessly. This runs in 351ms on my laptop. Just creating the threads and cleaning up after them is probably > 50% of the elapsed time. — , May 16 '17 at 14:09
@Jarrod Ahem, yeah memory subsystem not IO - weird typo on my part. But yes cache invalidations due to false sharing when writing to Z are the most likely reason as I was saying. There should be very limited context switching going on (and modern OSes are very good at handling that) and creating 2 threads under Linux or Windows should be in the ms range (and I hope Mac OS isn't that much worse than either of those) so that's very unlikely to have anything to do with it. — Voo, May 16 '17 at 14:35
also, you should use `System.nanoTime()` for microbenchmarks like this, but the duplicate explains how to do it correctly. — , May 16 '17 at 14:37

Why my parallel program is slower than sequential program? I don't see process creation overhead in it

0 Answers0