
I have a question for you about concurrent software in Java.

The main algorithm of my application has to factor a matrix A into LU factors: A = LU. The decompose method pasted below performs the Gauss-Jordan reduction. The software is designed to work on square matrices with A[0][0] != 0.

Unfortunately, for the algorithm to work correctly, I have to wait until every row has updated its values.

I tried to parallelize this algorithm using a barrier (to wait for every row to be updated and then increment the "rigaCorrente" value), but I don't get a real speedup, even though in the parallel version my processor (a 2.4 GHz Core 2 Duo P8600) runs at 80% of its capacity (instead of the 40-50% of the serial version).

My fear is that I'm running into a false sharing situation. Is the problem related to false sharing or to something else? I know that the JVM performs good optimizations, but it still uses only half the power of the processor. I don't think it's impossible to improve the algorithm!

My serial code:

    public void decompose() {
        int n = A.length;
        for (int k = 0; k < n - 1; k++) {
            for (int i = k + 1; i < n; i++) {
                A[i][k] = A[i][k] / A[k][k];
                for (int j = k + 1; j < n; j++) {
                    A[i][j] = A[i][j] - A[i][k] * A[k][j];
                }
            }
        }
        decomposed = true;
    }

My parallel code:

    public class Manager {
        private double[][] A;

        private Semaphore main = new Semaphore(0);
        private CyclicBarrier barriera;

        private AtomicInteger index;
        private int rigaCorrente = 0;
        private boolean stop = false;

        public Manager(final double A[][], final int numThr) {
            this.A = A;
            this.index = new AtomicInteger(1);

            barriera = new CyclicBarrier(numThr, new Runnable() {

                @Override
                public void run() {
                    rigaCorrente++;
                    index = new AtomicInteger(rigaCorrente + 1);

                    if (rigaCorrente == A.length - 1) {
                        setStop(true);
                        main.release();
                    }
                }
            });
        }

The thread class:

    public class Deco implements Runnable {

        private Manager manager;

        public Deco(Manager manager) {
            this.manager = manager;
        }

        @Override
        public void run() {
            double[][] A = manager.getA();

            while (manager.getStop() == false) {
                int i;
                while ((i = manager.getIndex().getAndIncrement()) < A.length) {
                    double pivot = A[i][manager.getRigaCorrente()] / A[manager.getRigaCorrente()][manager.getRigaCorrente()];

                    for (int k = manager.getRigaCorrente(); k < A.length; k++)
                        A[i][k] = A[i][k] - (pivot * A[manager.getRigaCorrente()][k]);

                    A[i][manager.getRigaCorrente()] = pivot;
                }

                manager.acquireBarriera();

            } // Stop
        }
    }

Main for parallel code:

    package luConcurrent.test;

    import java.util.Arrays;
    import java.util.Scanner;

    import lu.DecompositionLU;
    import lu.IO;

    public class Starter {
        private static IO io;
        private static DecompositionLU dec;

        public static void main(String[] args) throws Exception {
            io = new IO("//Users//black//Desktop//serie//2500.txt");
            int numThr = 2;

            double[][] A = io.readMatrixFromInputFile();
            double[] b = io.readArrayFromInputFile();
            double[] x;

            dec = new DecompositionLU(A);

            System.out.println("A.length: " + A.length);

            Manager manager = new Manager(A, numThr);
            Thread[] pool = new Thread[numThr];

            for (int i = 0; i < pool.length; i++) {
                pool[i] = new Thread(new Deco(manager));
            }

            long inizio = System.nanoTime();

            for (int i = 0; i < pool.length; i++) {
                pool[i].start();
            }

            manager.getMain().acquire();

            dec.ProgresiveSustitution(b);
            x = dec.RegresiveSustitution(b);

            long fine = System.nanoTime() - inizio;

            System.out.println("Solution is: " + Arrays.toString(x));
            Scanner sc = new Scanner(System.in);
            sc.nextLine();
            System.out.println("Tempo: " + fine);
            sc.close();
        }
    }

Results:

    1000x1000 Serial: 1154679000 nanoSec
    1000x1000 Parallel 2 Threads: 1135663000 nanoSec

    1750x1750 Serial: 7502559000 nanoSec
    1750x1750 Parallel 2 Threads: 6667129000 nanoSec

    4000x4000 Serial: 89851311000 nanoSec
    4000x4000 Parallel 2 Threads: 84348616000 nanoSec
BlacK
  • can you post more code, and maybe skip some empty lines? – Ralf H Dec 03 '13 at 13:38
  • are you having a large amount of L2 cache misses? If false sharing is the case, you could pass each thread its own copy of the array to completely avoid overwriting each core's L2 cache. – anguyen Dec 03 '13 at 14:15
  • I don't know if I have a large amount of misses, how should I check it? – BlacK Dec 03 '13 at 16:16
  • Since you're running on Intel hardware, you can probably use VTune to detect false sharing. See Intel's article [Avoiding and Identifying False Sharing Among Threads](http://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads) – Ted Hopp Dec 03 '13 at 16:21
  • A good example of why using English variable names is always a good idea. ;) – TwoThe Dec 04 '13 at 12:16
  • Yes you are right, I learnt the lesson :P – BlacK Dec 04 '13 at 14:29

3 Answers


I wouldn't jump to the conclusion that false sharing is occurring. The parallel version of the algorithm adds a bunch of overhead that is probably responsible for reducing its performance below what you're expecting.

The serial version simply has three nested loops: an outer loop over k, a middle loop over i, and an inner loop over j. All it does is array access and arithmetic, so this should be quite fast.

The parallel version runs its outer loop over rigaCorrente (current row, if I'm not mistaken) using a CyclicBarrier at each iteration. This adds overhead. A cyclic barrier causes early-arriving threads to wait until the last has arrived. Well, you have only two threads, so the one that arrives first must wait for the second. That's some dead time there. Even if the threads finish at about the same time, there is overhead performing the barrier synchronization. And then one thread must wait while the other runs the barrier action.

The middle loop is over index which is an AtomicInteger fetched by the getIndex method call. The method call adds overhead, and getAndIncrement adds some contention between the threads.

The inner loop (confusingly over k instead of j as in the serial version) has a method call to getRigaCorrente inside of it. Sometimes -- but only sometimes -- the JVM can inline method calls. I don't see the implementation of getRigaCorrente here, but since rigaCorrente is a private instance variable of manager that isn't volatile, and it's read and written by multiple threads, perhaps this method is synchronized. That would add even more overhead.

The problem here is that the threads have to interact with shared state quite a bit during the run. This adds overhead and contention. I'd suggest trying to find a way to partition the work among threads in advance, then tell each thread what work it needs to do, and then have them all run independently.
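
As a rough sketch of what I mean (the StaticDeco class name, the round-robin row split, and the exception handling are just illustrative choices, not taken from your code):

    import java.util.concurrent.BrokenBarrierException;
    import java.util.concurrent.CyclicBarrier;

    // Each thread owns the rows where (i % numThreads == id), so no shared
    // AtomicInteger is needed; threads only meet at the barrier once per
    // outer iteration k, and the barrier also provides the needed visibility.
    class StaticDeco implements Runnable {
        private final double[][] A;
        private final int id;
        private final int numThreads;
        private final CyclicBarrier barrier; // created with numThreads parties

        StaticDeco(double[][] A, int id, int numThreads, CyclicBarrier barrier) {
            this.A = A;
            this.id = id;
            this.numThreads = numThreads;
            this.barrier = barrier;
        }

        @Override
        public void run() {
            int n = A.length;
            try {
                for (int k = 0; k < n - 1; k++) {
                    for (int i = k + 1; i < n; i++) {
                        if (i % numThreads != id) continue; // not this thread's row
                        A[i][k] = A[i][k] / A[k][k];
                        for (int j = k + 1; j < n; j++) {
                            A[i][j] = A[i][j] - A[i][k] * A[k][j];
                        }
                    }
                    barrier.await(); // all rows of iteration k done before k+1 starts
                }
            } catch (InterruptedException | BrokenBarrierException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

There is still one barrier per outer iteration, but within an iteration each thread works on its own fixed set of rows without touching any shared counter.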

Stuart Marks
  • rigaCorrente is only read by the threads and written by the Runnable inside the CyclicBarrier; **numThr** threads call **manager.acquireBarriera()**. For the work partitioning, I tried that approach by creating threads with two int fields, "start" and "end". The real problem is that if the matrix (let's say of length 100) is decomposed from 1 to 50, all the threads whose "end" field is <= 50 are doing no work. As far as I know, I need to communicate the "rigaCorrente" field to them in some way. – BlacK Dec 04 '13 at 15:35
  • Even though rigaCorrente might only be read and written by threads at well-understood times, because of the Java memory model, one **must** either lock around all accesses to it (reads and writes) or make it volatile. The full answer is too detailed for a comment, but one way to think of it is that writes can be cached and won't necessarily be visible to other threads, unless locking or volatile is used. – Stuart Marks Dec 05 '13 at 00:41
  • Regarding splitting work among threads, this has to be done in an application-specific fashion. I don't know much about LU-decomposition, but it might be that the "obvious" way of splitting the matrix in half doesn't work, because, for example, work lower down in the matrix depends on work done above it in the matrix. Having threads contend over rigaCorrente may prevent such problems, at the cost of effectively having multiple threads run sequentially -- hence resulting in no speedup. I bet there are research papers in how to parallelize LU decomposition. – Stuart Marks Dec 05 '13 at 00:45
  • Shouldn't **getAndIncrement** be implemented by "teaching" the processor to do the load-op-store loop without being preempted? Is it really so heavy? – BlacK Dec 06 '13 at 21:54
  • The intent is that **getAndIncrement** is implemented by the JDK in terms of hardware instructions like compare-and-swap and that it be cheaper than lock-load-op-store-unlock that you'd otherwise have to write in Java. But it still does involve some interaction between the threads, which adds overhead. If two threads do this operation at the same time, one will have to wait (probably spin-waiting) for the other, which means it's not doing useful work. I don't know how much overhead this operation is in your application, though. – Stuart Marks Dec 07 '13 at 01:54
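
Roughly, **getAndIncrement** behaves like the compare-and-swap retry loop below (illustrative only; the actual JDK code goes through intrinsics, but the retry-on-contention shape is the same, which is where the cross-thread cost comes from):

    import java.util.concurrent.atomic.AtomicInteger;

    // Sketch of the semantics of AtomicInteger.getAndIncrement, not the JDK source.
    static int getAndIncrementLike(AtomicInteger counter) {
        while (true) {
            int current = counter.get();
            if (counter.compareAndSet(current, current + 1)) {
                return current; // a concurrent update makes the CAS fail and forces a retry
            }
        }
    }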

COMPLETE REWRITE OF ANSWER:

Here is my solution using the Fork/Join framework:

    public void decompose()
    {
        final Semaphore semaphore = new Semaphore(0);

        class Decompose extends RecursiveAction {
            private final int k;

            Decompose(int k) {
                this.k = k;
            }

            protected void compute() {
                final int n = A.length;
                for (int i = k + 1; i < n; i++) {
                    A[i][k] = A[i][k] / A[k][k];
                    for (int j = k + 1; j < n; j++) {
                        A[i][j] = A[i][j] - A[i][k] * A[k][j];
                    }
                }

                semaphore.release();
            }
        }

        ForkJoinPool mainPool = new ForkJoinPool();
        for (int k = 0; k < A.length - 1; k++) {
            mainPool.execute(new Decompose(k));
        }
        semaphore.acquireUninterruptibly(A.length - 1);
    }

In my case, these are the numbers I got for a 1000x1000 matrix:

    1000x1000 Serial   : 234351000 nanoSec
    1000x1000 Fork/Join:  61580000 nanoSec

This is on an iMac i7 2600.

brettw
  • ForkJoin isn't magically making things faster, actually rather the opposite. See [my analysis](http://stackoverflow.com/questions/20288379/analysis-performance-of-forkjoinpool) on SO. – TwoThe Dec 04 '13 at 12:17
  • I didn't say it would make it faster. I said it would likely be less complicated code-wise. BTW, [this](http://coopsoft.com/ar/CalamityArticle.html) is a much better analysis of all that is wrong with the Fork/Join framework in Java. – brettw Dec 04 '13 at 12:44
  • Thanks for the very interesting link. But then: why do you still recommend it? ;) – TwoThe Dec 04 '13 at 13:11
  • The code you posted doesn't work properly because the output is inconsistent (and it changes on every run). The problem is that you have to wait until every row has performed the operation before incrementing the **k** value; that's why I used the barrier. – BlacK Dec 04 '13 at 14:58
  • I'll update the code in the "morning" (it's after midnight here in Tokyo). Or you can take a stab at it with the Fork/Join framework based on the code above. – brettw Dec 04 '13 at 15:22
  • I still have problems with threads synch... Any hint for me? :) – BlacK Dec 05 '13 at 13:12

There are several things wrong with your time measurement. First of all, you include the start() calls in your measurement, which would be OK if you want to know the total time for the case where you do not reuse those threads, but then you must include the thread-creation overhead as well. If you plan on reusing the threads, you must leave the start calls outside of the time measurement, as starting a thread is much more than a simple boolean being set to true.

The other issue is that the serial version is much easier for the JVM to optimize, so what you measure here might already be the optimized serial version versus the probably not-yet-optimized threaded version. If you want to make sure that you get the optimized version in both cases, you need to warm up the JVM first by doing several dry runs.
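
A minimal sketch of that kind of measurement, assuming the serial DecompositionLU.decompose() from your question (the deepCopy helper and the warm-up/run counts are just illustrative):

    // Warm the JIT with a few untimed runs, then time only the decomposition itself.
    // For the parallel version, thread/pool creation would likewise stay outside the timed region.
    static void benchmark(double[][] original, int warmups, int runs) {
        for (int w = 0; w < warmups; w++) {
            new DecompositionLU(deepCopy(original)).decompose(); // dry run, result discarded
        }
        long total = 0;
        for (int r = 0; r < runs; r++) {
            DecompositionLU dec = new DecompositionLU(deepCopy(original));
            long start = System.nanoTime();
            dec.decompose();
            total += System.nanoTime() - start;
        }
        System.out.println("Average nanoSec: " + total / runs);
    }

    // Fresh copy of the input matrix for every run, so each run does the same work.
    static double[][] deepCopy(double[][] m) {
        double[][] c = new double[m.length][];
        for (int i = 0; i < m.length; i++) {
            c[i] = m[i].clone();
        }
        return c;
    }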

Then the manager.getRigaCorrente() call looks very suspicious. In your Manager class this is not defined, but an int with the same name is present. Whatever you get from there repeatedly will most likely either violate threading rules or - if you properly synchronized it - will slow things down tremendously, so there seems to be an algorithmic problem as well. Hard to tell if you do not give us the real code used.
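
For reference, the two standard ways to make a field like rigaCorrente safe to read from multiple threads look roughly like this inside the (not shown) Manager class; both are generic sketches of the technique, not your actual code. Either declare the field volatile:

    // The write done in the barrier action becomes visible to every
    // thread that reads the field afterwards.
    private volatile int rigaCorrente = 0;

    public int getRigaCorrente() {
        return rigaCorrente;
    }

or guard every read and write with the same lock:

    // Same visibility guarantee, plus mutual exclusion between readers and the writer.
    private int rigaCorrente = 0;

    public synchronized int getRigaCorrente() {
        return rigaCorrente;
    }

    public synchronized void incrementRigaCorrente() {
        rigaCorrente++;
    }

Either way, every access to the field has to go through the same mechanism, otherwise the Java memory model gives no guarantee about what a reader sees.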

TwoThe