
I created a really simple scenario in which I noticed some really strange behaviour that I can't understand.

Under the following link I created a sequential implementation: http://ideone.com/B8JYeA. Basically there are several big arrays of fixed size. The algorithm iterates through them and changes the values.

for(int i = 0; i < numberOfCells; i++) {
    h0[i] =  h0[i] + 1;
    h1[i] =  h1[i] + 1;
    h2[i] =  h2[i] + 1;
    h3[i] =  h3[i] + 1;
    h4[i] =  h4[i] + 1;
}

If I run it on my workstation it takes around 5 seconds.

I implemented the same algorithm in a parallel version, in which 8 threads run simultaneously. The code should be thread safe, and there is no dependency between the threads.

But still the code runs around 4 times slower on my workstation: http://ideone.com/yfwVmr

final int numberOfThreads = Runtime.getRuntime().availableProcessors();

ExecutorService exec = Executors.newFixedThreadPool(numberOfThreads);

for(int thread = 0; thread < numberOfThreads; thread++) {
    final int threadId = thread;
    exec.submit(new Runnable() {
        @Override
        public void run() {
            for(int i = threadId; i < numberOfCells; i += numberOfThreads) {
                h0[i] =  h0[i] + 1;
                h1[i] =  h1[i] + 1;
                h2[i] =  h2[i] + 1;
                h3[i] =  h3[i] + 1;
                h4[i] =  h4[i] + 1;
            }
        }
    });
}

exec.shutdown();

Does anyone have an idea why this happens?

Edit: This problem differs from other questions because the cause is probably a caching problem. How can I solve this caching problem?

RobinXSI
  • This was closed rather quickly. The other question is very unspecific, like "sometimes something is slower". *Here*, one could expect more interesting answers.... – Marco13 May 05 '15 at 16:35
  • Reopened the question as it is looking for something much more specific to the code in question. – Peter Lawrey May 05 '15 at 17:17

2 Answers


The biggest overhead is the time spent starting and stopping the threads. If I reduce the size of the array from 10000 to 10, it takes about the same amount of time, so the cost is dominated by the per-task overhead rather than the work itself.

If you keep the thread pool alive and divide the work so that each thread writes to its own local data set, it is 4x faster on my machine with 6 cores.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;


public class ParallelImplementationOptimised {
    static final int numberOfThreads = Runtime.getRuntime().availableProcessors();
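    // the thread pool is created once and reused by every update() call, so threads are not restarted each time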
    final ExecutorService exec = Executors.newFixedThreadPool(numberOfThreads);

    private int numberOfCells;

    public ParallelImplementationOptimised(int numberOfCells) {
        this.numberOfCells = numberOfCells;
    }

    public void update() throws ExecutionException, InterruptedException {

        List<Future<?>> futures = new ArrayList<>();
        for(int thread = 0; thread < numberOfThreads; thread++) {
            final int threadId = thread;
            futures.add(exec.submit(new Runnable() {
                @Override
                public void run() {
                    int num = numberOfCells / numberOfThreads;
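                    // thread-local arrays: each task writes only to data it allocated itself,
                    // so different threads never touch the same cache line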
                    double[] h0 = new double[num],
                            h1 = new double[num],
                            h2 = new double[num],
                            h3 = new double[num],
                            h4 = new double[num],
                            h5 = new double[num],
                            h6 = new double[num],
                            h7 = new double[num],
                            h8 = new double[num],
                            h9 = new double[num];
                    for (int i = 0; i < num; i++) {
                        h0[i] = h0[i] + 1;
                        h1[i] = h1[i] + 1;
                        h2[i] = h2[i] + 1;
                        h3[i] = h3[i] + 1;
                        h4[i] = h4[i] + 1;
                        h5[i] = h5[i] + 1;
                        h6[i] = h6[i] + 1;
                        h7[i] = h7[i] + 1;
                        h8[i] = h8[i] + 1;
                        h9[i] = h9[i] + 1;
                    }
                }
            }));
        }
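        // block until every submitted task has completed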
        for (Future<?> future : futures) {
            future.get();
        }
    }

    public static void main(String[] args) throws ExecutionException, InterruptedException {

        ParallelImplementationOptimised si = new ParallelImplementationOptimised(10);

        long start = System.currentTimeMillis();

        for (int i = 0; i < 10000; i++) {
            if(i % 1000 == 0) {
                System.out.println(i);
            }
            si.update();
        }

        long stop = System.currentTimeMillis();
        System.out.println("Time: " + (stop - start));
        si.exec.shutdown();
    }

}

SequentialImplementation 3.3 sec. ParallelImplementationOptimised 0.8 sec.


You appear to be writing to data on the same cache line from multiple threads. This means the writes have to go via the L3 cache, which takes about 20x longer than an access to the L1 cache. I suggest you try completely separate data structures that are at least 128 bytes apart, to be sure you are not touching the same cache line.

Note: even if you intend to completely overwrite a whole cache line, x64 CPUs will pull in the previous values of the cache line first.
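As a rough sketch of what "at least 128 bytes apart" could look like here (this is only an illustration of the idea, not code from the question; PAD, cellsPerThread and base are names I introduce for this example), each thread gets its own padded slice of one shared array, so slices from different threads never share a cache line:

final int PAD = 16; // 16 doubles = 128 bytes of padding between slices

final int cellsPerThread = numberOfCells / numberOfThreads;
final double[] h0 = new double[numberOfThreads * (cellsPerThread + PAD)];

for (int thread = 0; thread < numberOfThreads; thread++) {
    final int base = thread * (cellsPerThread + PAD); // start of this thread's padded slice
    exec.submit(new Runnable() {
        @Override
        public void run() {
            for (int i = 0; i < cellsPerThread; i++) {
                h0[base + i] = h0[base + i] + 1; // only this thread touches [base, base + cellsPerThread)
            }
        }
    });
}

The same layout would be applied to h1 to h4; the optimised code above goes one step further and gives each task completely separate arrays.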

Another question might be

Why isn't this 20x slower?

The CPU core that has grabbed the cache line might be running two threads via hyper-threading (so those two threads can access the data locally), and that core might go around the loop a few times before it loses the cache line to another core that is demanding it. This means the 20x penalty does not apply to every access or to every loop iteration, but it happens often enough that you end up with a much slower result.

Peter Lawrey
  • Could you provide some more explanation as to why each single(?) write to a cache line would have to pass through to the L3 before the next calculation? (Does this depend on whether or not `volatile` is used?) – JimmyB May 05 '15 at 16:35
  • @HannoBinder if every access did go via the L3 it would be 20x slower or more. `volatile` won't really help, as you can't have volatile elements of an array, only a volatile reference to an array. In theory the JIT could optimise the loop to just execute the last iteration, and use of volatile might prevent that, but in this case it doesn't eliminate the loop anyway. – Peter Lawrey May 05 '15 at 16:39
  • Apparently you are correct about only the *reference* being volatile. I'm not sure though if `volatile` wouldn't still yield memory barriers for each access to the array. So if comparing a (single-threaded) implementation without `volatile` against (a multi-threaded) one with (unnecessary) `volatile`, I figure it may make a difference. – JimmyB May 05 '15 at 16:48
  • @HannoBinder there is AtomicIntegerArray, which supports volatile access on every element. – Peter Lawrey May 05 '15 at 16:50
  • I wonder how one should *enforce* that data is on different cache lines. The JVM spec explicitly says (!) that it does not say anything (!) about the memory layout. A quick test, e.g. with 10 threads and `h[10][numCells]` ("definitely" being far apart) did not show any speedup. In contrast to that, maintaining the locality (as Hanno Binder suggested) *does* make it faster (though not as fast as a sequential implementation - which has many reasons, primarily the memory bottleneck, but I don't see how this could be avoided or alleviated by speculating about the caching...). – Marco13 May 05 '15 at 16:51
  • @Marco13 arrays in the JVM are contiguous in memory, so if you use int[0] and int[16] they will be 64 bytes apart. The problem is that you need to ensure the first access of one block is at least 64 bytes after the last one. – Peter Lawrey May 05 '15 at 16:55
  • @Marco13 the cache is the only shared resource. If you change the code to allocate an array in each thread it should be faster. – Peter Lawrey May 05 '15 at 16:56
  • @Peter Does your answer imply that the OP should try `h0_new[i] = h0[i] + 1` instead of `h0[i] = h0[i] + 1`? – JimmyB May 05 '15 at 17:02
  • That is, avoid writing to a cache line which already is in L1 anyway, and instead write to another cache line which will first be fetched from memory only to be completely overwritten? – JimmyB May 05 '15 at 17:09
  • @Marco13 you were right that the cache wasn't the main cause, it was in fact the time spent starting and stopping threads. See my updated answer. – Peter Lawrey May 05 '15 at 17:16
  • @HannoBinder see my updated answer where each thread has its own data. – Peter Lawrey May 05 '15 at 17:16
  • For me, the time of your current solution (which dramatically changes the structure, and thus may not even be applicable for the original problem) is somewhere between the slowest and the fastest parallel version that I tried during my random tests: It takes ~9000ms, and when changing the `numCells` to 10, it takes ~100ms, so I don't see such a strong influence of the thread creation either. This, in turn, might also depend on the JVM (version) and the OS. No offense, but... for me, this sounds like a lot of dubious guesses floating around here... – Marco13 May 05 '15 at 17:24

Not really an answer, but: First I'd try to maintain the locality of data access where possible:

final int numberOfCellsPerThread = numberOfCells / numberOfThreads;

public void run() {
    final int start = threadId * numberOfCellsPerThread;
    final int end = start + numberOfCellsPerThread;
    for(int i = start; i < end; i++) {
        h0[i] =  h0[i] + 1;
        h1[i] =  h1[i] + 1;
        h2[i] =  h2[i] + 1;
        h3[i] =  h3[i] + 1;
        h4[i] =  h4[i] + 1;
    }
}
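
One caveat (a hedged addition from me, not part of the original snippet): if numberOfCells is not an exact multiple of numberOfThreads, the integer division truncates and the last few cells are never updated. Letting the last thread run to the end of the arrays covers the remainder:

public void run() {
    final int start = threadId * numberOfCellsPerThread;
    // the last thread runs to numberOfCells, so the remainder of the
    // integer division is not silently skipped
    final int end = (threadId == numberOfThreads - 1)
            ? numberOfCells
            : start + numberOfCellsPerThread;
    for (int i = start; i < end; i++) {
        h0[i] = h0[i] + 1;
        // ... h1 to h4 updated in the same way
    }
}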

For more explanation about why locality matters, see for example "Why does cache locality matter for array performance?" or http://en.wikipedia.org/wiki/Locality_of_reference.

Basically, it's just about using data that's already in the cache where possible. Since the cache is limited in size, if a[i] is already in the cache, e.g. due to a previous read operation, the chance that a[i+1] is in the cache too is fairly high; at least higher than the chance for a[i+100], for example.

Besides, sequential reads from memory can potentially be optimized into bursts by the hardware, and are the easiest to predict by prefetching logic.

JimmyB
  • You want to make sure the data is on different cache lines. i.e. at least 64-128 bytes apart. – Peter Lawrey May 05 '15 at 16:29
  • I ran quite some tests now, slicing and dicing the data in various ways and passing it to the executor service in various ways, and from all my approaches, the simple, pragmatic solution of maintaining the locality (similar to what you suggested) basically was the fastest. I already upvoted it, but it could become a "real" answer if you shortly explained why locality is important, and maybe corrected/extended the code to be copy+paste-able (maybe as an MVCE, but at least pointing out that `numberOfCells` should actually be `numberOfCellsPerThread`, for example) – Marco13 May 05 '15 at 18:32
  • Thanks for taking the time to do the testing. - I really overlooked the `numberOfCellsPerThread` issue, thanks for pointing it out. – JimmyB May 06 '15 at 10:40
  • Now, one could add a hint to be careful about the case where `numberOfCells` is not a multiple of `numberOfThreads`, which would cause the last few cells to be skipped from the computation, but I think that the idea now becomes clear(er), and such details may be left to the implementer. – Marco13 May 06 '15 at 11:30