Some background: I created a contrived example to demonstrate the use of VisualVM to my team. In particular, one method had an unnecessary synchronized keyword, and we could see threads in the thread pool blocking where they didn't need to. But removing that keyword had the surprising effect described below. The code below is the simplest case I could reduce the original example to that still reproduces the issue, and using a ReentrantLock produces the same effect.
Please consider the code below (full runnable example at https://gist.github.com/revbingo/4c035aa29d3c7b50ed8b - you need to add Commons Math 3.4.1 to the classpath). It submits 100 tasks to a thread pool of 5 threads. Each task creates two 500x500 matrices of random values and multiplies them.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.locks.ReentrantLock;

import org.apache.commons.math3.linear.MatrixUtils;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.random.JDKRandomGenerator;
import org.apache.commons.math3.random.StableRandomGenerator;
import org.apache.commons.math3.random.UncorrelatedRandomVectorGenerator;

public class Main {

    private static ExecutorService exec = Executors.newFixedThreadPool(5);
    private final static int MATRIX_SIZE = 500;
    private static UncorrelatedRandomVectorGenerator generator =
        new UncorrelatedRandomVectorGenerator(MATRIX_SIZE,
            new StableRandomGenerator(new JDKRandomGenerator(), 0.1d, 1.0d));
    private static ReentrantLock lock = new ReentrantLock();

    public static void main(String[] args) throws Exception {
        for (int i = 0; i < 100; i++) {
            exec.execute(new Runnable() {
                @Override
                public void run() {
                    double[][] matrixArrayA = new double[MATRIX_SIZE][MATRIX_SIZE];
                    double[][] matrixArrayB = new double[MATRIX_SIZE][MATRIX_SIZE];
                    for (int j = 0; j < MATRIX_SIZE; j++) {
                        matrixArrayA[j] = generator.nextVector();
                        matrixArrayB[j] = generator.nextVector();
                    }
                    RealMatrix matrixA = MatrixUtils.createRealMatrix(matrixArrayA);
                    RealMatrix matrixB = MatrixUtils.createRealMatrix(matrixArrayB);

                    lock.lock();
                    matrixA.multiply(matrixB);
                    lock.unlock();
                }
            });
        }
    }
}
The ReentrantLock is actually unnecessary. There is no shared state between the threads that needs synchronization. With the lock in place, we observe the threads in the thread pool blocking, as expected. With the lock removed, the blocking disappears and all threads run fully in parallel, again as expected.
The unexpected result of removing the lock is that the code consistently takes longer to complete, on my machine (a quad-core i7) by 15-25%. Profiling the code shows no indication of any blocking or waiting in the threads, and total CPU usage is only around 50%, spread fairly evenly across the cores.
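For reference, the shape of the timing comparison can be sketched with plain JDK code like this (a self-contained sketch: the Random-summing loop is only a stand-in for the matrix multiply so that Commons Math isn't needed, and the Timing/run names are illustrative, not from my actual code):

```java
import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class Timing {

    // Submits 100 tasks to a 5-thread pool and returns the wall-clock time
    // for all of them to finish. The task body is a CPU-bound stand-in for
    // the matrix work, optionally serialised by the lock.
    public static long run(boolean useLock) {
        ExecutorService exec = Executors.newFixedThreadPool(5);
        ReentrantLock lock = new ReentrantLock();
        long start = System.nanoTime();
        for (int i = 0; i < 100; i++) {
            exec.execute(() -> {
                double sum = 0;
                Random r = new Random(42);   // stand-in work, not the real multiply
                if (useLock) lock.lock();
                try {
                    for (int k = 0; k < 100_000; k++) sum += r.nextDouble();
                } finally {
                    if (useLock) lock.unlock();
                }
                if (sum < 0) System.out.println(sum); // keep the work observable
            });
        }
        exec.shutdown();
        try {
            exec.awaitTermination(5, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        System.out.printf("with lock:    %d ms%n", run(true));
        System.out.printf("without lock: %d ms%n", run(false));
    }
}
```

With the trivial stand-in task this harness behaves "normally" (the locked run is slower); the anomaly only shows up with the real matrix/generator workload.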
The second unexpected thing is that this is also dependent on the type of generator that is used. If I use a GaussianRandomGenerator or a UniformRandomGenerator instead of the StableRandomGenerator, the expected result is observed: the code runs faster (by around 10%) when the lock() is removed.
If the threads are not blocking, the CPU is at a reasonable level, and there is no I/O involved, how can this be explained? The only clue I really have is that the StableRandomGenerator does invoke a lot of trigonometric functions, so it is clearly much more CPU-intensive than the Gaussian or Uniform generators, but why then am I not seeing the CPU being maxed out?
EDIT: Another important point (thanks Joop): making the generator local to the Runnable (i.e. one per thread) restores the normal expected behaviour, where adding the lock slows the code by around 50%. So the key conditions for the odd behaviour are a) using a StableRandomGenerator, and b) having that generator shared between the threads. But to the best of my knowledge, that generator is thread-safe.
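The "one generator per thread" variant can be sketched with a ThreadLocal. In this sketch java.util.Random stands in for the Commons Math generator chain, since only the sharing pattern matters here (PerThread and nextVector are illustrative names):

```java
import java.util.Random;

public class PerThread {

    // One private generator per thread instead of one shared instance.
    // Random is a stand-in for the UncorrelatedRandomVectorGenerator chain.
    private static final ThreadLocal<Random> GEN =
        ThreadLocal.withInitial(Random::new);

    public static double[] nextVector(int size) {
        Random r = GEN.get();            // this thread's own generator
        double[] v = new double[size];
        for (int i = 0; i < size; i++) {
            v[i] = r.nextDouble();
        }
        return v;
    }
}
```

With this arrangement no state is shared between threads at all, and locking versus not locking behaves exactly as you'd expect.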
EDIT2: Whilst this question is superficially very similar to the linked duplicate, and the answer there is plausible and almost certainly a factor, I'm yet to be convinced it's quite as simple as that. Things that make me question it:
1) The problem only appears when synchronizing on the multiply() operation, which makes no calls to Random. My immediate thought was that the synchronization ends up staggering the threads to some extent, and therefore "accidentally" improves the performance of Random#next(). However, synchronizing on the calls to generator.nextVector() (which in theory has the same effect, in the "proper" place) does not reproduce the issue - synchronizing there slows the code down, as you might expect.
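To be concrete, the variant in point 1, i.e. locking around the generator calls instead of the multiply, looks roughly like this (again with java.util.Random standing in for the shared Commons Math generator; LockedGen and nextVector are illustrative names):

```java
import java.util.Random;
import java.util.concurrent.locks.ReentrantLock;

public class LockedGen {

    private static final Random GEN = new Random();          // shared generator
    private static final ReentrantLock LOCK = new ReentrantLock();

    // Serialise access to the shared generator itself, leaving the
    // (thread-confined) matrix multiply completely unsynchronised.
    public static double[] nextVector(int size) {
        double[] v = new double[size];
        LOCK.lock();
        try {
            for (int i = 0; i < size; i++) {
                v[i] = GEN.nextDouble();
            }
        } finally {
            LOCK.unlock();
        }
        return v;
    }
}
```

If contention on the shared generator were the whole story, this "proper" synchronization should reproduce the speed-up, but it doesn't.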
2) The problem is only observed with the StableRandomGenerator, even though the other implementations of NormalizedRandomGenerator also use the JDKRandomGenerator (which, as pointed out, is just a wrapper for java.util.Random). In fact, when I replaced the RandomVectorGenerator with direct calls to Random#nextDouble to fill in the matrices, behaviour again reverted to the expected result - synchronizing any part of the code causes the total throughput to drop.
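The direct-fill replacement mentioned above is simply this kind of thing (a sketch; randomMatrix is an illustrative name):

```java
import java.util.Random;

public class DirectFill {

    // Fill a matrix with direct calls to a shared java.util.Random,
    // bypassing the Commons Math generator chain entirely.
    public static double[][] randomMatrix(Random shared, int size) {
        double[][] m = new double[size][size];
        for (int i = 0; i < size; i++) {
            for (int j = 0; j < size; j++) {
                m[i][j] = shared.nextDouble();
            }
        }
        return m;
    }
}
```

Even though the Random instance is still shared between the pool threads, this version shows none of the odd behaviour.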
In summary, the issue is only observed when:
a) using the StableRandomGenerator - no other subclass of NormalizedRandomGenerator, nor the JDKRandomGenerator or java.util.Random used directly, shows the same behaviour; and
b) synchronizing the call to RealMatrix#multiply - the same behaviour is not observed when synchronizing the calls to the random generator.