4

I was just running some multithreaded code on a 4-core machine in the hope that it would be faster than on a single-core machine. Here's the idea: I have a fixed number of threads (in my case, one thread per core). Every thread executes a Runnable of the form:

private static int[] data; // data shared across all threads


public void run() {

    int i = 0;

    while (i++ < 5000) {

        // do some work
        for (int j = 0; j < 10000 / numberOfThreads; j++) {
            // each thread performs calculations and reads from and
            // writes to a different part of the data array
        }

        // wait for the other threads
        barrier.await();
    }
}

On a quad-core machine, this code performs worse with 4 threads than it does with 1 thread. Even with the CyclicBarrier's overhead, I would have thought that the code should run at least 2 times faster. Why does it run slower?

EDIT: Here's a busy-wait implementation I tried. Unfortunately, it makes the program run slower on more cores (also being discussed in a separate question here):

public void run() {

    // do work

    synchronized (this) {

        if (atomicInt.decrementAndGet() == 0) {

            atomicInt.set(numberOfOperations);

            for (int i = 0; i < threads.length; i++)
                threads[i].interrupt();
        }
    }

    while (!Thread.interrupted()) {}
}
ryyst
  • Can you tell us why you expect this to run faster on more cores? – Mat Jun 30 '11 at 21:09
  • It suggests to me that each thread is not running in a separate core, and that makes sense because you've not told it specifically to do this (and can't with standard Java 1.6). – Hovercraft Full Of Eels Jun 30 '11 at 21:10
    @Mat: With more threads, each `Runnable` sleeps shorter. Since they sleep concurrently, they should "wake up" faster. – ryyst Jun 30 '11 at 21:11
    @bestsss: If it's so trivial, why don't you explain it? – ryyst Jun 30 '11 at 21:15
  • Are you literally using a sleep, or is there some (possibly important) code that you're hiding? – bdonlan Jun 30 '11 at 21:16
  • @bdonlan: Just updated my question, because I _was_ hiding some code. – ryyst Jun 30 '11 at 21:17
  • @ryyst, the code you're hiding is important here. For example, if you hold a lock throughout the `// do some work` segment, you're only adding overhead, etc. – bdonlan Jun 30 '11 at 21:19
  • @ryyst, sure, sleepTime/numberOfThreads*numberOfThreads == sleepTime, div and mul are usually studied in the 2nd grade. – bestsss Jun 30 '11 at 21:19
  • @rysst: does the OS you run on guarantee somehow that your threads will be sleeping exactly for .025ms in the four-core case? Did you measure that? Is that test long enough to overcome thread startup overhead? – Mat Jun 30 '11 at 21:19
  • @bdonlan, it was pure sleep like that: `Thread.sleep(0, 100000 / numberOfThreads); barrier.await()` – bestsss Jun 30 '11 at 21:23
  • @Mat: Updated the question, I'm not actually sleeping the threads but performing calculations (although they do _not_ involve I/O, synchronization, locks, etc.). I believe the test is long enough to overcome thread startup overhead (~ 20 seconds). – ryyst Jun 30 '11 at 21:24
  • @ryyst, with threading performance issues it can be very important to see exactly what you're doing. If you can narrow it down to a small kernel that exhibits this problem, please do post it. – bdonlan Jun 30 '11 at 21:27
  • @bdonlan: Posted what I hope is essential. – ryyst Jun 30 '11 at 21:35
  • I must say, I still don't get the first comment by @bestsss. Either I'm really slow, or it's like [The Emperor's New Clothes](http://en.wikipedia.org/wiki/The_Emperor%27s_New_Clothes). – aioobe Jun 30 '11 at 21:58
  • @aioobe, the initial code was just sleep(1); await(); for 4 threads in a loop – bestsss Jun 30 '11 at 22:05
    Ah, okay. Then it was neither of the two, but rather not my type of humor. – aioobe Jun 30 '11 at 22:08
  • @ryyst maybe you should post your code. My guess is that you are doing something very wrong somewhere and you are not seeing it. For example, you might be computing the whole data set on each thread or you are accessing a synchronized object from all threads. – toto2 Jun 30 '11 at 22:28
  • @ryyst for debugging: print the time when each thread starts/ends the `for` block, and print also which range of elements it is processing. Look at the total time each thread needs. You might get thread times of 1t, 2t, 3t and 4t, which would mean they are waiting their turn for a lock. – toto2 Jun 30 '11 at 22:39
  • HyperThreaded Intel CPUs don't give a 1:1 speed increase, and I have seen degenerate cases with Java code that runs much slower when the # of threads == virtual cores vs actual cores. Example: my i3 runs some processes much faster using only 2 threads vs running 4 threads. –  Jul 01 '11 at 21:12

5 Answers

10

Adding more threads is not guaranteed to improve performance. There are a number of possible causes for decreased performance with additional threads:

  • Coarse-grained locking may overly serialize execution - that is, a lock may result in only one thread running at a time. You get all the overhead of multiple threads but none of the benefits. Try to reduce how long locks are held.
  • The same applies to overly frequent barriers and other synchronization structures. If the inner j loop completes quickly, you might spend most of your time in the barrier. Try to do more work between synchronization points.
  • If your code runs too quickly, there may be no time to migrate threads to other CPU cores. This usually isn't a problem unless you create a lot of very short-lived threads. Using thread pools, or simply giving each thread more work, can help. If your threads run for more than a second or so each, this is unlikely to be a problem.
  • If your threads are working on a lot of shared read/write data, cache line bouncing may decrease performance. That said, although this often results in performance degradation, this alone is unlikely to result in performance worse than the single threaded case. Try to make sure the data that each thread writes is separated from other threads' data by the size of a cache line (usually around 64 bytes). In particular, don't have output arrays laid out like [thread A, B, C, D, A, B, C, D ...]

Since you haven't shown your code, I can't really speak in any more detail here.
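
As a rough sketch of the cache-line point above, you can give each thread its own cache-line-sized slot instead of interleaving per-thread elements. The 64-byte line size, the stride of 16 ints, and the `PaddedSlots` name are illustrative assumptions, not code from the question:

```java
public class PaddedSlots {
    // Assumes a 64-byte cache line: 16 ints fill one line, so each thread's
    // slot starts on its own line and its writes don't invalidate neighbours'.
    static final int STRIDE = 16;
    final int[] data;

    PaddedSlots(int nThreads) {
        data = new int[nThreads * STRIDE];
    }

    void add(int thread, int value) {
        data[thread * STRIDE] += value; // thread t only ever touches its own line
    }

    int get(int thread) {
        return data[thread * STRIDE];
    }

    public static void main(String[] args) {
        PaddedSlots slots = new PaddedSlots(4);
        slots.add(2, 5);
        slots.add(2, 7);
        System.out.println(slots.get(2)); // 12
    }
}
```

The layout wastes a little memory per thread, but in exchange no two threads ever write to the same cache line.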

bdonlan
  • Thank you! I guess my problem is a combination of spending most of the time in the barrier and cache line bouncing. I'd love to show the code, but even though it's only basic computations, it's quite a lot of code... – ryyst Jun 30 '11 at 21:32
  • @bdonlan, false sharing combined w/ tight waiting on the cyclic barrier can have that result, i.e. the context switches could be more expensive than the work done (which is additionally worsened by the cache thrashing). Also, waiting like that on the barrier will impose extra switching latency (too high a value for the park). – bestsss Jun 30 '11 at 21:33
  • @ryyst, you can try busy waiting (seriously) instead of cyclic barrier. – bestsss Jun 30 '11 at 21:36
  • @bestsss: Do you have any suggestions on how to implement busy waiting? – ryyst Jun 30 '11 at 22:31
  • @bdonlan +1: very good answer! But what is **cache line bouncing**? – Suhail Gupta Jul 01 '11 at 03:43
    @Suhail, here is a link to wikipedia false sharing: http://en.wikipedia.org/wiki/False_sharing – bestsss Jul 01 '11 at 07:56
4

You're sleeping nano-seconds instead of milli-seconds.

I changed from

Thread.sleep(0, 100000 / numberOfThreads); // sleep 0.025 ms for 4 threads

to

Thread.sleep(100000 / numberOfThreads);

and got a speed-up proportional to the number of threads started just as expected.


I invented a CPU-intensive "countPrimes" method. Full test code is available here.

I get the following speed-up on my quad-core machine:

4 threads: 1625
1 thread: 3747

(The CPU-load monitor indeed shows that 4 cores are busy in the former case, and that 1 core is busy in the latter.)

Conclusion: You're doing comparatively small portions of work in each thread between synchronization points. The synchronization takes far more time than the actual CPU-intensive computation.

(Also, if you have memory-intensive code, such as tons of array accesses in the threads, the CPU won't be the bottleneck anyway, and you won't see any speed-up by splitting the work across multiple CPUs.)
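
The full test code is only linked above, but a CPU-bound kernel in the same spirit as countPrimes might look like this (a hypothetical reconstruction for illustration, not the code actually benchmarked):

```java
public class CountPrimes {
    // Deliberately naive trial division: pure CPU work with no shared state,
    // which is exactly what you want each thread chewing on between barriers.
    static int countPrimes(int limit) {
        int count = 0;
        for (int n = 2; n < limit; n++) {
            boolean prime = true;
            for (int d = 2; d * d <= n; d++) {
                if (n % d == 0) {
                    prime = false;
                    break;
                }
            }
            if (prime) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countPrimes(100)); // 25 primes below 100
    }
}
```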

aioobe
  • uhm, that's bad. Now I can't reproduce. – aioobe Jun 30 '11 at 21:23
  • I managed to reproduce using a "`countPrimes`" method. Answer updated. – aioobe Jun 30 '11 at 21:42
  • Your conclusion sounds right. What are your suggestions to fix this (bestsss mentions busy waiting)? – ryyst Jun 30 '11 at 21:47
  • *Thread.sleep(0, 100000 / numberOfThreads); // sleep 0.025 ms for 4 threads*: nanosecond sleeping, i.e. Thread.sleep(long, int), affects only the millis (adjusting them by at most 1); the code is something like `if (nanos >= 500000 || (nanos != 0 && millis == 0)) millis++;`. So it's 1 ms. – bestsss Jun 30 '11 at 21:49
  • @bestsss, 1) Please provide a reference to some documentation supporting that claim. 2) Even if it's 1 ms, my point is still perfectly valid: The time is really small compared to the time it takes for synchronization. – aioobe Jun 30 '11 at 21:54
  • @aioobe, that's the source code of Thread.sleep(long, int) since jdk1.0. No need for documentation, it works like that. As doc goes, `File.setLastModified` says *All platforms support file-modification times to the nearest second, but some provide more precision* FAT supports only even times (i.e. 0/2/4, etc), Sun/Oracle knows it but never fixed the doc. Side note: I never said your point about sleep is invalid... – bestsss Jun 30 '11 at 22:01
  • @ryyst, I don't know. Working with separate data *might* help at some level of caching, but I can't say for sure. For this type of task you might benefit from trying out languages such as Haskell or Erlang. – aioobe Jun 30 '11 at 22:01
  • @bestsss, sorry, I simply won't buy such an argument. Documentation says *[...] sleep (cease execution) for the specified number of milliseconds plus the specified number of nanoseconds [...]* and that is the contract of the method. Sure, on some machines it may sleep longer, and on others shorter. It will depend on implementation, hardware etc. I won't change my answer due to some JDK-specific implementation details. – aioobe Jun 30 '11 at 22:04
  • @aioobe, whatever the contract is, it has never been implemented that way; actually, in earlier versions of Java, sleep(0, 1) would have been the same as Thread.sleep(0), i.e. forever. – bestsss Jun 30 '11 at 22:08
  • @bestsss, I don't get the "*forever*" part, but never mind. I see your point, and you probably see mine. Let's leave it at that. – aioobe Jun 30 '11 at 22:09
2

The code inside the runnable does not actually do anything.
In your specific example of 4 threads, each thread will sleep for 2.5 seconds and wait for the others via the barrier.
So all that is happening is that each thread gets on the processor to increment i and then blocks in sleep, leaving the processor available.
I do not see why the scheduler would allocate each thread to a separate core, since all that is happening is that the threads mostly wait.
It is fair and reasonable to expect it to just use the same core and switch among threads.

UPDATE
I just saw that you updated your post saying that some work is happening in the loop. You do not say, though, what that work is.

Cratylus
  • I just realized that, so I updated my question. I actually _do_ perform work, so the OS should allocate processing power to each thread. – ryyst Jun 30 '11 at 21:19
  • @ryyst: But what do you do in the for loop? If you do not post the actual code, I do not see how we can help you with the analysis. – Cratylus Jun 30 '11 at 21:21
  • @user: Basic calculations, such as adding / multiplying `int`s. No locks, no synchronization, no I/O. – ryyst Jun 30 '11 at 21:21
  • Still, just describing does not help. Basic calculations like int addition can be done really fast, and the actual processing may still be so little that there is no need to use another core. – Cratylus Jun 30 '11 at 21:24
  • @ryyst, show the code; you might have false sharing. Moreover, waiting on the barrier inside the loop is quite expensive; move the barrier outside the loop. – bestsss Jun 30 '11 at 21:29
  • "*will sleep for 2.5 seconds*" -- No, the original sleep-code slept for 0.025 ms. See my answer. – aioobe Jun 30 '11 at 21:45
  • @ryyst, if each thread works on a different area and the array is huge, with the calcs almost non-existent, most likely the application is just memory-bound. Adding more cores won't help, since the cores spend all their time in cache misses anyway. Add the extra overhead of the cyclic barrier (which is a Lock) and you are done. – bestsss Jun 30 '11 at 21:56
2

Synchronizing across cores is much slower than syncing on a single core, because on a single-core machine the JVM doesn't flush the cache (a very slow operation) during each sync.

Check out this blog post.
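
As a toy illustration of that cost (not the linked post's benchmark), you can time contended `synchronized` increments. The `SyncCost` name and the iteration count are made up; absolute numbers vary wildly by machine, so only correctness is deterministic here:

```java
public class SyncCost {
    static final Object lock = new Object();
    static long counter = 0;

    // Each pass takes the lock once; under contention every acquisition may
    // force cache traffic between cores, which is the cost described above.
    static void hammer(int iterations) {
        for (int i = 0; i < iterations; i++) {
            synchronized (lock) {
                counter++;
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        final int iters = 1000000;
        counter = 0;
        long t0 = System.nanoTime();
        Thread a = new Thread(() -> hammer(iters));
        Thread b = new Thread(() -> hammer(iters));
        a.start();
        b.start();
        a.join();
        b.join();
        long elapsed = System.nanoTime() - t0;
        // The counter is deterministic; the elapsed time is not.
        System.out.println("counter=" + counter + ", ns=" + elapsed);
    }
}
```

Running the same total work on one thread and comparing the timings gives a rough feel for the per-sync overhead on your particular machine.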

ratchet freak
1

Here is an untested SpinBarrier, but it should work.

Check whether it improves your case. Since you run the code in a loop, the extra synchronization only hurts performance if you have cores sitting idle. Btw, I still believe you have a bug in the calculation or a memory-intensive operation. Can you tell us what CPU + OS you use?

Edit: I had left the version counter out of the first posting.

import java.util.concurrent.atomic.AtomicInteger;

public class SpinBarrier {
    final int permits;
    final AtomicInteger count;
    final AtomicInteger version;
    public SpinBarrier(int count) {
        this.count = new AtomicInteger(count);
        this.permits = count;
        this.version = new AtomicInteger();
    }

    public void await() {
        // Read the generation before arriving; otherwise a fast winner can
        // advance it first and leave this thread spinning forever.
        int v = version.get();
        for (int c = count.decrementAndGet(); c != 0 && v == version.get(); c = count.get()) {
            spinWait();
        }
        if (count.compareAndSet(0, permits)) { // only one succeeds here; the rest lose the CAS
            version.incrementAndGet();
        }
    }

    protected void spinWait() {
    }
}
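
For what it's worth, here is a hedged usage sketch of the barrier above: four threads do a fixed number of rounds separated by the spin barrier, and a shared counter checks that every round ran. The barrier logic is repeated inline so the snippet compiles on its own; `SpinBarrierDemo` is a made-up name:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SpinBarrierDemo {
    // Same barrier as above, repeated here so this snippet is self-contained.
    static class SpinBarrier {
        final int permits;
        final AtomicInteger count;
        final AtomicInteger version = new AtomicInteger();

        SpinBarrier(int permits) {
            this.permits = permits;
            this.count = new AtomicInteger(permits);
        }

        void await() {
            int v = version.get(); // read the generation before arriving
            for (int c = count.decrementAndGet(); c != 0 && v == version.get(); c = count.get()) {
                // busy wait
            }
            if (count.compareAndSet(0, permits)) { // exactly one thread wins this CAS
                version.incrementAndGet();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        final int nThreads = 4, rounds = 100;
        final SpinBarrier barrier = new SpinBarrier(nThreads);
        final AtomicInteger work = new AtomicInteger();
        Thread[] threads = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < rounds; i++) {
                    work.incrementAndGet(); // stand-in for the real per-thread work
                    barrier.await();
                }
            });
            threads[t].start();
        }
        for (Thread th : threads) {
            th.join();
        }
        System.out.println(work.get()); // 4 threads * 100 rounds = 400
    }
}
```

As discussed in the comments, this only makes sense with at most as many threads as cores; with more, the spinners steal CPU from the threads doing work.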
bestsss
  • I don't have the time to find out why right now, but this does not work (I tried with 4 threads). Thanks though! – ryyst Jul 01 '11 at 10:23
  • @ryyst, I figured I had dropped the code without the version counter, so when the 1st thread wins the CAS, any other waiters see the count reverted back to permits and spin forever. – bestsss Jul 01 '11 at 10:48
  • Still doesn't work for me... All CPUs are @ 100% but nothing happens. I also made an update to my question. The busy wait mechanism works but runs slower if you put more threads to it. – ryyst Jul 01 '11 at 11:04
  • @ryyst, busy waiting works only if you have threads fewer than or equal to the cores. Are you sure you create the SpinBarrier w/ the same number of threads you intend to start? Also, you can't use synchronized in busy waiting; it defeats the whole idea of busy waiting. – bestsss Jul 01 '11 at 11:17
  • I'm on a quadcore and your updated code doesn't work with 2 threads (it does with 1). Also, in my code I'm only using synchronized to decrement the counter, I busy wait outside the synchronized part, right? – ryyst Jul 01 '11 at 20:21