
Problem

I have a piece of Java code (JDK 1.6.0_22, if relevant) that implements a stateless, side-effect-free function with no mutexes. It does, however, use a lot of memory (I don't know if that is relevant).

In the past I have visited Sun Laboratories and gathered the standard "performance vs number of threads" curve. As this function has no mutexes, it produced a nice curve, although garbage collection kicked in as the number of threads increased. After some garbage collection tuning I was able to make this curve almost flat.

I am now doing the same experiment on Intel hardware. The hardware has 4 CPUs, each with 8 cores, and hyperthreading. This gives 64 availableProcessors(). Unfortunately the "performance vs number of threads" curve scales nicely for 1, 2 and 3 threads, and then caps at 3 threads. After 3 threads I can put as many threads as I want on the task, and the performance gets no better.

Attempts to fix the Problem

My first thought was that I had been stupid and introduced some synchronised code somewhere. Normally, to resolve this issue, I run JConsole or JVisualVM and look at the thread stack traces. If I had 64 threads running at the speed of 3, I would expect 61 of them to be sitting waiting to enter a mutex. I didn't find this. Instead I found all the threads running: just very slowly.
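
For what it's worth, here is roughly how I take a quick census of thread states from inside the process when JConsole isn't handy (a sketch rather than my actual harness; the class name is made up):

    import java.lang.Thread.State;
    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;
    import java.util.EnumMap;
    import java.util.Map;

    public class ThreadStateSnapshot {
        public static void main(String[] args) {
            ThreadMXBean mx = ManagementFactory.getThreadMXBean();
            // Dump every live thread, including locked monitors and synchronizers.
            ThreadInfo[] infos = mx.dumpAllThreads(true, true);

            // Count threads per state: many BLOCKED threads point at a mutex,
            // many RUNNABLE threads at a shared hardware resource or spin loop.
            Map<State, Integer> counts = new EnumMap<State, Integer>(State.class);
            for (ThreadInfo info : infos) {
                State s = info.getThreadState();
                Integer n = counts.get(s);
                counts.put(s, n == null ? 1 : n + 1);
            }
            System.out.println(counts);
        }
    }

When I look at this kind of output, almost everything is RUNNABLE rather than BLOCKED, which is why I doubt a mutex is the culprit.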

A second thought was that perhaps the timing framework was introducing problems. I replaced my function with a dummy function that just counts to a billion using an AtomicLong. This scaled beautifully with the number of threads: I was able to count to a billion 10,000 times, 64 times quicker with 64 threads than with 1 thread.
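
For reference, the dummy benchmark was essentially of this shape (reconstructed here for illustration, so the exact harness and numbers are not mine):

    import java.util.concurrent.atomic.AtomicLong;

    public class CountingScalingTest {
        public static void main(String[] args) throws InterruptedException {
            int threads = Integer.parseInt(args[0]);
            final long target = 1000000000L; // one billion
            Thread[] workers = new Thread[threads];
            long start = System.nanoTime();
            for (int i = 0; i < threads; i++) {
                workers[i] = new Thread(new Runnable() {
                    public void run() {
                        // Each thread counts on its own AtomicLong, so there is
                        // no shared state and no contention between threads.
                        AtomicLong counter = new AtomicLong();
                        while (counter.incrementAndGet() < target) {
                            // busy counting
                        }
                    }
                });
                workers[i].start();
            }
            for (Thread t : workers) {
                t.join();
            }
            System.out.printf("%d threads took %.1f ms%n",
                    threads, (System.nanoTime() - start) / 1e6);
        }
    }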

I thought (desperation kicking in) that perhaps garbage collection was taking a really, really long time, so I tweaked the garbage collection parameters. While this improved my latency variation, it had no effect on throughput: I still have 64 threads running at the speed I expect 3 to run at.

I have downloaded the Intel tool VTune, but my skill with it is weak: it is a complex tool and I don't understand it yet. I have the instruction book on order: a fun Christmas present to myself, but that is a little too late to help my current problem.

Question

  1. What tools (mental or software) could I use to improve my understanding of what is going on?
  2. What mechanisms other than mutexes or garbage collection could be slowing my code down?
Stave Escura
  • An obvious guess is that if one function call uses much memory, then four calls use four times as much, hitting a limit where this becomes a problem. – Marko Topolnik Dec 20 '12 at 09:41
  • Thanks for the answer Marko. I have put a sysout for memory available, and it stands at 800M free fairly consistently. I don't seem to have a memory leak. I wondered if I might be locking "getting memory" but I have never seen this in the past, or read anything about it – Stave Escura Dec 20 '12 at 09:42
  • The JVM observing available heap is not at all a guarantee that it isn't hitting a performance threshold, like hitting the swap file. CPU-to-memory bandwidth also comes into play. – Marko Topolnik Dec 20 '12 at 09:50

3 Answers


I have a piece of Java code (JDK 1.6.0_22 if relevant)

There have been quite a lot of performance improvements since then. I would try Java 6 update 37 or Java 7 update 10.

It does however use a lot of memory

This can mean that the way you access your data matters. Accessing data in main memory can be 20+ times slower than accessing your primary cache. This means you have to access data conservatively and make the most of each piece of new data you access.
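
As a rough illustration of the gap (the array size and loops are arbitrary, and you need a heap of at least ~512 MB to run it):

    import java.util.Random;

    public class AccessPatternDemo {
        public static void main(String[] args) {
            int n = 64 * 1024 * 1024;   // 256 MB of ints, far larger than any cache
            int[] data = new int[n];

            long t0 = System.nanoTime();
            long sum = 0;
            for (int i = 0; i < n; i++) {
                sum += data[i];                // sequential: prefetch friendly
            }
            long sequentialNs = System.nanoTime() - t0;

            Random rnd = new Random(42);
            t0 = System.nanoTime();
            for (int i = 0; i < n; i++) {
                sum += data[rnd.nextInt(n)];   // random: mostly cache misses
            }
            long randomNs = System.nanoTime() - t0;

            System.out.println("sequential " + sequentialNs / 1000000 + " ms, "
                    + "random " + randomNs / 1000000 + " ms, sum=" + sum);
        }
    }

On a data set much larger than the outer cache, the random walk is typically several times slower, because almost every access is a round trip to main memory (part of the difference here is also the cost of generating the random indexes).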

After 3 threads I can put as many threads as I want on the task, and the performance gets no better ... Instead I found all the threads running: just very slowly.

This suggests you are using some resource to its maximum. The most likely resource to be maxed out, given the amount of memory you are using, is the CPU-to-main-memory bridge. I suspect you have one bridge for 64 threads! This means you should consider approaches which might use more CPU but improve the way you access memory (less randomly, more sequentially), and reduce the volumes when you do (use more compact types where possible). E.g. I have a "short with two decimal places" type instead of float, which can halve the memory used.
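
A minimal sketch of what such a compact type can look like (the name and scale factor are illustrative, not my actual class):

    // Fixed-point wrapper: a value with two decimal places held in a short
    // (2 bytes) instead of a float (4 bytes), halving the memory footprint.
    public final class ShortDecimal {
        private final short scaled; // value * 100, range about -327.68 .. 327.67

        public ShortDecimal(double value) {
            this.scaled = (short) Math.round(value * 100);
        }

        public double doubleValue() {
            return scaled / 100.0;
        }

        // For bulk data you would store the raw short values in arrays rather
        // than wrapping each one, otherwise the object header eats the saving.
        public static short encode(double value) {
            return (short) Math.round(value * 100);
        }

        public static double decode(short scaled) {
            return scaled / 100.0;
        }
    }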

As you observed, when each thread is updating its own private AtomicLong you got linear scalability. This won't use the CPU-to-main-memory bridge at all.


From @Marko

Peter, do you have an idea how these multicore architectures work with memory, anyway?

Not as much as I would like, as this issue is not visible to Java.

Does each core have an independent channel?

Each core has an independent channel to its primary cache. For the outer cache there might be a channel for each core, or one per 2-6 cache areas, but you will see a high number of collisions under heavy load.

For the bridge to the main memory, there is one very wide channel. This favours long sequential accesses but is very poor for random accesses. A single thread can max this out with random reads (random enough that they don't fit in the outer cache).

Or at least independent as long as there are no collisions?

Once you exhaust the primary cache (L1, typically 32 KB), it's collisions all the way.

Because otherwise scaling is a great issue.

As the OP demonstrates. Most applications either a) spend a significant portion of their time waiting for IO, or b) do a lot of computation on small batches of data. Doing a lot of computation across large amounts of data is the worst-case scenario.

The way I deal with this is to arrange my data structures in memory for sequential access. I use off-heap memory, which is a pain, but it gives you full control of the layout. (My source data is memory-mapped for persistence.) I stream the data in with sequential accesses and try to make the most of that data (i.e. I minimise repeated accesses). Even then, with 16 cores it is hard to assume all of them will be used efficiently, as I have 40 GB of source data I am working on at any one time and about 80 GB of derived data.
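
As a rough sketch of that style (the file name and record layout are invented for the example):

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class MappedScan {
        public static void main(String[] args) throws Exception {
            RandomAccessFile file = new RandomAccessFile("source.dat", "r");
            FileChannel channel = file.getChannel();
            // Map up to the first 1 GB; a bigger data set needs several mappings
            // because a single MappedByteBuffer is limited to 2 GB.
            long size = Math.min(channel.size(), 1L << 30);
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);

            long checksum = 0;
            while (buffer.remaining() >= 8) {
                // Stream through the mapping once, front to back, so the hardware
                // prefetcher and the one wide memory channel are used efficiently.
                checksum += buffer.getLong();
            }
            System.out.println("checksum " + checksum);
            channel.close();
            file.close();
        }
    }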

Note: high-end GPUs address this problem by having incredibly high memory bandwidth. The top-end processor can get 250 GB/second, whereas a typical CPU is about 4-6 GB/second. Even so, they are better suited to vectorised processing, and their quoted peak performance is likely to involve little memory access, e.g. Mandelbrot sets.

http://www.nvidia.com/object/tesla-servers.html

Peter Lawrey
  • Peter, do you have an idea how these multicore architectures work with memory, anyway? Does each core have an independent channel? Or at least independent as long as there are no collisions? Because otherwise scaling is a great issue. – Marko Topolnik Dec 20 '12 at 09:47
  • Thanks Peter. The function is a pure function, and while it needs some reference data from main memory, that data is immutable. The memory it uses is small enough to fit inside the L1 cache. The function only takes 20..40us in normal use, but uses 10s of K of memory during this time. My skill with VTune isn't good, but it seems to support this statement. I will try and see if I can get more data about whether this is the bottleneck. I'll try binding a JVM to a CPU, and see if it caps at one-thread performance. This would support other theories too... but it should be an interesting result – Stave Escura Dec 20 '12 at 09:56
  • The code is targeted for a customer that runs 1.6, but I think I will try the other JVMs anyway just for comparison. – Stave Escura Dec 20 '12 at 09:57
  • Much of the performance enhancements in Java 7 were back ported in Java 6 update 35+ I only suggest Java 7 as Java 6 is almost End of Free Support. – Peter Lawrey Dec 20 '12 at 10:04
  • If you think about it this way: you have one shared main memory bus which is 20x slower than each L1 cache, or the native clock speed. If you have 64 threads trying to use it at once, you need only about 1/64/20 of the time accessing the main memory to fill it up, or about 0.08% of the time. – Peter Lawrey Dec 20 '12 at 10:06
  • Why does the Tesla GPU have 250 GB/second bandwidth? Because it has 2688 cores! Imagine how hard it is to keep them all busy, especially as GPUs have to run the same code on every core. (You can't have different cores running different things like a CPU can.) – Peter Lawrey Dec 20 '12 at 10:23
  • The interesting thing about the memory bus is that I have 4 cpus, each with 8 cores. I am limited to scalability of 4...could that be each CPU is maxing out its memory bus. Even more interesting: on different hardware I wasn't limited. My problem is though that the VTune parameters indicate that most of the memory access is via the L1 cache. However I may still be having learning issues with how to use the tool. Can you suggest any ways in which I can test the theory that the memory bus is my bottleneck? – Stave Escura Dec 20 '12 at 11:28
  • @MarkoTopolnik If you have some spare time and have not read [this document](http://stackoverflow.com/questions/8126311/what-every-programmer-should-know-about-memory) yet, you might find it interesting. – assylias Dec 20 '12 at 12:06
  • @assylias Thanks, a great resource! My question is answered almost immediately: if NUMA is properly leveraged by carefully binding Thread-Local Allocation Buffers to the appropriate memory bank, and of course backing this with thread affinity, there will be no contention between CPUs for memory bandwidth in a case such as the OP's (memory used only by a single method call). – Marko Topolnik Dec 20 '12 at 12:18
  • Very nice article. I won't be able to do more experiments for a few hours, but I appreciate the assistance. I'm hoping to get some more VTune data, but currently it looks from the VTune stats that all the memory is L1 cache, and not swamping the bus. Something is blocking me though, and it still doesn't look to be mutexs. – Stave Escura Dec 20 '12 at 12:47
  • @assylias Ha ha, I went to save the PDF, only to find it already there in my Tech Docs folder! Date: 11/29/2007 :) – Marko Topolnik Dec 20 '12 at 12:47

Well, many experiments later, I discovered that the JVM makes no difference, but I also discovered the power of JDump. 50 of the 64 threads were at the following line:

java.lang.Thread.State: RUNNABLE
    at java.util.Random.next(Random.java:189)
    at java.util.Random.nextInt(Random.java:239)
    at sun.misc.Hashing.randomHashSeed(Hashing.java:254)
    at java.util.HashMap.<init>(HashMap.java:255)
    at java.util.HashMap.<init>(HashMap.java:297)

Random.next looks like this

 protected int next(int bits) {
    long oldseed, nextseed;
    AtomicLong seed = this.seed;
    do {
        oldseed = seed.get();
        nextseed = (oldseed * multiplier + addend) & mask;
    } while (!seed.compareAndSet(oldseed, nextseed));
    return (int)(nextseed >>> (48 - bits));
 }

Most interesting is that this isn't an obvious lock, so the tools I was using to spot mutexes weren't working.

So it looks as though any creation of Java HashMaps causes applications to stop being scalable (I exaggerate, but not by much). My application makes heavy use of HashMaps, so I guess I either rewrite HashMap or rewrite the application.

I'm raising a separate question to see how to deal with this.
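
In the meantime, one workaround I'm evaluating (just a sketch, not what I've settled on) is to stop constructing HashMaps in the hot path and reuse a per-thread scratch map instead, so the contended seed generation happens once per thread rather than once per call:

    import java.util.HashMap;
    import java.util.Map;

    public class PerThreadScratchMap {
        // Each thread builds its HashMap once; clear() does not go back through
        // the contended random-seed path that new HashMap() does.
        private static final ThreadLocal<Map<String, Object>> SCRATCH =
                new ThreadLocal<Map<String, Object>>() {
                    @Override
                    protected Map<String, Object> initialValue() {
                        return new HashMap<String, Object>();
                    }
                };

        public static Map<String, Object> scratch() {
            Map<String, Object> map = SCRATCH.get();
            map.clear(); // reuse instead of reallocating on every call
            return map;
        }
    }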

Thanks for all the help.

Stave Escura
  • You may want to try replacing your HashMaps with [Javolution FastMaps](http://javolution.org/target/site/apidocs/javolution/util/FastMap.html). They are designed to provide real time behavior. This should be easy to try and may correct the problem, without causing a large change to your code. – Christian Trimble Dec 22 '12 at 15:08
  • Thanks for the suggestion, I'll give them an evaluation. I'm currently thinking about the best approach. Google maps are another option: I already used some of their code in other projects, and they are quite nicely implemented. I'm also considering Scala immutable maps as well: immutability is just a good thing in threaded code. – Stave Escura Dec 22 '12 at 16:57

You might be running into the allocation wall, that is: your program can run no faster than it can allocate objects, and allocation is limited by memory bandwidth.
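
A quick way to see this effect (sizes and iteration counts picked arbitrarily) is a benchmark in which every thread does nothing but allocate; throughput can stop improving once the aggregate allocation rate saturates memory and GC bandwidth, regardless of how many cores are free:

    public class AllocationWallDemo {
        public static void main(String[] args) throws InterruptedException {
            int threads = Integer.parseInt(args[0]);
            final int iterations = 50000000;
            Thread[] workers = new Thread[threads];
            long start = System.nanoTime();
            for (int i = 0; i < threads; i++) {
                workers[i] = new Thread(new Runnable() {
                    public void run() {
                        long blackhole = 0;
                        for (int j = 0; j < iterations; j++) {
                            byte[] chunk = new byte[64]; // constant small allocations
                            chunk[j & 63] = (byte) j;    // touch the memory
                            blackhole += chunk[0];
                        }
                        if (blackhole == 42) {
                            System.out.println("unlikely"); // keep the work from being optimised away
                        }
                    }
                });
                workers[i].start();
            }
            for (Thread t : workers) {
                t.join();
            }
            System.out.println(threads + " threads: "
                    + (System.nanoTime() - start) / 1000000 + " ms");
        }
    }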

nes1983