
I'm running a nightly CPU-intensive Java application on an EC2 server (c1.xlarge) with eight cores and 7.5 GB RAM, running 64-bit Linux (Ubuntu 9.10, Karmic Koala).

The application is architected so that a variable number of workers is constructed (each in its own thread); the workers fetch messages from a queue and process them.

Throughput is the main concern here, and performance is measured in processed messages per second. The application is NOT RAM-bound, and as far as I can see not I/O-bound either (although I'm no Linux expert; I'm using dstat to check I/O load, which is pretty low, and CPU wait signals, which are almost non-existent).

I'm seeing the following when spawning different numbers of workers (worker threads):

  • 1 worker: throughput 1.3 messages / sec / worker

  • 2 workers: throughput ~0.8 messages / sec / worker

  • 3 workers: throughput ~0.5 messages / sec / worker

  • 4 workers: throughput ~0.05 messages / sec / worker

I was expecting a near-linear increase in throughput, but reality proves otherwise: with four workers the aggregate throughput is only ~0.2 messages/sec, less than a single worker achieves on its own.

Three questions:

  1. What might be causing the sub-linear scaling when going from one worker to two, and from two to three?

  2. What might be causing the (almost) complete halt when going from three workers to four? It looks like some kind of deadlock situation (can this happen due to heavy context switching?).

  3. How would I start measuring where the problems occur? My development box has two CPUs and runs Windows. I normally attach a GUI profiler and check for threading issues, but the problem only really starts to manifest itself with more than two threads.

Some more background information:

  • Workers are spawned using an Executors.newScheduledThreadPool

  • A worker thread does calculations based on the message (CPU-intensive). Each worker thread contains a separate persistQueue used for offloading writes to disk (and thus making use of CPU / I/O concurrency):

    persistQueue = new ThreadPoolExecutor(1, 1, 100, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<Runnable>(maxAsyncQueueSize), new ThreadPoolExecutor.AbortPolicy());

The flow (per worker) goes like this:

  1. The worker thread puts the result of a message into persistQueue and gets on with processing the next message.

  2. The ThreadPoolExecutor (of which we have one per worker thread) contains only one thread, which processes all incoming data waiting in the persistQueue and writes it to disk (Berkeley DB + Apache Lucene).

  3. The idea is that 1. and 2. can run concurrently for the most part, since 1. is CPU-heavy and 2. is I/O-heavy.

  4. It's possible for persistQueue to become full. The queue is bounded because otherwise a slow I/O system might flood the queues and cause OOM errors (yes, it's a lot of data). In that case the worker thread pauses until it can write its content to persistQueue. A full queue hasn't happened yet on this setup (which is another reason I think the application is definitely not I/O-bound).
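The flow above can be sketched roughly as follows. This is a minimal illustration, not the poster's actual code: Message handling is reduced to strings, compute() and persist() are hypothetical stand-ins, and CallerRunsPolicy is one possible way to implement the "worker pauses when the queue is full" behaviour (the original uses AbortPolicy and presumably handles the rejection itself).

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class WorkerSketch {
    // Single persist thread with a bounded queue, as described above.
    private final ThreadPoolExecutor persistQueue = new ThreadPoolExecutor(
            1, 1, 100, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<Runnable>(1000),
            // When the queue is full, the worker performs the write itself
            // instead of aborting -- a simple form of backpressure.
            new ThreadPoolExecutor.CallerRunsPolicy());

    void processOne(String message) {
        String result = compute(message);            // CPU-heavy step (1)
        persistQueue.execute(() -> persist(result)); // I/O-heavy step (2)
    }

    String compute(String message) { return message.toUpperCase(); } // stand-in
    void persist(String result)    { /* write to Berkeley DB / Lucene */ }

    void shutdown() {
        persistQueue.shutdown();
        try {
            persistQueue.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```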

Some final information:

  • Workers are isolated from each other concerning their data, except:

    • They share some heavily used static final maps (used as caches; the maps are memory-intensive, so I can't keep them local to a worker even if I wanted to). The operations that workers perform on these caches are iteration, lookup, and contains (no writes, deletes, etc.).

    • These shared maps are accessed without synchronization (no need, right?).

    • Workers populate their local data by selecting data from MySQL (based on keys in the received message), so this is a potential bottleneck. However, it's mostly reads, the queried tables are optimized with indexes, and again it is not I/O-bound.

    • I have to admit that I haven't done much MySQL server optimization yet (in terms of config parameters), but I just don't think that is the problem.

  • Output is written to:

    • Berkeley DB (using a memcached(b) client). All workers share one server.
    • Lucene (using a home-grown low-level indexer). Each worker has a separate indexer.
  • Even with output writing disabled, the problems occur.
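For what it's worth, unsynchronized reads of those shared caches are only safe if the maps are fully populated before the worker threads start and never modified afterwards. A small illustrative sketch of making that contract explicit (the key/value types and contents here are made up):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class SharedCaches {
    // Populated once during class initialization and read-only afterwards.
    // 'static final' gives safe publication to threads started later.
    public static final Map<String, String> CACHE;

    static {
        Map<String, String> m = new HashMap<>();
        m.put("key", "value"); // ... load the real cache contents here ...
        // The unmodifiable wrapper makes any later write fail fast.
        CACHE = Collections.unmodifiableMap(m);
    }
}
```

An accidental CACHE.put(...) from a worker would then throw UnsupportedOperationException instead of silently corrupting the map under concurrent readers.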

This is a huge post, I realize that, but I hope you can give me some pointers as to what this might be, or how to start monitoring / deducing where the problem lies.

Peter Mortensen
Geert-Jan

4 Answers


If I were you, I wouldn't put much faith in anybody's guesswork as to what the problem is. I hate to sound like a broken record, but there's a very simple way to find out - stackshots. For example, in your 4-worker case that is running 20 times slower, every time you take a sample of a worker's call stack, the probability is 19/20 that it will be in the hanging state, and you can see why just by examining the stack.

Mike Dunlavey
  • Only now saw your comment. Do you know of a good tool to take / visualize stackshots in a Linux server environment? – Geert-Jan Dec 28 '09 at 20:45
  • **pstack** is one such tool. In a case like this you need very few samples - in fact, just **one sample** is almost certain to show you exactly where the problem is. – Mike Dunlavey Dec 29 '09 at 03:10
  • Related: *[What can I use to profile C++ code in Linux?](http://stackoverflow.com/questions/375913/what-can-i-use-to-profile-c-code-in-linux/378024#378024)* – Peter Mortensen Mar 07 '16 at 20:55
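On the JVM specifically, jstack <pid> serves the same purpose as pstack, and a crude in-process "stackshot" can be taken with Thread.getAllStackTraces(). A minimal sketch (not from the question or answers, just an illustration of the idea):

```java
import java.util.Map;

public class StackShot {
    // Dump the name, state, and current stack of every live thread --
    // a poor man's jstack, callable from inside the application.
    public static String take() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            sb.append(e.getKey().getName())
              .append(" [").append(e.getKey().getState()).append("]\n");
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }
}
```

Taking a few of these while the 4-worker case is crawling, and looking at where the worker threads sit (BLOCKED/WAITING and on what), is exactly the sampling the answer describes.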

Only profiling will help.

But things to check:

  • The workers get information from a queue. What type of queue is it that the producer thread saves to?
  • Why use Executors.newScheduledThreadPool to create your workers? Don't you just want them to run immediately?
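If the scheduled pool is only there so the workers don't all start at once, a plain fixed pool with staggered submission does the same job. An illustrative sketch (the worker body and the 50 ms gap are made-up placeholders):

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class StaggeredStart {
    // Start 'workers' tasks one by one with a small gap between them,
    // instead of relying on a scheduled pool for the ramp-up.
    public static List<Integer> startAll(int workers) {
        List<Integer> started = new CopyOnWriteArrayList<>();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            final int id = i;
            pool.execute(() -> started.add(id)); // real worker loop goes here
            try {
                Thread.sleep(50); // ramp-up gap between worker starts
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return started;
    }
}
```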
Peter Mortensen
Peter
  • The message queue is Amazon Simple Queue Service (AWS SQS). Fetching from the queue is non-blocking (AFAIK). Executors.newScheduledThreadPool is used for the purpose of having a small ramp-up (as I believe the English term is) between workers, so initializing the workers goes more smoothly. – Geert-Jan Dec 23 '09 at 11:06

If I understood correctly, multiple workers all fetch from the same queue, do calculations, and hand the results off to their private writers, like:

              / [ worker ] - [ writer, queue ]
[ msg-queue ] - [ worker ] - [ writer, queue ]
              \ [ worker ] - [ writer, queue ]

Workers might block on access to the message queue; adding a fetcher that manages a queue of work items would solve this problem if it occurs, like:

                                   / [ worker ] - [ writer, queue ]
[ msg-queue ] - [ fetcher, queue ] - [ worker ] - [ writer, queue ]
                                   \ [ worker ] - [ writer, queue ]

Another thing I pick up from your description is that the calculations use a set of collections in a read-only fashion, so concurrency should not be a problem. It might be a good idea to investigate which implementation you use: even if you don't synchronize in your part of the code, collection classes like Vector and Hashtable synchronize by default.

Using immutable versions of collection classes would help to make sure usage of the maps can be concurrent by default.

rsp
  • Your model is correct. The queue is Amazon Simple Queue Service (SQS), which is designed for such a thing, I believe. I will check the Java client implementation (SQS) though, just to be sure. As for the collections: standard java.util + some basic arrays. I will wrap the collections in immutable versions just to be sure. Thanks – Geert-Jan Dec 23 '09 at 11:37

Wild guess: your shared non-blocking data structures spawn lots of memory fence operations, destroying the CPU caches. Non-blocking doesn't come for free.