Ubuntu 12.04 LTS
java -version
java version "1.6.0_38"
Java(TM) SE Runtime Environment (build 1.6.0_38-b05)
Java HotSpot(TM) 64-Bit Server VM (build 20.13-b02, mixed mode)
4 core CPU - some Dell server hardware
10 threads from time to time run a "heavy" job over several minutes. At other periods they are doing nothing.
1 thread is supposed to wake up every 5 (or so) secs and send a quick ping over the network to another process. This works nicely as long as the other 10 threads do nothing, but when the other 10 threads are running a "heavy" job it never (or very rarely) get to run and send its ping.
I could understand this if this "heavy" job was CPU intensive. But during such a "heavy" job top says something like 50-100% IO-wait but around 1% CPU usage. Profiling shows that by far most of the time spent by the 10 threads are spent in (waiting I guess) in some NIO call. It all adds up, and is kinda expected because a lot of the heaviness about the job is to read files from disk.
What I do not understand is that during such a "heavy" job, the 1 thread doing pings do not get to run. How can that be explained when top shows 1% CPU usage and it seems (profiling and top) that the 10 threads are spending most of their time waiting for IO. Isnt the 1 ping-thread supposed to get execution-time when the other threads are waiting for IO?
Java thread priority is equal on all 11 threads.
Spreading a few yields here and there in the 10 threads seem to solve (or lower) the problem, but I simply do not understand why the ping thread does not get to run without the yields, when the other threads do not do much but wait for IO.
ADDITIONAL INFO 05.03.2014
I have reproduced the problem in a simpler setup - even though not very simple yet (you will have to find out how to install a Apache Zookeeper server, but it is fairly simple - I can provide info later)
Find Eclipse Kepler project here (maven - build by "mvn package"): https://dl.dropboxusercontent.com/u/25718039/io-test.zip
Find binary here: https://dl.dropboxusercontent.com/u/25718039/io-test-1.0-SNAPSHOT-jar-with-dependencies.jar
Start a Apache ZooKeeper 3.4.5 (on port 2181) server on a machine. On another separate machine (this is where I have Ubuntu 12.04 LTS etc. as described above) run the binary as follows (first create a folder io-test-files - 50GB space needed)
nohup java -cp io-test-1.0-SNAPSHOT-jar-with-dependencies.jar dk.designware.io_test.ZKIOTest ./io-test-files 10 1024 5000000 IP-of-ZooKeeper-server:2181 > ./io-test-files/stdouterr.txt 2>&1 &
First it creates 10 5GB files (50GB is way more than machine RAM so not much help by OS file-cache), then starts a ZooKeeper client (which is supposed to keep its connection with the ZooKeeper server up by sending pings/heartbeats regularly), then makes 10 threads doing random-access into the 10 files creating a lot of disk IO, but really no real usage of the CPU. I see that the ZooKeeper client eventually loses its connection ("Zk state"-prints stop saying "CONNECTED" - in stdouterr.txt), and that is basically what I do not understand. The ZooKeeper client-thread only wants to send a tiny heartbeat with several secs apart, and only if it is not able to do that within a period of 20 secs is will lose its connection. I would expect it to have easy access to the CPU, because all the other threads are basically only waiting for disk IO.
During the test I see the following using "top"
- Very high "Load average". Above 10 which I do not understand, because there are basically only 10 threads doing something. I also thought that "Load average" only counted threads that actually wanted to do real stuff on the CPU (not including waiting of IO), but according to http://en.wikipedia.org/wiki/Load_%28computing%29 Linux also counts "uninterruptible sleeps" including threads waiting of IO. But I really do not hope/think that it will prevent other threads that have real stuff to do, from getting their hands on the CPU
- Very high %wa, but almost no %sy and %us on the CPU(s)
Here is the output from one of my runs: https://dl.dropboxusercontent.com/u/25718039/io-test-output.txt