3

Ubuntu 12.04 LTS

java -version
java version "1.6.0_38"
Java(TM) SE Runtime Environment (build 1.6.0_38-b05)
Java HotSpot(TM) 64-Bit Server VM (build 20.13-b02, mixed mode)

4 core CPU - some Dell server hardware

10 threads from time to time run a "heavy" job over several minutes. At other periods they are doing nothing.
1 thread is supposed to wake up every 5 (or so) secs and send a quick ping over the network to another process. This works nicely as long as the other 10 threads do nothing, but when the other 10 threads are running a "heavy" job it never (or very rarely) get to run and send its ping.

I could understand this if this "heavy" job was CPU intensive. But during such a "heavy" job top says something like 50-100% IO-wait but around 1% CPU usage. Profiling shows that by far most of the time spent by the 10 threads are spent in (waiting I guess) in some NIO call. It all adds up, and is kinda expected because a lot of the heaviness about the job is to read files from disk.

What I do not understand is that during such a "heavy" job, the 1 thread doing pings do not get to run. How can that be explained when top shows 1% CPU usage and it seems (profiling and top) that the 10 threads are spending most of their time waiting for IO. Isnt the 1 ping-thread supposed to get execution-time when the other threads are waiting for IO?

Java thread priority is equal on all 11 threads.

Spreading a few yields here and there in the 10 threads seem to solve (or lower) the problem, but I simply do not understand why the ping thread does not get to run without the yields, when the other threads do not do much but wait for IO.

ADDITIONAL INFO 05.03.2014

I have reproduced the problem in a simpler setup - even though not very simple yet (you will have to find out how to install a Apache Zookeeper server, but it is fairly simple - I can provide info later)

Find Eclipse Kepler project here (maven - build by "mvn package"): https://dl.dropboxusercontent.com/u/25718039/io-test.zip
Find binary here: https://dl.dropboxusercontent.com/u/25718039/io-test-1.0-SNAPSHOT-jar-with-dependencies.jar

Start a Apache ZooKeeper 3.4.5 (on port 2181) server on a machine. On another separate machine (this is where I have Ubuntu 12.04 LTS etc. as described above) run the binary as follows (first create a folder io-test-files - 50GB space needed)

nohup java -cp io-test-1.0-SNAPSHOT-jar-with-dependencies.jar dk.designware.io_test.ZKIOTest ./io-test-files 10 1024 5000000 IP-of-ZooKeeper-server:2181 > ./io-test-files/stdouterr.txt 2>&1 &

First it creates 10 5GB files (50GB is way more than machine RAM so not much help by OS file-cache), then starts a ZooKeeper client (which is supposed to keep its connection with the ZooKeeper server up by sending pings/heartbeats regularly), then makes 10 threads doing random-access into the 10 files creating a lot of disk IO, but really no real usage of the CPU. I see that the ZooKeeper client eventually loses its connection ("Zk state"-prints stop saying "CONNECTED" - in stdouterr.txt), and that is basically what I do not understand. The ZooKeeper client-thread only wants to send a tiny heartbeat with several secs apart, and only if it is not able to do that within a period of 20 secs is will lose its connection. I would expect it to have easy access to the CPU, because all the other threads are basically only waiting for disk IO.

During the test I see the following using "top"

  • Very high "Load average". Above 10 which I do not understand, because there are basically only 10 threads doing something. I also thought that "Load average" only counted threads that actually wanted to do real stuff on the CPU (not including waiting of IO), but according to http://en.wikipedia.org/wiki/Load_%28computing%29 Linux also counts "uninterruptible sleeps" including threads waiting of IO. But I really do not hope/think that it will prevent other threads that have real stuff to do, from getting their hands on the CPU
  • Very high %wa, but almost no %sy and %us on the CPU(s)

Here is the output from one of my runs: https://dl.dropboxusercontent.com/u/25718039/io-test-output.txt

Per Steffensen
  • 613
  • 2
  • 7
  • 19
  • This is a very generic question without any code example. Can you reproduce the problem for us to see? – Giovanni Botta Feb 28 '14 at 18:12
  • yes please, what @GiovanniBotta said (: – Arc Feb 28 '14 at 18:17
  • this question sounds like "it doesnt work, tell me why?" you can have milions of problems when you use threads. dead locks, live locks, starvation, syncronization problems etc... please edit this question and post your code – Dima Feb 28 '14 at 18:17
  • I can reproduce it consistently on my "real" system, but its a lot of code that I cannot just share here. I will try to reproduce in a smaller example. For now I am just asking if anyone can come up with any explanation for this given the facts that I provided. Just looking for a answer a-la: Yes I can explain that, because jvm is not doing preemptive scheduling, or the threads doing a NIO will still want the CPU, or something. Just POTENTIAL explanations. Code will come later, if am able to reproduce in smaller setup – Per Steffensen Feb 28 '14 at 18:18
  • No deadlocks. Starvation, hmmm, guess you can call the 50-100% IO wait starvation, but 1% CPU does not seem like starvation, and this question is basically about why this thread does not get the CPU. No sync-problems - non of the threads really depends on each other, they all just do a lot of reading and processing on their individual files – Per Steffensen Feb 28 '14 at 18:27
  • how do you know there is no deadlocks? how do you know there is no live locks? are you sure? maybe the memory is overloaded? maybe there is some "syso" in your code that does not go to any output and just stays in buffer? – Dima Feb 28 '14 at 18:28
  • '1% CPU usage' - not livelock. Should work OK, really. – Martin James Feb 28 '14 at 18:47
  • ' thread is supposed to wake up every 5 (or so) secs' - how are you waiting? – Martin James Feb 28 '14 at 18:48
  • 2
    without seeing your code very hard to say what else it might be doing to cause this. have you tried to up the priority of your 5s task higher than the others and see if that has any effect? – kg_sYy Feb 28 '14 at 18:52
  • 2
    sending ping is also an IO operation, so this thread might be stuck waiting IO too – hoaz Feb 28 '14 at 18:54
  • are these these tasks separate java threads, or are they sharing a thread pool? – jtahlborn Feb 28 '14 at 18:54
  • @kg_sYy - sound to me like it's not becoming ready at all. I suspect some nasty timer/schedule/threadpool artifact. I think jtahlborn is thinking along the same lines. – Martin James Feb 28 '14 at 19:01
  • @hoaz - I don't see it, somehow. Disk and network are different channels, different drivers. – Martin James Feb 28 '14 at 19:16
  • Any chance to run with just 9 "heavy" jobs and see what happens? – Harald Feb 28 '14 at 21:19
  • Thanks for the input, guys. The ping is network IO, I wouldnt expect that to have problems getting out when the IO from the 10 threads is disk IO. This single thread is actually a ZooKeeper client sending heartbeats. Not sure how it waits but I will look into it. We have tried to up the priority of the ZK thread, but it doesnt solve the problem. The 10 threads are running from a threadpool, but I am sure the ping thread is either its own thread or at least not run from a threadpool shared with the 10 threads. The 10 is not exact - it might be 8 og 12, but I am sure lower to 9 will not help. – Per Steffensen Mar 02 '14 at 12:16
  • I do not hope that processes/threads in Linux waiting for disk IO will prevent other processes/threads from running. Even though it is synchronous and blocking IO – Per Steffensen Mar 05 '14 at 15:14

0 Answers0