
We're running SonarQube 5.1.2 on an AWS node. After a short period of use, typically a day or two, the Sonar web server becomes unresponsive and spikes the server's CPUs:

top - 01:59:47 up 2 days,  3:43,  1 user,  load average: 1.89, 1.76, 1.11
Tasks:  93 total,   1 running,  92 sleeping,   0 stopped,   0 zombie
Cpu(s): 94.5%us,  0.0%sy,  0.0%ni,  5.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   7514056k total,  2828772k used,  4685284k free,   155372k buffers
Swap:        0k total,        0k used,        0k free,   872440k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2328 root      20   0 3260m 1.1g  19m S 188.3 15.5  62:51.79 java
   11 root      20   0     0    0    0 S  0.3  0.0   0:07.90 events/0
 2284 root      20   0 3426m 407m  19m S  0.3  5.5   9:51.04 java
    1 root      20   0 19356 1536 1224 S  0.0  0.0   0:00.23 init

The 188% CPU load is coming from the WebServer process:

$ ps -eF|grep "root *2328"
root      2328  2262  2 834562 1162384 0 Mar01 ?       01:06:24 /usr/java/jre1.8.0_25/bin/java -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djruby.management.enabled=false -Djruby.compile.invokedynamic=false -Xmx768m -XX:MaxPermSize=160m -XX:+HeapDumpOnOutOfMemoryError -Djava.io.tmpdir=/opt/sonar/temp -cp ./lib/common/*:./lib/server/*:/opt/sonar/lib/jdbc/mysql/mysql-connector-java-5.1.34.jar org.sonar.server.app.WebServer /tmp/sq-process615754070383971531properties

We initially thought that the node was simply far too small and recently upgraded to an m3.large instance, but we're seeing the same problem (except that it now spikes two CPUs instead of one).

The only interesting info in the log is this:

2016.03.04 01:52:38 WARN  web[o.e.transport] [sonar-1456875684135] Received response for a request that has timed out, sent [39974ms] ago, timed out [25635ms] ago, action [cluster:monitor/nodes/info], node [[#transport#-1][xxxxxxxx-build02-us-west-2b][inet[/127.0.0.1:9001]]], id [43817]
2016.03.04 01:53:19 INFO  web[o.e.client.transport] [sonar-1456875684135] failed to get node info for [#transport#-1][xxxxxxxx-build02-us-west-2b][inet[/127.0.0.1:9001]], disconnecting...
org.elasticsearch.transport.ReceiveTimeoutTransportException: [][inet[/127.0.0.1:9001]][cluster:monitor/nodes/info] request_id [43817] timed out after [14339ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:366) ~[elasticsearch-1.4.4.jar:na]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [na:1.8.0_25]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [na:1.8.0_25]
    at java.lang.Thread.run(Unknown Source) [na:1.8.0_25]

Does anyone know what might be going on here or has some ideas how to further diagnose this problem?
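One way to narrow down whether it's GC or application code burning the CPU is to map the busiest native thread (from `top -H`) to a Java stack frame in a thread dump. A minimal sketch, assuming the WebServer PID from the `top` output above; the thread id is hypothetical and would be read from `top -H -p $PID` during an actual spike:

```shell
#!/bin/sh
# Sketch: correlate top's per-thread CPU view with a jstack thread dump.
PID=2328      # the WebServer process from the question
TID=2345      # hypothetical: the thread id that `top -H -p $PID` shows as busiest

# jstack prints each thread's native id as nid=0x... in hex,
# so convert the decimal TID before grepping:
NID=$(printf '0x%x' "$TID")
echo "$NID"   # -> 0x929

# Pull that thread's stack from a dump; if the hot threads turn out to be
# "GC task thread" or VM threads, heap/GC pressure is the likely culprit.
# jstack "$PID" | grep -A 20 "nid=$NID"
```

If the hot threads are HTTP worker threads instead, their stack traces should show what they're actually doing.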

G. Ann - SonarSource Team
raner
  • It's difficult to tell exactly what the problem is. If you want to dig further, you should monitor the JVM closely, for instance to check whether it's the GC that's eating all the CPU. – Fabrice - SonarSource Team Mar 07 '16 at 10:56
  • Thanks, Fabrice. I'll do a full thread dump next time it happens. – raner Mar 08 '16 at 19:58
  • Thread dump is here: http://pastebin.com/AAg0RRVm – raner Mar 09 '16 at 01:03
  • Interestingly, it seems that the process recovers after a while and reverts to normal CPU usage. However, once this problem starts occurring it will continue to happen rather frequently unless the server is restarted. – raner Mar 09 '16 at 01:05
  • I had hoped to provide some further insight by getting a Flight Recorder dump at the time this problem happens, but due to the high CPU load jcmd is not even able to attach to the process. I'm stumped at this point; on some days we have to restart SonarQube several times a day. Has anyone seen anything like this before or has any ideas how I could further diagnose this? – raner Mar 15 '16 at 17:03
  • Which Linux distro is this running on ? – Nicolas B. Mar 16 '16 at 12:39
  • @n_stan `lsb_release -a` reports: Distributor ID: CentOS, Description: CentOS release 6.6 (Final), Release: 6.6, Codename: Final – raner Mar 16 '16 at 17:03
  • I'm not sure that the ReceiveTimeoutTransportException is the root cause of the problem; it could be a consequence of the CPU spike. Do you see any correlation with the HTTP requests processed at that time (see access.log)? Note that I suggest upgrading to 5.2 or greater: version 5.1 puts too much pressure on the Elasticsearch indices, and thus indirectly on overall performance. – Simon Brandhof Mar 16 '16 at 20:31
  • @SimonBrandhof-SonarSource we're heavily using incremental mode, which is not available in 5.2, as far as I understand. – raner Mar 16 '16 at 23:13
  • We could really use some help with this... I found some .hprof files that were due to an out of memory error, so I changed the memory limit to -Xmx2048m, but we're still seeing the same problem... we have to restart the server several times a day now :-( – raner Mar 16 '16 at 23:15
  • This is what I get when I try to profile it during the CPU spike: `$ sudo /usr/java/default/bin/jcmd 31990 JFR.start` fails with `31990: com.sun.tools.attach.AttachNotSupportedException: Unable to open socket file: target process not responding or HotSpot VM not loaded` – raner Mar 16 '16 at 23:17
  • @SimonBrandhof-SonarSource In terms of relation with HTTP requests, about 20 people plus several automated build processes are using this SonarQube installation, so it's hard to detect a correlation... – raner Mar 16 '16 at 23:20
  • @user1932890 the "issues" mode introduced in v5.3 is incremental by default (see https://jira.sonarsource.com/browse/SONAR-6931). For more details please start a new thread. You should not be restricted with version 5.1 which is known to have performance and stability issues. – Simon Brandhof Mar 17 '16 at 10:50
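On the heap change mentioned in the comments: in SonarQube 5.x the web server's `-Xmx` comes from `conf/sonar.properties` rather than the wrapper command line, so a raised limit needs to live there to survive restarts. A sketch with illustrative values (not a tuning recommendation); the embedded Elasticsearch process, which the ReceiveTimeoutTransportException in the log relates to, has its own setting:

```
# conf/sonar.properties -- illustrative values only

# JVM options for the web server process (the one spiking above):
sonar.web.javaOpts=-Xmx2048m -XX:MaxPermSize=160m -XX:+HeapDumpOnOutOfMemoryError

# JVM options for the embedded Elasticsearch (search) process:
sonar.search.javaOpts=-Xmx1G -Xms256m
```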

0 Answers