
I am running into a very peculiar issue. My Tomcat instance runs fine at about 25% CPU 24/7, but some days the CPU shoots up to 60%, the system grinds to a halt, and it fails to recover.

When I take a thread dump during the slowdown, almost all the threads are busy with some kind of String or related operation.

There are no OutOfMemory errors or other exceptions being thrown, and all requests are still handled, but response times deteriorate drastically: even a sub-second request slows down to take 60 seconds or more.

My server config is as follows:

    Ubuntu 12.04.2 LTS
    Linux 3.2.0-38-virtual #60-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux
    java version "1.7.0_13"
    Java(TM) SE Runtime Environment (build 1.7.0_13-b20)
    Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)
    export JAVA_OPTS='-server
    -Xms18g -Xmx18g
    -XX:MaxPermSize=512m
    -XX:ThreadStackSize=512
    -XX:NewRatio=1
    -XX:SurvivorRatio=4
    -XX:+UseConcMarkSweepGC
    -XX:+UseParNewGC
    -XX:+CMSClassUnloadingEnabled
    -Xloggc:/usr/tomcat/logs/gc.log
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -XX:+PrintTenuringDistribution
    -Dcom.sun.management.jmxremote
    -Dcom.sun.management.jmxremote.port=9999
    -Dcom.sun.management.jmxremote.authenticate=false
    -Dcom.sun.management.jmxremote.ssl=false
    -Djava.awt.headless=true'

Click here to download the thread dump. I have removed the bulk of the threads and their stack traces.

Click here to download vmstat log

Click here to download gc log

Any ideas as to the cause of this? Thanks

Rudi Strydom
  • Do you get more requests at those times? Do you have network graphs to compare to? – Brendan Long Feb 01 '13 at 16:16
  • Any chance of this being stop-the-world GC? Do you log GC events? – mindas Feb 01 '13 at 16:50
  • Maybe your app is happy running with a set of usual chars, then once in a blue moon an unusual char shows up, which somehow disturbs the VM optimization. You can test this theory by feeding it different chars and seeing whether any char triggers the problem. If so, it should be a VM bug. – irreputable Feb 01 '13 at 17:17
  • Most threads are executing String.toUpperCase(). Unfortunately the stack traces do not show what code calls this method. Did you truncate the stack traces? – gogognome Feb 26 '13 at 15:14
  • This is a typical scenario when your web server gets hit by a crawler, or otherwise gets hit by a lot of requests. Enable access logs so you at least can see what's hitting your server. – nos Sep 19 '13 at 21:17
  • Regular expressions are a typical candidate for wild runs. In your case probably not. – Joop Eggen Nov 22 '13 at 15:34
  • mindas: Definitely not GC. In a STW collection, the JVM will use all CPUs at 100% unless prevented from doing so (e.g. by pinning). It wouldn't stop at 60% CPU utilization. – kittylyst Nov 23 '13 at 14:17
  • Is the usual thread dump different? Fewer string operations? – flup Nov 25 '13 at 01:08
  • Regarding "I have removed the bulk of the threads and their stack traces," just give us everything. Otherwise you might delete an important detail. If you wish to hide some information, then abbreviate the package or rename the Class.method(). – jtoberon Nov 26 '13 at 03:48

9 Answers


To try to pinpoint the offending requests, you can configure the Stuck Thread Detection Valve in Tomcat.

This valve detects requests that take a long time to process, which might indicate that the thread processing them is stuck.

When such a request is detected, the current stack trace of its thread is written to the Tomcat log at WARN level.

The IDs and names of the stuck threads are available through JMX in the stuckThreadIds and stuckThreadNames attributes. The IDs can be used with the standard Threading JVM MBean (java.lang:type=Threading) to retrieve other information about each stuck thread.
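
For reference, a minimal sketch of how this could be configured, assuming a Tomcat version that ships the valve (7.0.14 or later); the 60-second threshold is just an illustrative value:

    <!-- inside the <Host> element of conf/server.xml -->
    <Valve className="org.apache.catalina.valves.StuckThreadDetectionValve"
           threshold="60" />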

fglez

If the CPU utilisation is below 100% and yet the app has ground to a halt, this implies that something is preventing the CPU from being fully utilized.

I/O or excessive context switching (e.g. caused by locks) are the usual culprits for this.

Can you post the output of vmstat 1 during one of the events? The next step in diagnosis is to work out whether context switching is the issue here.
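
If it helps, a rough sketch of how to capture that with a standard Linux vmstat (the file name and duration are arbitrary):

    # sample once per second for ten minutes during an event
    vmstat 1 600 > /tmp/vmstat-slowdown.log
    # afterwards look mainly at these columns:
    #   cs    - context switches per second (lock/scheduler churn)
    #   wa    - % CPU time spent waiting on I/O
    #   si/so - swap-in / swap-out activity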

kittylyst

This is not a memory issue: in your dump the GC isn't even busy and there is enough memory available. In addition, the CPU is stuck at 60%, but if the application were busy calculating (GC or anything else) it would be stuck at 100%, and the same would be true for a network attack. So the source of this problem must involve some disk I/O operation.

Tomcat is known to be buggy and has several severe issues. One thing I encountered is that, for no particular reason, Tomcat suddenly flooded its own logfile with nonsense entries. This not only caused the disk to fill to 100% but also dramatically slowed down incoming requests. You can check for this by looking at the Tomcat logs and their size.

If this is not the source, you should use the available tools to check for any unusual disk I/O from Tomcat and proceed from there.
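
A rough sketch of that check with standard Linux tools; the log path is taken from the -Xloggc setting above, and iostat assumes the sysstat package is installed:

    # how big are the Tomcat logs, and is the disk close to full?
    du -sh /usr/tomcat/logs/*
    df -h
    # per-device I/O utilisation, refreshed every second
    iostat -x 1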

TwoThe

I think your problem is this configuration decision: -XX:PermSize=320m -XX:MaxPermSize=320m. By not allowing your PermGen space to grow dynamically, you're causing a deadlock when you exhaust it; remember that the String intern cache uses PermGen. I would try changing -XX:MaxPermSize=320m to -XX:MaxPermSize=512m.
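
If you want to verify whether PermGen is actually close to its limit before changing anything, a sketch using jstat (the PID is a placeholder):

    # on JDK 7, the PC/PU columns show PermGen capacity/usage in KB
    jstat -gc <tomcat-pid> 5000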

Elliott Frisch

Try increasing the maximum size of the Code Cache with the following JVM option:

-XX:ReservedCodeCacheSize=256m

See my answer to another question for the background to this suggestion.
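
A possible way to confirm that the code cache is the culprit, assuming HotSpot 7 (the -XX:+PrintCodeCache flag may not be present on older 7 builds):

    # add alongside the existing JAVA_OPTS
    -XX:ReservedCodeCacheSize=256m
    -XX:+PrintCodeCache          # prints code cache usage when the JVM exits
    # and watch catalina.out for the tell-tale warning:
    #   "CodeCache is full. Compiler has been disabled."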

Jonas Meller
  • Hi Jonas, thanks for the suggestion. If the JIT compiler is being disabled, it would explain the sudden onset of poor performance. I have implemented it in my web servers and will let you know the outcome. – Rudi Strydom Nov 13 '14 at 16:48
  • Hi Rudi, any news? I'm curious to know if it helped you as well. – Jonas Meller Nov 27 '14 at 11:15
  • So far so good, almost a month without any issues. I am awarding the points after having this question open for 668 days. – Rudi Strydom Dec 01 '14 at 15:43

Are there any anomalies in the GC log? It seems like you are running a pretty big heap with some unusual options and doing a lot of String-allocating work. Maybe you suffer from heap fragmentation over time (CMS does not compact). Also make sure no swapping is going on (this can happen if the heap is far too large, so parts of it are rarely visited by the VM).

I'd suspect this to be GC related, as apparently no thread is blocked. Have you tried a more recent JDK? Also, you might retry after removing the somewhat uncommon option -XX:+CMSScavengeBeforeRemark, because such options may not get much test coverage with each minor JDK release.

Another suspicion might be incoming requests using unusual charsets (Cyrillic or Arabic), leading to massive charset-mapping overhead. Also check whether some robot is hitting your page; are any suspicious requests coming in? You definitely need longer stack traces to find out the root operation of the string manipulation.
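
A quick sketch of how to check the fragmentation and swapping suspicions, assuming the gc.log path from the question:

    # CMS fragmentation / promotion problems show up as these events in the GC log
    grep -E "promotion failed|concurrent mode failure" /usr/tomcat/logs/gc.log
    # is any swap space in use?
    free -m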

R.Moeller
  • I also conclude that it is GC related. But I have doubts about being a character set issue (though using UTF-8 rather than latin1 is a good idea), because I see many threads at ``CharacterDataLatin1:222`` ... which is a simple assignment: ``int mapChar = ch;``. OP never provided the gc timestamp output. I would suggest [reading this answer](http://stackoverflow.com/a/4331097/431356) Re: Solaris. – brettw Nov 28 '13 at 09:28
  • Hi, thanks for the response. I have uploaded vmstat as well as gc logs. – Rudi Strydom Aug 20 '14 at 11:11

You can use BTrace to diagnose which methods are being invoked.

Write a BTrace script like the one below: it traces classes under the com.xx.xx package prefix that invoke any method of String, and prints the invocation counts.

import com.sun.btrace.BTraceUtils;
import com.sun.btrace.annotations.BTrace;
import com.sun.btrace.annotations.OnMethod;
import com.sun.btrace.annotations.OnTimer;
import com.sun.btrace.annotations.TLS;

import java.util.Map;

@BTrace
public class StringCallCounter {

    @TLS
    private static Map<String, Integer> countMap = BTraceUtils.newHashMap();

    // package prefix you want to trace, e.g. com.xx.xx
    private static String prefix = "com.xx.xx";

    // fires on every method of java.lang.String
    @OnMethod(clazz = "java.lang.String", method = "/.*/")
    public static void traceMethodInvoke() {
        String str = BTraceUtils.jstackStr();
        // walk the current stack trace and count the first frame from our own package
        for (String currentClass : str.split("\\n")) {
            if (BTraceUtils.Strings.startsWith(currentClass, prefix)) {
                if (!countMap.containsKey(currentClass)) {
                    countMap.put(currentClass, 1);
                } else {
                    countMap.put(currentClass, countMap.get(currentClass) + 1);
                }
                break;
            }
        }
    }

    // dump the counters every 5 seconds
    @OnTimer(5000)
    public static void print() {
        BTraceUtils.println("========================================");
        for (Map.Entry<String, Integer> entry : countMap.entrySet()) {
            if (entry.getValue() > 100) { // only print callers seen more than 100 times
                BTraceUtils.println(entry.getValue() + "\t\t" + entry.getKey());
            }
        }
        BTraceUtils.println("========================================");
    }
}

The output looks something like this:

====================================================
1022                           com.xx.xx.classA#m1
322                            com.xx.xx.classA#m2
2022                           com.xx.xx.classA#m21
422                            com.xx.xx.ccc.classX#m11
522                            com.xx.xx.zz.classS#m44
.........

You can change the prefix to trace a different package.

Based on the result, you can analyze the source code and find the problem.
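
To attach the script, a sketch assuming the classic BTrace command-line client and that the class above is saved as StringCallCounter.java (the PID is a placeholder):

    # find the Tomcat JVM's PID, then attach BTrace to it
    jps -l
    btrace <tomcat-pid> StringCallCounter.java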

lichengwu

Having scanned through the thread dump looking at the RUNNABLE threads, one thing stands out: your system seems to be processing, or trying to process, a large number of requests at the same time. Unless you have a good number of cores, there is likely to be a lot of time slicing. On the other hand, I can see no clear *evidence* that this is GC related. (But you didn't include the GC logs ...)

I suggest that you look at two things.

  • Look at the operating system's virtual memory statistics. One possible cause of catastrophic system slowdown is virtual memory thrashing. This is where the total application demand for virtual memory pages exceeds the physical memory available ... and the operating system spends a lot of time swapping pages between physical memory and swap disc / page file.
  • Look at the pattern of requests you are getting. It could be that at certain times the number / type of requests you are getting simply exceeds your system's capacity.

If the problem is VM thrashing, then the solution is to reduce application memory demand. The simple way to do this is to reduce the Java heap size.

If the problem is load, then that is harder to solve:

  • You could try beefing up your hardware (or adding more VCPUs to your virtual).
  • You could try spreading the load over multiple instances of your server.
  • You could try reducing the number of worker threads so that your server doesn't try to handle so many requests at a time (see the sketch after this list).
  • You could try profiling the application and/or analysing the request stats to see if there are hotspots that could be tuned, or expensive functionality that could be turned off ...
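
For the worker-thread suggestion, a sketch of the relevant knob in Tomcat's conf/server.xml (the value is illustrative; the default HTTP connector uses maxThreads="200"):

    <!-- lowering maxThreads caps how many requests are processed concurrently -->
    <Connector port="8080" protocol="HTTP/1.1"
               connectionTimeout="20000"
               maxThreads="100" />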

Finally, you could see if it helps to switch from CMS to the Parallel Collector; refer to the Oracle GC Tuning page: Available Collectors. But I doubt that this is a GC problem.

Stephen C

The first thing you should do is find out which threads are actually consuming the CPU. It could be the threads performing the string operations, or it could be another VM thread performing GC and sweep operations. The link explains how to correlate the CPU spikes with the thread dump.
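
The usual way to do that correlation on Linux, sketched below (the PID and thread id are placeholders):

    # 1. per-thread CPU usage for the Tomcat process
    top -H -p <tomcat-pid>
    # 2. convert the hottest thread's TID to hex
    printf '%x\n' <tid>
    # 3. take a thread dump and find the matching "nid=0x..." entry
    jstack <tomcat-pid> > /tmp/threads.txt
    grep -A 20 'nid=0x<hex-tid>' /tmp/threads.txt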

Once you can pinpoint those threads, it will be clearer what the next step should be.

Hope this helps

Anugoonj
  • I can definitely confirm that it is not due to GC. Here are the stats of what the VM performed: 2,500 ParNew operations with a collection time of 486,866 ms, and 5 ConcurrentMarkSweep operations with a collection time of 3,029 ms. – Rudi Strydom Feb 14 '13 at 13:09