
Environment :

  • Tomcat 8 (same results in Tomcat 7)
  • JDK 1.7
  • Spring MVC 3
  • Oracle UCP 11.2.0.3
  • Windows Server 2008 R2

I have a single Tomcat 8 instance (same behavior exhibited in Tomcat 7) with two separate Spring MVC applications deployed in it. They were built from the same starter project and have a lot in common, including their use of Oracle UCP, but they perform different functions.

Project #1 has been in production for 9+ months and has never had any speed issues. Project #2 was developed recently, along the same lines as Project #1, but it suffers from a gradual slowdown in response times.

The strange part is that, in the same application server, Project #1 always runs quickly, whereas Project #2 gradually slows down over 24 to 48 hours until it is almost unusable (10+ second response times). That points to a project issue rather than a Tomcat issue.

I don't believe it to be a memory leak because I have :

  1. Profiled the applications with jVisualVM and found no evidence of memory leaks
  2. Taken repeated heap dumps and analyzed them with MAT with no evidence of memory leaks
  3. Manually watched the memory usage and garbage collection and not seen any obvious evidence of memory leaks
  4. Scanned the code with FindBugs - no detrimental bugs found
  5. Scanned the code with SonarQube - no detrimental bugs found
  6. Tomcat's "detect leaks" function finds none
  7. And it never seems to run out of memory; it just gets painfully slow. Redeploying the application restores normal speeds for a while.

I've also monitored the threads with jVisualVM and don't see any unexpected waiting threads. There are a lot of UCP-Worker-Threads, but that's expected since I maintain multiple connection pools with up to 10 connections per pool.
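
For reference, each pool is set up roughly like this (a minimal sketch only; the class name, URL, and credentials are placeholders, not the real values) :

import java.sql.SQLException;

import oracle.ucp.jdbc.PoolDataSource;
import oracle.ucp.jdbc.PoolDataSourceFactory;

public class PoolConfig {

    // Builds one UCP pool capped at 10 connections (placeholder URL/credentials).
    public static PoolDataSource createPool() throws SQLException {
        PoolDataSource pds = PoolDataSourceFactory.getPoolDataSource();
        pds.setConnectionFactoryClassName("oracle.jdbc.pool.OracleDataSource");
        pds.setURL("jdbc:oracle:thin:@//dbhost:1521/SERVICE"); // placeholder
        pds.setUser("app_user");                               // placeholder
        pds.setPassword("app_password");                       // placeholder
        pds.setInitialPoolSize(1);
        pds.setMaxPoolSize(10); // up to 10 connections per pool
        return pds;
    }
}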

Viewing the application-specific Tomcat logs for Project #2, the slowness doesn't seem to be attributable to any particular part of the application; every operation simply becomes very slow. For example, after the deployment slows down, decrypting a string in (slow) Project #2 takes over 1 second, whereas in (fast) Project #1 the exact same code (same library) decrypts the string in milliseconds (or under one millisecond). Those two tests were executed within a minute or two of each other.

I've also run (slow) Project #2 in Tomcat by itself (no other applications deployed besides Tomcat Manager) and the same behavior occurs over time. That should eliminate conflicts between the two applications as the cause.

I've been researching this for a long time and am running out of ideas. What confuses me most is how one deployment can run very slowly while a similar deployment in the same Tomcat instance runs quickly.

Looking for any help, suggestions, pointers, tools to use for more research, info, etc... please : )

(Rough) Timing Comparison

I added a small bit of code to both projects that simply counts to 100 million, appends 1 million characters to a StringBuffer, then returns the timing information in nanoseconds (System.nanoTime()). The following two tests were run back to back :
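
Roughly, the probe looks like this (a minimal sketch; the class name and exact output wording are made up for illustration, only the measured work matches what I added) :

// Rough timing probe (sketch only). The counting loop does no useful work,
// so the JIT may optimize it aggressively; it is just a crude comparison.
public class TimingProbe {

    public static String run() {
        long start = System.nanoTime();

        long counter = 0;
        for (int i = 0; i < 100000000; i++) { // count to 100 million
            counter++;
        }
        long afterCount = System.nanoTime();

        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < 1000000; i++) {   // append 1 million characters
            sb.append('x');
        }
        long end = System.nanoTime();

        return "Counted to 100 million in : " + (afterCount - start) + "ns\n"
                + "Appended 1 million characters to StringBuffer in " + (end - afterCount) + "ns\n"
                + "Total execution time : " + (end - start) + "ns";
    }
}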

Fast/Normal Project #1

Counted to 100 million in : 174604ns 
Appended 1 million characters to StringBuffer in 13075475ns 
Total execution time : 13475805ns

Slow Project #2

Counted to 100 million in : 218954364ns 
Appended 1 million characters to StringBuffer in 13098243ns 
Total execution time : 450329651ns

Dependency Info

org.apache.poi - 3.9
org.json - 20131018
com.oracle.ucp - 11.2.0.3
com.oracle.ons - 11.2.0.3
com.oracle.ojdbc6 - 11.2.0.3
org.springframework.spring-context - 3.2.3.RELEASE
org.springframework.spring-webmvc - 3.2.3.RELEASE
org.springframework.spring-context-support - 3.2.3.RELEASE
org.aspectj.aspectjrt - 1.7.3
org.aspectj.aspectjweaver - 1.7.3
org.apache.httpcomponents.httpclient - 4.3
org.apache.httpcomponents.httpcore - 4.3
com.google.code.gson - 2.2.2
cglib - 2.2.2
org.codehaus.jackson.jackson-core-asl - 1.9.12
org.codehaus.jackson.jackson-mapper-asl - 1.9.12
com.nimbusds.nimbus-jose-jwt - 2.10.1
org.slf4j.slf4j-api - 1.7.5
ch.qos.logback.logback-classic - 1.0.13
ch.qos.logback.logback-core - 1.0.13
javax.inject - 1
javax.servlet.servlet-api - 2.5
javax.servlet.jsp.jsp-api - 2.1
javax.servlet.jstl - 1.2
org.jsoup - 1.7.2
commons-fileupload - 1.3
com.google.guava - 16.0.1
+1 proprietary internal library

<!-- Test Scope - shouldn't affect deployment -->
junit - 4.11
org.mockito.mockito-core - 1.9.5
org.springframework.spring-test - 3.2.3.RELEASE
org.hsqldb - 2.3.1
commons-dbcp - 1.2.2
org.hamcrest.hamcrest-all - 1.3
com.jayway.jsonpath.json-path - 0.8.1
com.jayway.jsonpath.json-path-assert - 0.8.1

Progress?

While under heavy load today, I noticed several HTTP APR threads blocked waiting on logback. (Fast) Project #1 uses log4j rather than logback; that is one of the few dependencies that differ... we may have a culprit. I'll post back when I have more info.

  • Operations on the same library are a good comparison. Does that library use any container resources (such as database connection) or is it stateless, self-contained code in both cases (so that you can rule out different external resources, contention for resources, or a less-efficient saved state somewhere)? – user1676075 Sep 04 '14 at 13:41
  • It has the potential to call an external API via HTTPS but I compared instances where that was not the case - it was self contained. In this example, the cookie/encrypted string should be the only resource accessed - no database or other container resources. Also, the string decryption would've been on the exact same string since it is an encrypted value stored in a cookie. The two tests were done with one specific user, authenticated, and with the same session/same cookie. – darcling Sep 04 '14 at 15:00
  • Sounds like potentially more likely an application issue. Might check session timeout on each to compare. The slowdown over 24-48 hours; can you measure if the slowdown is actually based on 1) time (not likely); 2) number of requests to the server (not likely); or 3) number of requests to that application (also total requests or requests from one user)? I'd expect the latter which would imply it's probably a specific application configuration or coding issue. – user1676075 Sep 04 '14 at 16:28
  • A suggestion: run the slow project on a separate Tomcat instance. That could rule out any interference of shared resources. – vanOekel Sep 04 '14 at 17:26
  • vanOekel : I failed to mention that I have tried running the slow project by itself and the same behavior manifests. I'll update the question with that info. – darcling Sep 04 '14 at 17:44
  • As you have already studied many causes, I can only add a list of other common problem sources : temporary tables and their indexes, sessions (size and number), threads (number and local memory), servlet context attributes (mainly lists or hashes), database connections, and global structures holding references to transient objects. – Serge Ballesta Sep 05 '14 at 07:23
  • Concerning logback: [LOGBACK-268](http://jira.qos.ch/browse/LOGBACK-268) might be relevant (solved in version 1.1.2). I was also thinking that latency might be the problem (see [this answer](http://stackoverflow.com/a/15050971/3080094)). – vanOekel Sep 05 '14 at 19:52

1 Answer


While monitoring threads under heavy load I noticed several of my HTTP APR threads were blocking intermittently. I took multiple thread dumps and traced the resource back to :

ch.qos.logback.core.rolling.RollingFileAppender.subAppend(RollingFileAppender.java:170)
- waiting to lock <0x00000000d02bf8c0> (a ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy)

0x00000000d02bf8c0 is the locked resource, and its class is probably enough to point at the cause, but I dug deeper into the dump for a bit more info about the resource :

"http-apr-8080-exec-119" daemon prio=6 tid=0x0000000014262000 nid=0x1e9c runnable [0x000000002a63b000]
   java.lang.Thread.State: RUNNABLE
    at java.util.zip.CRC32.updateBytes(Native Method)
    at java.util.zip.CRC32.update(CRC32.java:65)
    at java.util.zip.ZipOutputStream.write(ZipOutputStream.java:327)
    - locked <0x00000000e6d5c758> (a java.util.zip.ZipOutputStream)
    at ch.qos.logback.core.rolling.helper.Compressor.zipCompress(Compressor.java:110)
    at ch.qos.logback.core.rolling.helper.Compressor.compress(Compressor.java:58)
    at ch.qos.logback.core.rolling.FixedWindowRollingPolicy.rollover(FixedWindowRollingPolicy.java:157)
    at ch.qos.logback.core.rolling.RollingFileAppender.rollover(RollingFileAppender.java:138)
    - locked <0x00000000d02bf750> (a ch.qos.logback.core.spi.LogbackLock)
    at ch.qos.logback.core.rolling.RollingFileAppender.subAppend(RollingFileAppender.java:171)
    - locked <0x00000000d02bf8c0> (a ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy)

And there you have it - logback is blocking threads that attempt to log while it archives/compresses a log file that hit the limit defined by the SizeBasedTriggeringPolicy.

I looked at the config and found that the maximum log file size was set to only 5MB. I actually made the issue worse while researching it, because I had set the appender level to DEBUG and we were generating more log statements, faster. I bumped the maximum file size up considerably and haven't had a problem since. I plan to move to a date-based rolling policy soon, since that's what I've always used before and it doesn't look like much thought was given to the log configuration on this project to begin with : )
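
Roughly, the date-based setup I have in mind looks like the following (a minimal sketch; the file names, paths, and retention are placeholders, not our real config). One detail worth noting: with no .zip/.gz suffix on the fileNamePattern, rollover does not compress anything, so the work done while holding the appender lock should stay small :

<!-- Planned date-based appender (sketch); names, paths, and retention are placeholders. -->
<appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>logs/project2.log</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
        <!-- Rolls once per day; no .zip/.gz suffix, so rollover does not compress. -->
        <fileNamePattern>logs/project2.%d{yyyy-MM-dd}.log</fileNamePattern>
        <maxHistory>30</maxHistory>
    </rollingPolicy>
    <encoder>
        <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
    </encoder>
</appender>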

@user1676075 was on the right track with the shared resources idea but I didn't think about my logging being a resource.

One thing I'm not positive of is why the application couldn't recover after hitting this bottleneck. If it was under continual load, I can see it backing up the thread pool and exhausting all worker threads. But once this situation occurred, the application remained very slow until it was redeployed or the server was restarted. I will need to make sure it doesn't die when it reaches the new, higher file size limit or when I move to a date-based rolling policy.

Thank you for all the suggestions!
