
I'm working with an application that provides a REST API. It handles GET requests only and usually spends up to 100 ms to handle even the heaviest requests.

Recently we started facing a problem: from time to time the heap fills up and a full GC takes a long time, which really affects our clients.

  • Heap size - 5 GB
  • Concurrent requests per second - up to 50
  • Each request consumes up to 1-2 MB (there are a few requests that consume 12 MB of heap)
  • The application uses Parallel GC for both the old and young generations.
  • JDK 1.7

From my point of view, this application should promote some objects into the Old Gen at the very beginning of its work, and after that all new objects should be collected from the Eden or Survivor spaces. Thus a full GC should never happen. However, some objects keep getting promoted into the Old Gen, and that is really confusing to me.
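To illustrate what I mean (this is just an illustrative sketch, not our real code): each request allocates a couple of MB of temporary objects that become unreachable as soon as the response is returned, so a minor GC should be able to reclaim all of them from Eden without any promotion:

    // Illustrative sketch only (not the real application code): per-request work
    // that allocates a few MB of short-lived objects. Everything allocated in
    // handle() is unreachable once it returns, so it should die in Eden.
    public class RequestSketch {
        static String handle(String query) {
            byte[] buffer = new byte[2 * 1024 * 1024];    // ~2 MB temporary buffer
            StringBuilder response = new StringBuilder(); // short-lived intermediate object
            response.append("result for ").append(query).append(", bytes=").append(buffer.length);
            return response.toString();                   // everything above is garbage after return
        }

        public static void main(String[] args) {
            for (int i = 0; i < 1000; i++) {
                handle("q" + i);                          // nothing survives between iterations
            }
        }
    }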

Several additional findings:

  • We enabled the JVM flag -XX:+PrintTenuringDistribution, and according to its output the new tenuring threshold drops to 1 within a few minutes after the application starts, even without significant load.
  • We took a memory dump when the application had consumed almost all of its allowed memory, and according to the MAT analyzer statistics almost all objects in the dump are unreachable (so I can't understand why they get promoted into the Old Gen).
  • Survivor spaces are almost unused. It looks like new objects get promoted into the Old Gen straight from the Eden space (see the jstat sketch right after this list).
  • A minor GC happens every ~10-20 seconds.
  • Increasing the Eden/Survivor spaces didn't help to solve the problem (the tenuring threshold is still 1).
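As a quick way to double-check the survivor usage while a test runs, the generations can be watched with the standard jstat tool (the PID below is a placeholder):

    jstat -gcutil <pid> 1000

If the S0/S1 columns stay near zero while the O (old) column keeps growing after every young collection, that matches the straight-from-Eden promotion described above.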

So, could you please advise:

  • For what reasons does the tenuring threshold become 1 so quickly? [Resolved]
  • What additional steps could I take to avoid promoting new objects into the Old Gen?

Please let me know if you need some additional info.

Edit:

JVM parameters (updated):

-XX:MaxPermSize=512m -Xmx7g -Xmn4g -verbose:gc -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintTenuringDistribution

Application logs:

Desired survivor size 520617984 bytes, new threshold 15 (max 15)
 [PSYoungGen: 3235245K->231763K(3665920K)] 3942774K->939308K(4760064K), 0.0905430 secs] [Times: user=0.31 sys=0.00, real=0.09 secs] 
603.561: [GC
Desired survivor size 521142272 bytes, new threshold 15 (max 15)
 [PSYoungGen: 3369299K->330881K(3684864K)] 4076844K->1038442K(4779008K), 0.1343473 secs] [Times: user=0.51 sys=0.00, real=0.13 secs] 
606.347: [GC
Desired survivor size 506462208 bytes, new threshold 15 (max 15)
 [PSYoungGen: 3507329K->215655K(3685376K)] 4214890K->923233K(4779520K), 0.0925060 secs] [Times: user=0.36 sys=0.00, real=0.09 secs] 
609.084: [GC
Desired survivor size 492306432 bytes, new threshold 15 (max 15)
 [PSYoungGen: 3392103K->213344K(3713536K)] 4099681K->920945K(4807680K), 0.0802729 secs] [Times: user=0.30 sys=0.00, real=0.08 secs] 

Important detail:

Even when I start a performance test with only 1 thread (which consumes ~10 MB per request), the Old Gen keeps growing (chart: Old Gen with 1 thread perf test).

If I trigger a GC manually, it successfully cleans up the memory.

If I make a heap dump, MAT Analyzer (or YourKit) again shows that 80% of the objects are unreachable.
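For reference, a heap dump taken with the standard jmap tool without the live option includes objects that are already unreachable:

    jmap -dump:format=b,file=heap.hprof <pid>

Adding the live option (jmap -dump:live,format=b,file=heap.hprof <pid>) forces a full GC before dumping. If our dumps were taken without live, that would explain why most objects in them show as unreachable, and it would be consistent with the manual GC cleaning things up: the Old Gen appears to be mostly floating garbage waiting for a full collection.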

  • Have you set a pause time goal? The parallel collector does not have one by default – the8472 Jun 16 '16 at 22:26
  • How large are the objects in relation to the survivor spaces? If an object is too large to fit into the available Eden space then it will get promoted straight away, which then sucks as it will not get GC'd until the old space is collected, even though all references to it are gone. For example, REST APIs that create large String objects to contain JSON or XML can hit this problem. – Chris K Jun 16 '16 at 23:47
  • 1
    A related gotcha could also be if string.intern is called often enough with variable data.. Such as an xml string. – Chris K Jun 16 '16 at 23:52
  • 1
    The following article may help you. http://www.javaaddicts.net/blogs/-/blogs/understanding-premature-promotion-and-how-to-avoid-it – Chris K Jun 17 '16 at 09:02
  • @the8472, I added JVM parameters to the question. Please see updates. – evgeniy44 Jun 17 '16 at 11:30
  • @ChrisK, the biggest objects take up to 5 MB, and there are only a few of them that consume so much memory. 99.9% of objects are up to 2 MB. Also, as I mentioned, increasing the survivor ratio didn't help to resolve this problem. – evgeniy44 Jun 17 '16 at 11:32
  • @evgeniy44 that could be your problem then.. how large is the TLAB set to? If you are unsure, have a read of http://apmblog.dynatrace.com/2011/05/11/how-garbage-collection-differs-in-the-three-big-jvms/ to get a feel for how large objects are handled in Java. The quick answer, btw, is that they get tenured immediately, bypassing the survivor spaces. – Chris K Jun 17 '16 at 11:36
  • @evgeniy44 You may also want to play with the following setting -XX:PretenureSizeThreshold=2m (details on these options can be read here http://blog.ragozin.info/2011/09/hotspot-jvm-garbage-collection-options.html) – Chris K Jun 17 '16 at 11:39
  • @evgeniy44 thank you for the GC logs. The line 'Desired survivor size 341835776 bytes' is pretty important given the total heap size has only been set to 4.5Gb. How large did you increase the eden/survivor spaces to previously? Make sure that they are all significantly larger than 0.5Gb each and then rerun the same work load as before. – Chris K Jun 17 '16 at 13:21
  • @the8472, could you please explain what is the purpose of using pause time goal? – evgeniy44 Jun 17 '16 at 13:29
  • @ChrisK, I updated JVM params, so the survivor size should be about ~700MB – evgeniy44 Jun 17 '16 at 17:40
  • @evgeniy44 have you used jconsole or something similar to confirm the actual sizes? Remember the JVM may dynamically resize them. Initially I would go with Peter's suggestion of super sizing the Eden and survivor spaces. What was the result after the change? – Chris K Jun 17 '16 at 17:50
  • @ChrisK, I checked the actual sizes with jconsole. Also, as this issue is reproducible even with a single thread, I think it doesn't make sense to increase the heap size beyond 5GB – evgeniy44 Jun 17 '16 at 20:35
  • @evgeniy44 why would one thread not be able to generate enough to overflow the survivor space? Check the GC log, has the desired survivor size been met? Have you also confirmed that the allocations are not going to String.intern or Unsafe or direct byte buffers and that the object sizes are all below the pretenure threshold/tlab sizes? Java is optimised for lots of small objects (measured in bytes), large objects can cause issues. A large object is any object that does not fit within the TLAB buffer... Which is not very large, I believe that it is measured by default in kb. – Chris K Jun 17 '16 at 21:40
  • If you have checked all of that, and run through the config advice that Peter described then you could consider replicating under different GC algorithms and versions of the JDK. I doubt that you have hit a JVM bug but it could be worth a try once you have exhausted the more obvious cases. The G1 allocator in Java 8 may behave better for you. The most likely explanation though is that the object churn rate is just too high for the app, either in rate of objects or in size. Probably both. An architecture review/use of a profiler would be the best way to verify this. – Chris K Jun 17 '16 at 21:48
  • @evgeniy44 why don't you read the documentation instead? https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/ergonomics.html#sthref12 – the8472 Jun 18 '16 at 01:17
  • @ChrisK, I see that there are tons of factors that may cause this issue. The only thing I want to ask is whether you could provide some useful docs to help me figure out the details and internals of Java GC. – evgeniy44 Jun 20 '16 at 15:04
  • @evgeniy44 to simplify, there is likely to be only one real issue here: the rate of object allocation within this app is greater than the GC's ability to free it while still keeping the objects within the young generations. From there, there are many ways that can lead to that situation, but they all fall into three basic camps: 1) rate of allocation, 2) size of each allocation, and 3) length of time that the object remains reachable. – Chris K Jun 20 '16 at 15:18
  • @evgeniy44 I am currently looking for some links that may help you, while I do that you should consider looking into jclarity (both their tool and their consulting services). If you are working on a commercial product, their fees would be reasonable and they are GC experts (ftr I am not associated with them). On the other hand, if you would like to discuss this in more detail via Skype, then I would be willing to help you that way (no fees). Skype would just be easier to help you to get traction on this issue than chat on StackOverflow. – Chris K Jun 20 '16 at 15:22
  • @evgeniy44 Sun's original GC white paper is an excellent read: http://www.oracle.com/technetwork/java/javase/memorymanagement-whitepaper-150215.pdf. It covers the trade offs to different approaches and gives a really firm understanding of why Sun choice generational GC algorithms and their trade offs (ie the Generational Hypothesis is a very important core concept). – Chris K Jun 20 '16 at 15:27
  • A few covering TLABs (mostly for background reading): http://stackoverflow.com/questions/26351243/allocations-in-new-tlab-vs-allocations-outside-tlab and https://blogs.oracle.com/jonthecollector/entry/the_real_thing and http://stackoverflow.com/questions/24618467/size-of-huge-objects-directly-allocated-to-old-generation – Chris K Jun 20 '16 at 16:14
  • @evgeniy44 just to capture what we discussed on Skype: our hypothesis is that GC can cope under lower loads. However, when the load passes a threshold, GC starts to fall behind; when this happens, premature tenuring occurs. Our initial challenge is to prove this. We can indirectly verify it by adjusting the amount of load that we are using to break the app. Start by reducing the load until the app stops failing, then increase the load back up to the point where it just starts to fail. This tipping point is useful as we can use it to measure the impact of GC config settings. – Chris K Jun 20 '16 at 17:43
  • From our Skype discussion you told me that Eden is 3.5GB and S1 and S2 are 400MB each. When you changed the size of the young generation the app still failed, and this is to be expected. The question is now: by increasing the sizes of S1 and S2, does it move the tipping point where the load becomes too much for the JVM to cope with? If yes, then we are facing a problem where we have overloaded the GC algorithm. Larger young generations will shift the threshold where the app will fail, but will not resolve the underlying problem. We can discuss how to resolve it after we have verified this. – Chris K Jun 20 '16 at 18:00
  • We have a similar problem - objects are constantly being promoted because young gen collections happen very often - every 4-8 seconds. So it is enough for an object to survive for 1 minute to get promoted. – Imaskar Sep 28 '19 at 22:09
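Following up on the large-object discussion in the comments above, here is a minimal probe that could be run (sizes are illustrative, roughly matching the heaviest ~10 MB requests) to see whether big short-lived buffers alone make the Old Gen grow; the generations can be watched with jstat or jconsole while it runs:

    // Illustrative experiment only: allocate buffers comparable to the biggest
    // responses in a loop. Every buffer is garbage after its iteration, so any
    // steady Old Gen growth would point at large allocations being tenured directly.
    public class LargeAllocationProbe {
        public static void main(String[] args) throws InterruptedException {
            long sum = 0;
            for (int i = 0; i < 10000; i++) {
                byte[] response = new byte[10 * 1024 * 1024]; // ~10 MB, like a heavy request
                response[i % response.length] = (byte) i;     // touch the array so it is really used
                sum += response[0];
                Thread.sleep(10);                             // rough request pacing
            }
            System.out.println(sum);                          // keep the result observable
        }
    }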

1 Answer


If your tenuring threshold drops to 1 quickly, it sounds like there is some other parameter you are using which is causing this. Apart from the logging options and the maximum heap, I would try removing all other tuning arguments and only add each one back in when you can determine it really helps. The more you set, the more likely you will get some strange behaviour. You could try to force the tenuring threshold, but I suspect this would only hide your problem at best.

Desired survivor size 341835776 bytes, new threshold 1 (max 15)

In an attempt to increase the survivor size (to avoid a full survivor space triggering a full GC), the desired size of just one survivor space is consuming a large part of the young space.

I would try making the young generation much larger. You need an Eden at least as large as your survivor space, so your young generation should be at least 1 GB, ideally several times larger. I would try a 4 GB or 2 GB young generation, depending on how much space you have, e.g.

-Xmn2g -Xmx5g

or

-Xmn4g -Xmx7g

If you have plenty of memory, e.g. 64 GB or more, I would try it much larger, like

 -Xmn24g -Xmx32g

and reduce the young space if it doesn't appear to help.
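One way to confirm the generation sizes the JVM actually settles on after changing -Xmn (a standard JDK tool, not specific to this application):

    jstat -gccapacity <pid>

With the parallel collector, adaptive sizing is on by default, so the runtime Eden/survivor capacities can differ from what the flags suggest; it is worth checking them rather than assuming.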

  • Please see parameters in update to the question. There are no more parameters that could affect the process – evgeniy44 Jun 17 '16 at 11:29
  • I have added to my answer. How much memory do you have? Can you try this on a machine with at least 32 GB free memory? – Peter Lawrey Jun 17 '16 at 12:24
  • Unfortunately I can provide only 5 GB as heap size. Also, please take a look at my last finding (the performance test with a single thread). For a single thread, 2.5 GB of Young Gen space should be enough – evgeniy44 Jun 17 '16 at 17:42
  • 1
    I ran app with -Xmn4g -Xmx7g and threshold became 15(sometimes it changes but it get back quickly to 15). But I still see that new objects get promoted to old gen – evgeniy44 Jun 20 '16 at 15:20
  • 2
    @evgeniy44 This means you have some medium lived objects which live long enough to survive 15 minor collections. You could increase the young space further but in an complex application there always be some objects which live a little longer. Increasing the young space increases the time between collections making it less likely but not impossible to survive the 15 collections. – Peter Lawrey Jun 20 '16 at 18:09