
We are running a real-time system in Java. It often runs with relatively large heaps (100+ GB) and serves requests coming from a message queue. Each request must be handled quickly (< 100 ms) to meet the SLAs.

We are experiencing serious GC-related problems: stop-the-world collections (200+ ms) often strike in the middle of a request, causing it to miss its deadline.

One of our developers, who has reasonable knowledge of GCs, spent several days tuning GC parameters and trying different collectors. He eventually came up with a parametrization that we jokingly call "evolved by a genetic algorithm". It lowers the GC pauses, but still falls far short of the SLA requirements.

The solution I am looking for would protect critical sections of code from GC and, after a request is finished, let the GC do as much work as it needs before taking the next request. Occasional pauses outside of requests would be fine, because we have several workers, and a garbage-collecting worker would simply stop asking for requests for a while.

I have some ideas that are silly, ugly, and most probably won't work, but hopefully they illustrate the problem:

  • Occasionally call Thread.sleep() in the receiving thread, praying for the GC to do some work in the meantime,
  • Invoke System.gc() or Runtime.gc() between requests, again hopelessly praying for it to help (see the sketch after this list),
  • Litter the whole code base with hacky patterns like https://stackoverflow.com/a/6915221/1137187.
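To make the second idea concrete, here is roughly what a worker loop would look like (a sketch only; `Request` and `RequestQueue` are hypothetical stand-ins for our actual message-queue client and payload):

    interface Request {}

    interface RequestQueue {
        Request take() throws InterruptedException; // blocks until a message arrives
    }

    final class Worker implements Runnable {
        private final RequestQueue queue;

        Worker(RequestQueue queue) { this.queue = queue; }

        @Override public void run() {
            try {
                while (true) {
                    Request request = queue.take();
                    handle(request);   // must finish in < 100 ms to meet the SLA
                    System.gc();       // idea 2: beg for a collection before the next request
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the flag and exit
            }
        }

        private void handle(Request request) {
            // ... actual request processing ...
        }
    }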

The last important note is that we are a low-budget startup, so commercial solutions such as Zing® are not an option for us; we are looking for a non-commercial solution.

Any ideas? We would rewrite our code entirely in C++ (at the beginning we didn't know that GC might be a problem rather than a solution), but the code base is already too large for that.

Tregoreg
  • Java is certainly not the first language that occurs to me when I hear the term "real-time", and given that Java is chosen, the need for a ginormous heap seems not to bode well. – John Bollinger Oct 26 '16 at 20:55
  • In any event, there are really only two general approaches to a GC problem in a long-running process: (1) reduce the amount of garbage being produced, and (2) make the garbage faster to collect. If full GCs are costly but infrequent, then one alternative might be to greatly reduce the heap size. That will require more frequent GCs, but each one should be faster because there cannot be as much garbage to collect. Also, try to avoid long-lived objects, which are more expensive for a generational GC to collect, unless they are retained for the whole life of the process. – John Bollinger Oct 26 '16 at 20:58
  • Additionally, take care with temporary objects. It's not uncommon for Java coders to rely much more heavily on GC than need be by creating and discarding lots of objects. They may not even recognize they are doing so. String concatenation and autoboxing can contribute here, for example. Primitives have no GC cost, and as a rule of thumb, lower-level APIs produce less garbage. – John Bollinger Oct 26 '16 at 21:05
  • In addition, the only new objects that should be created are the incoming messages. Every object that processes the message should be stateless and a singleton (IoC via Spring is simple). This should minimize the number of objects created/destroyed/GC'd. Also, consider a cluster with many nodes, where each node has a smaller heap. – Andrew S Oct 26 '16 at 21:06
  • With respect to protecting critical code from GC, Java provides no API directly applicable to that objective. If it did, its use would introduce a risk that the critical code would fail with an `OutOfMemoryError` where otherwise it would just be delayed by GC. That would make a much bigger mess. – John Bollinger Oct 26 '16 at 21:17
  • @JohnBollinger Thanks for your notes, seems like we really are producing too much garbage, such as building large `HashMap` objects separately in 8 threads during each request, implying lots of `Double` wrappers instead of primitives, all of which is thrown away in the end. We would call it our responsibility to provide the instance with enough RAM to prevent `OutOfMemoryError`s, but I get your point. Thinking about what you say, do you think it would help to create a shared `HashMap` pool for recycling, for example, or does most of our trouble come from the wrappers? – Tregoreg Oct 26 '16 at 21:36
  • @Tregoreg, almost all of the objects in the graph of a large `HashMap` are associated with the entries; they will become garbage whether you reuse the maps or not. If these are a significant contributor to the problem, then it is possible that coming up with a way to replace them with `double[]`s would help, provided that doing so does not require as many extra objects to be created elsewhere as are saved from the maps. – John Bollinger Oct 26 '16 at 21:55 (a sketch of this replacement follows these comments)
  • A simple thing you can do: post GC logs and see if anyone can spot obvious improvements. – the8472 Oct 26 '16 at 23:09
  • Maybe try a Trove map instead of the default Java `HashMap`: http://trove.starlight-systems.com/overview – Justin Oct 26 '16 at 23:14
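To illustrate the `double[]` replacement suggested above, a minimal sketch (the names, sizes, and the score computation are invented for the example; the real maps are built per thread during each request):

    import java.util.HashMap;
    import java.util.Map;

    final class ScoresExample {
        static final int NUM_ITEMS = 1_000_000;

        // Boxed version: roughly NUM_ITEMS entry objects plus NUM_ITEMS Double
        // (and Integer) wrappers, all of which become garbage after the request.
        static Map<Integer, Double> boxedScores() {
            Map<Integer, Double> scores = new HashMap<>(NUM_ITEMS);
            for (int id = 0; id < NUM_ITEMS; id++) {
                scores.put(id, score(id)); // autoboxes both key and value
            }
            return scores;
        }

        // Primitive version: one flat allocation, no per-entry objects at all.
        static double[] primitiveScores() {
            double[] scores = new double[NUM_ITEMS];
            for (int id = 0; id < NUM_ITEMS; id++) {
                scores[id] = score(id);
            }
            return scores;
        }

        private static double score(int id) {
            return id * 0.5; // placeholder computation
        }
    }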

1 Answer


Any ideas?

Use a different JVM? Azul claims to be able to handle such cases. Red Hat and Oracle are contributing Shenandoah and ZGC to OpenJDK, respectively, with similar goals in mind, so maybe you could try experimental builds if you don't want a commercial solution.
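These collectors sit behind experimental flags; assuming the flag names the experimental builds use, enabling them would look roughly like this (heap size is just an example):

    # Shenandoah in experimental OpenJDK builds:
    java -XX:+UnlockExperimentalVMOptions -XX:+UseShenandoahGC -Xmx100g -jar app.jar

    # ZGC, once it lands in OpenJDK (initially Linux/x64 only):
    java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -Xmx100g -jar app.jar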

There are also other JVMs focused on real-time applications, but as I understand it they target harder real-time requirements on smaller systems; yours sounds more like a soft real-time requirement.

Another thing you can attempt is significantly reducing object allocations (profile your application!) by using pre-allocated objects or more compact data representations where applicable. Reducing allocation pressure while keeping the new-gen size the same increases the mortality rate per collection, which should speed up young collections.
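For instance, a minimal sketch of the pre-allocation idea (capacities and names are invented for the example): each worker thread reuses one scratch buffer instead of allocating fresh structures per request.

    final class ScratchBuffer {
        final double[] values = new double[1 << 20]; // fixed capacity, allocated once
        int size;

        void reset() { size = 0; } // O(1): forget old contents, produce no garbage
    }

    final class Handler {
        // One buffer per worker thread, created on first use and then reused.
        private static final ThreadLocal<ScratchBuffer> SCRATCH =
                ThreadLocal.withInitial(ScratchBuffer::new);

        void handle(double[] message) {
            ScratchBuffer buf = SCRATCH.get();
            buf.reset();
            for (double v : message) {
                buf.values[buf.size++] = v * 0.5; // placeholder computation
            }
            // ... use buf.values[0..buf.size) here; nothing escapes the request ...
        }
    }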

Choosing hardware to maximize memory bandwidth might help too.

Invoke System.gc() or Runtime.gc() between requests, again hopelessly praying for it to help,

This might work when combined with -XX:+ExplicitGCInvokesConcurrent; otherwise it would trigger a single-threaded stop-the-world collection with CMS or G1 (I'm assuming you're using one of those). But that approach seems brittle and requires lots of tuning and monitoring.
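Concretely, that would mean launching with something along these lines (heap size and collector choice are of course application-specific):

    # Let explicit System.gc() calls start a concurrent cycle instead of a full STW collection:
    java -XX:+UseG1GC -XX:+ExplicitGCInvokesConcurrent -Xmx100g -jar app.jar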

the8472
  • I gave the `System.gc()` + `-XX:+ExplicitGCInvokesConcurrent` combination a few tries, but it indeed does not work, mainly because `System.gc()` runs for several minutes. Running a full GC does not seem like an option. And yes, we were trying both CMS and G1. We will have to try alternative JVMs and profiling + refactoring if nothing else helps. However, the more I think about it, the more convinced I am that telling the JVM when to do the collection is exactly what is needed in my case. Tuning GC parameters cannot solve my problem in principle without giving the JVM additional information. – Tregoreg Oct 28 '16 at 14:13
  • Well, you're saying pauses are at 200ms and your SLA is <100ms. So it seems to be within reach if you can squeeze out a factor of 2-3. – the8472 Oct 28 '16 at 14:52