We are running a real-time system in Java. It often uses relatively large heaps (100+ GB) and serves requests coming from a message queue. Each request must be handled fast (<100 ms) to meet the SLAs.
We are experiencing serious GC-related problems: the GC often triggers a stop-the-world collection (200+ ms) in the middle of a request, causing the request to fail.
One of our developers with reasonable knowledge of GCs spent quite some time tuning GC parameters and trying different GCs. After several days, he came up with a parametrization that we jokingly call "evolved by genetic algorithm". It lowers the GC pauses, but is still far from meeting the SLA requirements.
The solution I am looking for is to protect some critical parts of the code from GC and, after a request is finished, let the GC do as much work as it needs before taking the next request. Occasional pauses outside the requests would be OK, because we have several workers, and a worker that is garbage-collecting would simply not ask for requests for a while.
I have some ideas which are silly, ugly, and most probably won't work, but hopefully they illustrate the problem:
- Occasionally call `Thread.sleep()` in the receiving thread, praying for the GC to do some work in the meantime,
- Invoke `System.gc()` or `Runtime.gc()` between requests, again hopelessly praying for it to help,
- Mess the code entirely with hacky patterns like https://stackoverflow.com/a/6915221/1137187.
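To make the second idea concrete, here is a minimal sketch of what such a worker loop might look like. It is not our real code: the class name `GcBetweenRequestsWorker`, the `handle` method, and the in-memory queue are placeholders invented for illustration. `System.gc()` is only a request to the JVM, which may ignore it (for example when started with `-XX:+DisableExplicitGC`), so the sketch also reads the standard `GarbageCollectorMXBean` counters to detect whether a collection happened while a request was being handled.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class GcBetweenRequestsWorker {

    // Total number of collections reported by all collectors in this JVM.
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 if the collector does not report it
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    // Hypothetical stand-in for the latency-critical request handling.
    static int handle(String request) {
        return request.length();
    }

    // Handles everything currently in the queue; returns the number of requests handled.
    static int drain(BlockingQueue<String> queue) {
        int handled = 0;
        String request;
        while ((request = queue.poll()) != null) {
            long before = totalGcCount();
            handle(request);
            if (totalGcCount() != before) {
                // At least one collection completed while the request was in flight.
                System.out.println("GC ran during request: " + request);
            }
            // Between requests: ask for a collection. This is a hint, not a guarantee.
            System.gc();
            handled++;
        }
        return handled;
    }

    public static void main(String[] args) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        queue.add("first");
        queue.add("second");
        queue.add("third");
        System.out.println("handled " + drain(queue));
    }
}
```

Even if the JVM honours the hint, nothing here prevents a collection from starting in the middle of `handle`, which is exactly the part we would like to protect.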
The last important note is that we are a low-budget startup, so commercial solutions such as Zing® are not an option for us; we are looking for a non-commercial solution.
Any ideas? We would rewrite our code entirely in C++ (at the beginning we didn't know that GC might be a problem rather than a solution), but the code base is already too large for that.