We have a Spring Boot application that runs on 20 servers behind a load balancer that distributes requests across them. Since last week we have had serious problems with CPU usage (100% on all VMs, almost simultaneously) without any noticeable increase in incoming requests.

Before that, we ran on 8 VMs without any issues. At peak hours we have 3-4 thousand users and 15-20k requests per 15 minutes.

I am sure it is related to heap usage, since all of the CPU usage comes from the GC trying to free up memory.
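One way to confirm that the CPU time really goes to the GC is to sample the JVM's own collector counters. The following is a minimal sketch, assuming a HotSpot-style JVM that exposes `GarbageCollectorMXBean`s; the class name `GcOverheadSampler` and the one-second sampling interval are made up for illustration, and the reported figure is only an approximation because collection time is elapsed time, not CPU time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcOverheadSampler {

    /** Sum of accumulated collection time (ms) over all collectors. */
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Share of a wall-clock interval spent in GC, as a percentage. */
    static double gcOverheadPercent(long gcBeforeMs, long gcAfterMs, long intervalMs) {
        if (intervalMs <= 0) {
            return 0.0;
        }
        return 100.0 * (gcAfterMs - gcBeforeMs) / intervalMs;
    }

    public static void main(String[] args) throws InterruptedException {
        long before = totalGcTimeMillis();
        long start = System.nanoTime();
        Thread.sleep(1000); // hypothetical sampling interval
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        long after = totalGcTimeMillis();
        System.out.printf("GC overhead: %.1f%%%n",
                gcOverheadPercent(before, after, elapsedMs));
    }
}
```

Run inside the application (for example from a scheduled task), a value that stays close to 100% during an incident would support the GC theory; a low value would point elsewhere.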

For now, we have isolated the requests that we suspected of causing the problem to specific VMs, and those VMs remain stable even under traffic. (In the picture below you can see that the old generation memory is stable on those VMs.)

The heap memory looked something like this: [image] [image]

The memory keeps increasing, and there are two scenarios: either it reaches a certain point and, after about 10 minutes, drops on its own to 500 MB, or it stays at that level, causes 100% CPU usage, and never recovers.
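To tell the two scenarios apart without a profiler attached, the old generation can be read from inside the JVM, both as it is now and as it was right after the last collection. This is a minimal sketch, assuming HotSpot pool names such as "G1 Old Gen", "PS Old Gen" or "Tenured Gen" (other collectors may name their pools differently); `OldGenWatcher` is a hypothetical class name.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

public class OldGenWatcher {

    /** Assumption: HotSpot names its old generation pool "... Old Gen" or "Tenured Gen". */
    static boolean isOldGenPool(String poolName) {
        return poolName.contains("Old Gen") || poolName.contains("Tenured");
    }

    /** Used bytes as a percentage of the maximum, or -1 if the maximum is undefined. */
    static double occupancyPercent(long usedBytes, long maxBytes) {
        if (maxBytes <= 0) {
            return -1.0;
        }
        return 100.0 * usedBytes / maxBytes;
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP || !isOldGenPool(pool.getName())) {
                continue;
            }
            MemoryUsage now = pool.getUsage();               // current occupancy
            MemoryUsage afterGc = pool.getCollectionUsage(); // occupancy after the last GC, may be null
            if (now != null) {
                System.out.printf("%s now: %.1f%%%n", pool.getName(),
                        occupancyPercent(now.getUsed(), now.getMax()));
            }
            if (afterGc != null) {
                System.out.printf("%s after last GC: %.1f%%%n", pool.getName(),
                        occupancyPercent(afterGc.getUsed(), afterGc.getMax()));
            }
        }
    }
}
```

If the after-GC figure stays high, the objects are still reachable (a retention problem); if it drops sharply while the current figure keeps climbing, the application is more likely just producing garbage faster than it can be collected.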

From the heap dumps we have taken, it seems that most of the memory is allocated in char[] instances. [image] [image] [image]

We have used a variety of tools (VisualVM, JProfiler, etc.) to try to debug the issue, without any luck. I don't know whether I am missing something obvious or whether it is something else.

I also tried changing the GC algorithm from the default to G1 and disabling the Hibernate query plan cache, since many of our queries use the IN parameter for filtering. [image]
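The reason IN parameters matter for the query plan cache is that every distinct list size expands to a different query string, and each string gets its own cache entry. Hibernate has a setting for this, `hibernate.query.in_clause_parameter_padding`, which pads the bind parameter list to the next power of two. The sketch below only illustrates the idea in plain Java and is not Hibernate's implementation; the `Order` entity, the query text and the class name `InClausePaddingSketch` are hypothetical, and whether this is the cause of the char[] growth here is an assumption that would need checking against the heap dump.

```java
import java.util.HashSet;
import java.util.Set;

public class InClausePaddingSketch {

    /** Smallest power of two that is >= n (valid for 1 <= n <= 2^30). */
    static int paddedSize(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        int highest = Integer.highestOneBit(n);
        return highest == n ? n : highest << 1;
    }

    /** Expanded query text for a given number of bind parameters (hypothetical query). */
    static String expand(int paramCount) {
        StringBuilder sb = new StringBuilder("select o from Order o where o.id in (");
        for (int i = 0; i < paramCount; i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append('?');
        }
        return sb.append(')').toString();
    }

    /** Number of distinct query strings produced for list sizes 1..maxListSize. */
    static int distinctQueries(int maxListSize, boolean padded) {
        Set<String> seen = new HashSet<>();
        for (int n = 1; n <= maxListSize; n++) {
            seen.add(expand(padded ? paddedSize(n) : n));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println("unpadded: " + distinctQueries(1000, false));
        System.out.println("padded:   " + distinctQueries(1000, true));
    }
}
```

For list sizes from 1 to 1000 this yields 1000 distinct strings without padding and 11 with padding, which is why padding can shrink the plan cache considerably when IN lists vary in length.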

UPDATE

We managed to reduce the number of requests to our most heavily used API call, and the old gen now looks like this. Is that normal?

[image]

uses134
  • 1. I would look into what changed right before the problem started. Compare source code diffs, try to roll back the changes, and see which commit reproduces the issue. 2. Is it feasible to try another JVM? For example, if you use Oracle, you could see whether OpenJDK behaves differently. That would indicate it is somehow connected to the garbage collector. If so, you should look into the application code and see how memory usage could be optimized. – D-FENS Dec 01 '21 at 09:05
  • @roccobaroccoSC We did the first one. Unfortunately, it is impossible to roll back the changes that were made; however, they were minor and we cannot see why they would impact the application. We refactored them anyway, but no luck. I forgot to mention that a week before the problem started, the number of requests doubled. So could we have always had the problem, but never had enough traffic for it to show? If that makes sense. I will try switching 3 or 4 VMs to OpenJDK to check your second point. – uses134 Dec 01 '21 at 09:18
  • Are you reading a lot of files or doing a lot of string manipulation in your code? If so, check that. Also make sure you haven't implemented your own caching mechanism somewhere, and if you are using caches, make sure they are tuned. – M. Deinum Dec 01 '21 at 09:32
  • @uses134: If you can rule out the code changes for sure, then the problem must have always been there, you just ran into a bottleneck. In this case try to add more servers and see if this resolves your issue. Also, make sure each node's RAM is fully utilized. The JVM does not automatically take all the machine's memory. You need to set the JVM memory limit from the command line: https://stackoverflow.com/questions/2294268/how-can-i-increase-the-jvm-memory – D-FENS Dec 01 '21 at 09:40
  • I also wonder whether you really disabled the query plan cache (which property/properties did you use?). You probably want to tune the size, etc., of that cache. I also wonder whether you have other caching going on (like query caching?). – M. Deinum Dec 01 '21 at 09:41
  • @M.Deinum We had a lot of Excel and CSV exports that we moved to the VMs that are apparently stable, so the problem is not coming from there. We are using -Xms8g and -Xmx8g for VMs that have 10 GB of RAM, and 10g for those that have 12 GB. – uses134 Dec 01 '21 at 09:45
  • @roccobaroccoSC We are using -Xms8g and -Xmx8g for VMs that have 10 GB of RAM, and 10g for those that have 12 GB. – uses134 Dec 01 '21 at 09:46
  • @uses134: Then increase the number of servers. Maybe you hit your limit and need to scale up. – D-FENS Dec 01 '21 at 09:47
  • @M.Deinum We didn't have any cache enabled. We had some dead code that should not have had any impact; however, we cleaned that up as well. The properties we used for disabling the cache are: hibernate.cache.use_second_level_cache: false, hibernate.cache.use_query_cache: false, -Dspring.jpa.properties.hibernate.query.plan_cache_max_size=64, -Dspring.jpa.properties.hibernate.query.plan_parameter_metadata_max_size=3 – uses134 Dec 01 '21 at 09:48
  • If those exports were inefficient, then with double the number of requests (not sure if that triggers the exports) that could actually be an issue. If it isn't related to the changes, it is due to your load: some code that was inefficient with regard to memory is now producing more garbage, which cannot be collected in a timely manner. What I do find weird is that adding servers didn't resolve it, as one would expect the load to spread out. – M. Deinum Dec 01 '21 at 09:49
  • @M.Deinum We isolated those requests to specific VMs, and memory on those VMs is pretty much stable (it increases while handling the export requests but drops immediately after the job finishes). There was no improvement on the rest of the VMs, even though their traffic is lower. – uses134 Dec 01 '21 at 10:22
  • Is it possible to have a test environment running the old and the "refactored" codebases to make sure it wasn't the refactoring? What kind of requests are being sent to the VMs that are bottlenecking? Also, how are you managing multithreading? Might this be an issue of requests not completing because too many are coming in? – SirHawrk Dec 01 '21 at 11:00
  • Did you by any chance upgrade the JDK? Also, which JDK version are you using on your VMs? I ask because there is apparently a bug in the JDK (https://bugs.openjdk.java.net/browse/JDK-8277981). – M. Deinum Dec 02 '21 at 08:30
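
Several comments above ask which JDK is in use and whether the heap limits are actually applied. A quick way to answer both from the running process, rather than from the start scripts, is to print what the JVM itself reports. This is a minimal sketch using only standard JDK calls; the class name `JvmInfo` is hypothetical.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

public class JvmInfo {

    /** Maximum heap the JVM will attempt to use, in MiB (reflects -Xmx when it is set). */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    /** Names of the active collectors, which reveal the GC algorithm in use. */
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("java.vendor  = " + System.getProperty("java.vendor"));
        System.out.println("java.vm.name = " + System.getProperty("java.vm.name"));
        System.out.println("max heap MiB = " + maxHeapMiB());
        System.out.println("collectors   = " + collectorNames());
        // Flags the JVM was actually started with, e.g. -Xms, -Xmx and -D properties
        System.out.println("jvm args     = "
                + ManagementFactory.getRuntimeMXBean().getInputArguments());
    }
}
```

Comparing this output across a stable VM and an affected one would show whether they differ in JDK build, collector, or effective heap size.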
