
We have a major challenge that has been stumping us for months now.

A couple of months ago, we took over the maintenance of a legacy application; the last developer to touch the code left the company several years ago.

This application needs to be more or less always online. It was developed many years ago, without staging or test environments and without a redundant infrastructure setup.

We're dealing with a legacy Java EJB application running on the Payara application server (a GlassFish derivative) on an Ubuntu server.

Within the last year or two, it has been necessary to restart Payara approximately once a week, and the Ubuntu server once a month.

This is due to a memory leak that slows down the application over a period of around a week. The GUI becomes almost entirely unresponsive, but a restart of Payara fixes this, at least for a while.

However, after each Payara restart there is still some kind of residual memory use. The baseline memory usage increases, which shortens the time between Payara restarts. Roughly once a month we therefore do a full Ubuntu reboot, which fixes the issue.
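A low-overhead way to watch this (a sketch; the `pgrep` pattern is an assumption about how the Payara JVM appears in the process list) is to track both the OS-level resident set size and the JVM's own heap counters:

```
# Placeholder lookup: assumes a single Payara JVM on the box.
PID=$(pgrep -f payara)

# OS view: the resident set size that creeps up across Payara restarts.
ps -o pid,rss,etime,cmd -p "$PID"

# JVM view: heap occupancy and GC activity, sampled every 10 seconds.
# jstat ships with the JDK and adds very little overhead.
jstat -gcutil "$PID" 10000
```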

Naturally we want to find the memory leak, but we are unable to run a profiler on the server: it is too resource-intensive, and it would need to run for several days in order to capture the leak.

We have also tried several times to dump the heap using the `gcore` command, but it always results in a segfault, after which we need to reboot the Ubuntu server.
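(For reference, the invocation is roughly the following; the output path and `pgrep` pattern are illustrative. The plan was to copy the core file to another machine and run jmap against it there:)

```
PID=$(pgrep -f payara)                 # placeholder lookup for the Payara JVM
sudo gcore -o /tmp/payara-core "$PID"  # this is the step that segfaults for us

# Intended follow-up on another machine (JDK 8 jmap can read core files;
# on JDK 9+ the equivalent is `jhsdb jmap --binaryheap --exe ... --core ...`):
#   jmap -dump:format=b,file=heap.hprof $JAVA_HOME/bin/java /tmp/payara-core.<pid>
```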

What other options / approaches do we have to figure out which objects in the heap are not being garbage collected?

Morten Kirsbo
  • Have you tried [jmap](https://stackoverflow.com/q/15130956/2541560)? – Kayaman Apr 06 '22 at 17:51
  • Yes, we first tried jmap but it crashes the server. Then we tried doing a core dump with gcore, with the intent of getting a core dump file, moving it to another server and running jmap there. But running gcore gives a segfault, a subsequent server crash, and it generates only an unusable core file. – Morten Kirsbo Apr 07 '22 at 08:23
  • So if you take a heap dump with `jmap` after a clean restart, it crashes the server? I think you'll need to stabilize your environment first before you attempt to solve any memory issues, although maybe those will be fixed once you stabilize your env. – Kayaman Apr 07 '22 at 08:26
  • We haven't tried to do jmap after a clean restart. We've only done jmap after we see the performance being impacted, i.e. after some 3-4 days of the application running. This is because of the assumption that jmap after a clean restart won't show us the memory leak anyway, since it takes days for the leak to present itself (i.e. heap memory use increasing). – Morten Kirsbo Apr 07 '22 at 10:00
  • Don't wait until it's too late to take heap dumps. It seems like you're waiting until the server is on fire before you're doing anything, when you should be constantly monitoring it for any suspicious behaviour. Your problem isn't tooling, it's waiting too long. – Kayaman Apr 07 '22 at 10:36
  • If we take a heap dump after a clean restart, it's not going to show us what objects are accumulating without being garbage collected. And the server crashes when we do jmap even after a couple of days. It's only around day 7 that the application becomes unresponsive, so on day 2-3 it's not on fire, but heap usage has clearly increased. – Morten Kirsbo Apr 07 '22 at 10:43
  • I **know**, but if it crashes after a clean restart, then you might as well forget the idea altogether. You can't analyze it if you can't get diagnostics data out of it, and if you're getting segfaults then maybe the server should be on fire (and replaced with a working one). – Kayaman Apr 07 '22 at 10:50
  • What do you mean by the "server is on fire"? It's working fine in general. What do you propose to check / test? – Morten Kirsbo Apr 09 '22 at 19:57
  • What is frightening me a bit in your question is the part "after each Payara restart, there is still some kind of residual memory use." - So when you stop the application server, something is using more memory than before you started it, and it only goes away after a reboot of the operating system? Can you identify what that is? Maybe some component that the application server is communicating with? – cyberbrain Apr 10 '22 at 13:04
  • Try `jmap -histo <pid>`. A histogram is normally much faster than a full heap dump. It has fewer details, but it might help you to find some clues. – Tharanga Hewavithana Apr 14 '22 at 14:10
  • Does this server connect to any other database or a service? – Tharanga Hewavithana Apr 14 '22 at 14:11

3 Answers


I would try to clone the server in some way onto another system where you can perform tests without clients being affected. It could even be a system with fewer resources, if you want to trigger a resource-based problem.

To be able to observe the memory leak without having to wait for days, I would create a load test, maybe with Apache JMeter, to simulate a week's worth of accesses within a day, or even hours or minutes (I don't know whether the base load is at a level where that is feasible for the server and network infrastructure).
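A minimal headless run could look like this (the test plan name is a placeholder; building a plan that mirrors your real traffic is the actual work):

```
# Non-GUI mode is the recommended way to run JMeter load tests on a server.
# "weekly-load.jmx" stands in for a plan replaying a compressed week of traffic.
jmeter -n -t weekly-load.jmx -l results.jtl -e -o report/
```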

First you could set up the load test to produce a "regular" mix of requests, like the one seen in the wild. Once you can trigger the loss of responsiveness, you can try to find out whether specific requests are more likely than others to cause the leak. (It could also be that some basic component reused in nearly every call contains the leak, in which case you cannot single out "the" leaking call.)

Then you can instrument this test server with a profiler.
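If a classic profiler is too heavy even on the clone, Java Flight Recorder is a low-overhead alternative; a sketch, assuming a JDK where JFR is available (free in OpenJDK 8u262+ and JDK 11+; older Oracle JDK 8 builds need `-XX:+UnlockCommercialFeatures`):

```
PID=$(pgrep -f payara)   # placeholder lookup for the test server's Payara JVM

# Start a continuous recording with the more detailed "profile" settings,
# capped so it cannot fill the disk:
jcmd "$PID" JFR.start name=leakhunt settings=profile maxsize=512m

# ...run the load test, then dump the recording and stop it:
jcmd "$PID" JFR.dump name=leakhunt filename=/tmp/leakhunt.jfr
jcmd "$PID" JFR.stop name=leakhunt
```

The resulting `.jfr` file can then be analyzed offline in JDK Mission Control.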

For another approach (which you could pursue in parallel), you can use a static code analysis tool like SonarQube to scan the source code for typical memory-leak patterns.
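A sketch of such a scan, assuming a SonarQube server reachable at `localhost:9000` and a project key created there (the key, paths and URL are all placeholders):

```
sonar-scanner \
  -Dsonar.projectKey=legacy-ejb \
  -Dsonar.sources=src/main/java \
  -Dsonar.java.binaries=target/classes \
  -Dsonar.host.url=http://localhost:9000
```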

One other idea comes to mind, but it has many preconditions: if you have recorded typical scenarios for the backend calls, if you have enough development resources, and if it is a stateless web application where each call can be inspected more or less individually, then you could set up partial integration tests that simulate the incoming web calls, with database and file access but, if possible, without the application server, and record the increase in heap usage after each call. Statistically you might be able to find the "bad" call this way. (This is something I would try only as a very last option.)
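One rough way to record the heap delta around each simulated call, without touching any application code, is jcmd against whichever JVM hosts the test (placeholder pid lookup; `GC.run` forces a full GC so the histogram reflects live objects only):

```
PID=$(pgrep -f payara)   # or the JVM running your partial integration test

jcmd "$PID" GC.run                      # full GC before the snapshot
jcmd "$PID" GC.class_histogram > before.txt

# ...replay one recorded call here...

jcmd "$PID" GC.run
jcmd "$PID" GC.class_histogram > after.txt

diff before.txt after.txt | head -30    # classes whose live instance counts grew
```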

cyberbrain
  1. Apart from heap dumps, have you tried any real-time application performance monitoring (APM) tool, e.g. AppDynamics or an open-source alternative like https://github.com/scouter-project/scouter?
  2. An alternate approach would be to look for known issues in your existing stack, e.g. Payara issues like https://github.com/payara/Payara/issues/4098, or the Ubuntu patch level you are currently running the app on.

You can use jmap, a command-line tool bundled with the JDK, to inspect the memory of a running JVM. From the documentation:

> jmap prints shared object memory maps or heap memory details of a given process or core file or a remote debug server.

For more information you can see the documentation, or the Stack Overflow question "How to analyse the heap dump using jmap in java".
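Typical usage looks like this (the paths are illustrative; `<pid>` is the target JVM's process id):

```
# Write a binary heap dump; "live" forces a full GC first, so the dump
# contains only reachable objects:
jmap -dump:live,format=b,file=/tmp/heap.hprof <pid>

# A class histogram is far cheaper than a full dump and is often enough
# to spot what is accumulating:
jmap -histo:live <pid> | head -30
```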

There is also a tool called jhat which can be used to analyse the Java heap. From the documentation:

> The jhat command parses a java heap dump file and launches a webserver. jhat enables you to browse heap dumps using your favorite webbrowser. jhat supports pre-designed queries (such as 'show all instances of a known class "Foo"') as well as OQL (Object Query Language) - a SQL-like query language to query heap dumps. Help on OQL is available from the OQL help page shown by jhat. With the default port, OQL help is available at http://localhost:7000/oqlhelp/

See the jhat documentation, or "How to analyze the heap dump using jhat".
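For example (note that jhat ships with JDK 8 and earlier; it was removed in JDK 9, where Eclipse MAT or VisualVM are the usual replacements):

```
# Parse the dump and serve the browser UI on the default port 7000:
jhat /tmp/heap.hprof

# Large dumps need a larger heap for jhat itself; -J passes flags to its JVM:
jhat -J-Xmx4g /tmp/heap.hprof
```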

Arsh Coder