We have a major challenge which have been stumping us for months now.
A couple of months ago, we took over the maintenance of a legacy application, where the last developer to touch the code, left the company several years ago.
This application needs to be more or less always online. It's developed many years ago without staging and test environments, and without a redundant infrastructure setup.
We're dealing with a legacy Java EJB application running on Payara application server (Glassfish derivative) on an Ubuntu server.
Within the last year or two, it has been necessary to restart Payara approximately once a week, and the Ubuntu server once a month.
This is due to a memory leak which slows down the application over a period of around a week. The GUI becomes almost entirely non-responsive, but a restart of Payara fixes this, at least for a while.
However after each Payara restart, there is still some kind of residual memory use. The baseline memory usage increases, thereby reducing the time between Payara restarts. Around every month, we thus do a full Ubuntu reboot, which fixes the issue.
Naturally we want to find the memory leak, but we are unable to run a profiler on the server because it's resource intensive, and would need to run for several days in order to capture the memory leak.
We have also tried several times to dump the heap using "gcore" command, but it always result in a segfault and then we need to reboot the Ubuntu server.
What other options / approaches do we have to figure out which objects in the heap are not being garbage collected?