
I have a program that runs constantly. During the week it appears to garbage collect normally and stays under about 8MB of memory usage. Every weekend, however, it stops garbage collecting unless I make an explicit call to it, although it will still garbage collect if it nears the maximum heap size. The only reason the issue was noticed at all is that the program actually crashed with an out-of-memory error one weekend, i.e. it must have reached the maximum heap size without running the garbage collector.

The following image is a graph of the program's memory usage over a day. At the sides of the graph you can see the program's normal memory usage, but the first large peak is what seems to start over the weekend. This particular graph is a strange example: after I made an explicit call to the garbage collector, it ran successfully, but then the memory climbed back up to the maximum heap size and was successfully garbage collected on its own, twice.

What is going on here?

EDIT:

Ok, from the comments, it seems I haven't provided enough information. The program simply receives a stream of UDP packets, places them in a queue (with a maximum size of 1000 objects), and then processes them so that their data can be stored in a database. On average it gets about 80 packets per second, but it can peak at 150. It runs under Windows Server 2008.
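Roughly, the structure looks like the sketch below. This is greatly simplified, the port and class names are placeholders, and the real processing writes each packet's data to a database:

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class PacketPipeline {

        // Bounded queue: at most 1000 packets waiting to be processed.
        private final BlockingQueue<DatagramPacket> queue =
                new ArrayBlockingQueue<DatagramPacket>(1000);

        // Receiver thread: reads UDP packets off the socket and queues them.
        void receive() throws Exception {
            DatagramSocket socket = new DatagramSocket(5000); // placeholder port
            while (true) {
                byte[] buf = new byte[1024];
                DatagramPacket packet = new DatagramPacket(buf, buf.length);
                socket.receive(packet);
                queue.offer(packet); // bounded: the packet is dropped if the queue is full
            }
        }

        // Processor thread: takes packets off the queue and stores their data.
        void process() throws InterruptedException {
            while (true) {
                DatagramPacket packet = queue.take();
                storeInDatabase(packet); // stands in for the real database insert
            }
        }

        private void storeInDatabase(DatagramPacket packet) {
            // placeholder for the real persistence code
        }
    }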

The thing is, this activity is fairly consistent, and if anything, at the time the memory usage starts its steady climb, the activity should be lower, not higher. Mind you, the graph I posted above is the only one I have that extends back that far, since I only changed the Java VisualVM wrapper this week to keep graph data far enough back to see it. So I have no idea whether it happens at exactly the same time every week, because I can't watch it over the weekend: it's on a private network, and I'm not at work on the weekend.

Here is a graph of the next day:

This is pretty much what the memory usage looks like every other day of the week. The program is never restarted, and we only tell it to garbage collect on a Monday morning because of this issue. One week we tried restarting it on a Friday afternoon, and it still started climbing sometime over the weekend, so the time that we restart it doesn't seem to have anything to do with the memory usage next week.

The fact that it successfully garbage collects all those objects when we tell it to implies to me that the objects are collectable; it just isn't collecting them until it reaches the maximum heap size or we explicitly call the garbage collector. A heap dump doesn't tell us anything, because when we try to take one, the program suddenly runs the garbage collector and then outputs the heap dump, which of course looks perfectly normal at that point.
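Aside: if I'm reading the jmap documentation correctly, a dump taken without the live suboption includes unreachable objects and shouldn't force a collection first, e.g.

    jmap -dump:format=b,file=heap.hprof <pid>

so that might be worth trying the next time the memory climbs.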

So I suppose I have two questions: why does it suddenly stop garbage collecting the way it does the rest of the week, and why, on that one occasion, was the garbage collection that ran when it reached the maximum heap size unable to collect all those objects (i.e. why were there references to so many objects that one time, when every other time there evidently are not)?

UPDATE:

This morning has been an interesting one. As I mentioned in the comments, the program is running on a client's system. Our contact in the client organisation reports that at 1am, this program failed, and he had to restart it manually when he got into work this morning, and that once again, the server time was incorrect. This is an issue we've had with them in the past, but until now, the issue never seemed to be related.

Looking through the logs that our program produces, we can deduce the following information:

  1. At 01:00, the server somehow resynced its time, setting it back to 00:28.
  2. At 00:45 (according to the new, incorrect server time), one of the message-processing threads in the program threw an out-of-memory error.
  3. However, the other message-processing thread (there are two types of messages we receive; they are processed slightly differently, but both are constantly coming in) continued to run, and as usual the memory usage kept climbing with no garbage collection (again, as seen from the graphs we have been recording).
  4. At 00:56, the logs stop until about 7am, when the program was restarted by our client. The memory usage graph for this period, however, still shows a steady climb.

Unfortunately, the change in server time makes the times on our memory usage graph unreliable. However, it seems that the program tried to garbage collect, failed, increased the heap space to the maximum available size, and killed that thread, all at once. Now that the maximum heap space has increased, it's happy to use all of it without performing a major garbage collection.

So now I ask this: if the server time changes suddenly like it did, can that cause a problem with the garbage collection process?

Ogre
  • java vendor, version and operating system info would definitely help (also any command-line options you provide when running your program) – barti_ddu Nov 29 '10 at 23:31
  • The picture shows a Monday, which isn't the weekend. In any case, I don't think there is enough information. I can tell you that typically (JLS specified?) an OOM will not happen if there is enough memory that can be freed, so if you are seeing a climb in memory usage ending in an OOM, then that memory just isn't reclaimable. – Tim Bender Nov 29 '10 at 23:32
  • you might want to look at the dump it creates when it goes out of memory; this has been helpful for us in the past. Add -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath=/path/to/dump. It will create an hprof file that you can debug with Eclipse MAT (Memory Analyzer Tool) – mezzie Nov 29 '10 at 23:42
  • Also, it is unlikely we can guess the reason without any knowledge of your application. Does it create a large number of objects? Does it use external libraries (for instance, there may be a memory leak in a C library wrapper)? Does it do intensive I/O? Does it run in client or server mode? – barti_ddu Nov 29 '10 at 23:45
  • To me, there are two sides: what is your program doing? as well as: what is the JVM doing? By that I mean: does the JVM act differently w.r.t. garbage collection on weekends? Does your incoming stream of UDP packets maintain a steady throughput through the whole week? Are the contents of the packets meaningfully different on weekends? In the absence of further detail, I might suggest trying a different vendor's JVM and seeing if it does something similar. Move the computer's time back a few days and see if it thinks it's a weekend and behaves like you think it does. – weiji Nov 30 '10 at 23:54
  • Unfortunately, we can't change the server time. This program is part of a bigger system where the current system time must be correct, and it's actually on a client's server, on their private network, so we can't get the UDP packets to our own server for testing. But no, the incoming stream of UDP packets doesn't really change, and the contents of the packets always mean the same thing. Actually, what I will do is write a dummy data source that simulates the real data for local testing, and then change the local system time to see what happens. – Ogre Dec 01 '10 at 05:25

3 Answers


The only reason the issue was noticed at all is that the program actually crashed with an out-of-memory error one weekend, i.e. it must have reached the maximum heap size without running the garbage collector.

I think your diagnosis is incorrect. Unless there is something seriously broken about your JVM, the application will only throw an OOME after it has just run a full garbage collection and discovered that it still doesn't have enough free heap to proceed*.

I suspect that what is going on here is one or more of the following:

  • Your application has a slow memory leak. Each time you restart the application, the leaked memory gets reclaimed. So, if you restart the application regularly during the week, this could explain why it only crashes on the weekend.

  • Your application is doing computations that require varying amounts of memory to complete. On that weekend, someone sent it a request that required more memory than was available.

Running the GC by hand is not actually going to solve the problem in either case. What you need to do is to investigate the possibility of memory leaks, and also look at the application memory size to see if it is large enough for the tasks that are being performed.

If you can capture heap stats over a long period, a memory leak will show up as a downwards trend over time in the amount of memory available after full garbage collections (that is, the height of the longest "teeth" of the sawtooth pattern). A workload-related memory shortage will probably show up as an occasional sharp downwards trend in the same measure over a relatively short period of time, followed by a recovery. If you see both, then both things could be happening.
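A low-overhead way to capture those stats, with no monitoring connection needed over the weekend, is GC logging. Assuming a HotSpot JVM of that era, adding something like the following to the existing launch command writes a timestamped record of every collection (the log file name is up to you):

    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log

The heap occupancy reported after each full collection is the figure to trend from day to day.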

* Actually, the criteria for deciding when to give up with an OOME are a bit more complicated than this. They depend on certain JVM tuning options, and can include the percentage of time spent running the GC.

FOLLOWUP

@Ogre - I'd need a lot more information about your application to be able to answer that question (about memory leaks) with any specificity.

With your new evidence, there are two further possibilities:

  • Your application may be getting stuck in a loop that leaks memory as a result of the clock time-warping.

  • The clock time-warping may cause the GC to think that it is taking too large a percentage of run time and trigger an OOME as a result. This behaviour depends on your JVM settings.

Either way, you should lean hard on your client to get them to stop adjusting the system clock like that (a 32 minute timewarp is way too much!). Get them to install a system service to keep the clock in sync with network time hour by hour (or more frequently). Critically, get them to use a service with an option to adjust the clock in small increments.
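On Windows Server 2008 (which the question mentions), the built-in Windows Time service can be pointed at an NTP source; as a rough sketch, run from an elevated prompt, where the peer name is only an example:

    w32tm /config /manualpeerlist:"pool.ntp.org" /syncfromflags:manual /update
    w32tm /resync

Whether the service slews the clock gradually or steps it for a given offset depends on its phase-correction settings, so that is worth checking as well.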

(Re the 2nd bullet: there is a GC monitoring mechanism in the JVM that measures the percentage of overall time that the JVM is spending running the GC, relative to doing useful work. This is designed to prevent the JVM from grinding to a halt when your application is really running out of memory.

This mechanism would be implemented by sampling the wall-clock time at various points. But if the wall-clock time is timewarped at a critical point, it is easy to see how the JVM may think that a particular GC run took much longer than it actually did ... and trigger the OOME.)
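For the record, in HotSpot this mechanism is the "GC overhead limit". If I remember the flags correctly, the relevant options (shown here with their default values) are:

    -XX:+UseGCOverheadLimit
    -XX:GCTimeLimit=98
    -XX:GCHeapFreeLimit=2

i.e. an OOME is thrown if more than 98% of the total time is being spent in garbage collection and less than 2% of the heap is being recovered. Turning the check off with -XX:-UseGCOverheadLimit can be a useful diagnostic, but it only masks whatever the real problem is.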

Stephen C
  • Sorry, I didn't respond to your answer because I wasn't sure how to respond at the time. But now I have to ask: how can it be a memory leak, when the memory usage never goes above 8MB until that time at approximately 1am on Monday morning? The program is not written to do anything differently at that time. – Ogre Dec 06 '10 at 00:11
  • Well, I would say that the second bullet point in your follow-up is correct, although it doesn't explain why the memory usage keeps climbing if it does not trigger the OOME, and only garbage collects when it does reach its maximum heap size. Still, I'm happy with doing something to ensure that the server time is correct, since the rest of the system requires that time to be correct anyway, due to time-sensitive data. – Ogre Dec 13 '10 at 02:20
  • @Ogre - that's easy to explain: it is simply the way that all mark/sweep and copying GC works. GC runs most efficiently when you have the maximum amount of collectable garbage. That happens when there is no free space left in the heap you are trying to allocate from. The result is that memory usage follows a sawtooth pattern. – Stephen C Dec 13 '10 at 05:08

If possible, I would set up the process to dump the heap if it runs out of memory, so you can analyze it if (when) it happens again. Not an answer, but a potential route to a solution.

Here are the JVM options, taken from Oracle's Java HotSpot VM Options page (this assumes you have an Oracle JVM):

-XX:HeapDumpPath=./java_pid<pid>.hprof

Path to directory or filename for heap dump. Manageable. (Introduced in 1.4.2 update 12, 5.0 update 7.)

-XX:-HeapDumpOnOutOfMemoryError

Dump heap to file when java.lang.OutOfMemoryError is thrown. Manageable. (Introduced in 1.4.2 update 12, 5.0 update 7.)
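The listing above shows the boolean flag in its default (disabled) form; to actually enable the dump, turn it on with a +. A launch line might look something like this, where the dump directory and jar name are placeholders:

    java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=C:\dumps -jar yourapp.jar

You can then open the resulting .hprof file in a tool such as Eclipse MAT or VisualVM.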

Joshua McKinnon

Ok guys, thanks for all your help. The correct answer, however, turned out to have nothing to do with the program itself.

It seems that at the time the memory usage started its steady climb, the server was synchronising its time from somewhere internal, though our client's IT contact has no idea where. Obviously, wherever it was coming from was not a good clock, since the time was half an hour behind. We turned off this synchronisation, and now that I have checked again this morning, the problem has not occurred. So if the time on your system suddenly changes, apparently this causes problems for the garbage collector. At least, that's what it implies to me.

As for why this wasn't occurring in any other parts of our system on this server (which are also written in Java), we probably simply haven't noticed, as they don't deal with as large a number of objects, so they would never have hit their out-of-memory state.

I find this strange, since I would have thought that the invocation of the garbage collector would be related entirely to memory usage, and not to the system time at all. Clearly, my understanding of how the garbage collector works is woefully inadequate.

Ogre
  • timewarping causing OOMEs is explicable; see my answer. – Stephen C Dec 12 '10 at 23:49
  • the collector does take the time it's spent running into account, but I'd be surprised if it uses the wall clock for this. Do you have verbose GC logs over the time period? – SimonJ Dec 13 '10 at 00:26