
I am investigating an incident in which an OutOfMemoryException was thrown in production by a "traditional" .NET server application. My goal is to interpret a specific portion of the data gathered through Performance Monitor and to get some advice on how to proceed. Let me start with a list of facts:

  1. The process had been running for over 20 days until the crash.
  2. It crashed because an exception of type System.OutOfMemoryException was thrown.
  3. There have been similar incidents in the past, and in each case it likewise took a long time for the application to crash.
  4. The process had been monitored through Performance Monitor using the following counters: # Bytes in all Heaps, % Processor Time, Private Bytes, and Working Set (a sketch of how such counters can be sampled from code follows this list).
  5. We cannot capture any memory dumps in the production environment, and we haven't been able to reproduce the problem.
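(For context on item 4: counters like these can also be sampled from code with System.Diagnostics.PerformanceCounter. The sketch below is purely illustrative and was not part of the original monitoring setup; the instance name "MyServerApp" is a placeholder for the actual Perfmon process instance.)

    using System;
    using System.Diagnostics;
    using System.Threading;

    class CounterSampler
    {
        static void Main()
        {
            // Placeholder: must match the instance name shown in Performance Monitor,
            // typically the executable name without extension ("MyServerApp", "MyServerApp#1", ...).
            const string instance = "MyServerApp";

            var heapBytes    = new PerformanceCounter(".NET CLR Memory", "# Bytes in all Heaps", instance);
            var privateBytes = new PerformanceCounter("Process", "Private Bytes", instance);
            var workingSet   = new PerformanceCounter("Process", "Working Set", instance);
            var cpu          = new PerformanceCounter("Process", "% Processor Time", instance);

            while (true)
            {
                // Rate counters such as % Processor Time need two samples before NextValue is meaningful.
                Console.WriteLine("{0:u}  heap={1:N0}  private={2:N0}  ws={3:N0}  cpu={4:F1}%",
                    DateTime.UtcNow,
                    heapBytes.NextValue(),
                    privateBytes.NextValue(),
                    workingSet.NextValue(),
                    cpu.NextValue());
                Thread.Sleep(1000);
            }
        }
    }

Logging one line per second like this alongside the Perfmon capture makes it easier to correlate the counter values with application log entries around the crash.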

In the first screenshot, you can see the overall behavior of the counters over a span of 7 days. Things are pretty much stable. The second screenshot shows the behavior over the last minute, around the crash. The OutOfMemoryException was logged at 3:13:49 PM.

[Screenshot 1: 7-day behaviour]  [Screenshot 2: Last minute around the crash]

My questions are:

  1. Any ideas what the sudden increase in Working Set means? It was stable overall at around 650 MB, and in 10 seconds it climbed to 1.3 GB.
  2. Should I focus on finding something that triggered the OOM just before the crash, or could it be a cumulative factor? As you've seen, Private Bytes and # Bytes in all Heaps are pretty much stable.

nantito
  • Basically, what is the server app doing? I take it the load did not change significantly at the time. – paparazzo Dec 13 '12 at 19:01
  • It has many clients connected to it locally, through .NET remoting. Mainly, it handles input requests, performs the operations needed to process them (including lots of DB queries) and answers back. As for the workload, it was pretty high the moment it crashed, but this is usual around those hours. The number of clients is not an issue. It's a predictable number - no overflow of client connections whatsoever. – nantito Dec 13 '12 at 19:11
  • Sorry I'm not of more help. If it went 20 days stable then I would focus on the last 10 seconds. To go from 650 MB to 1.3 GB in 10 seconds is significant. That does not mean finding the cause will be easy. Any working collections that may have grown unbounded? – paparazzo Dec 13 '12 at 20:27
  • Any input is welcome! The memory increase affects the Working Set counter, but not Private Bytes. That's one of the answers I'm looking for: what does that Working Set increase mean? I've dealt with other OOM situations in the past, but there has always been a visible memory leak (Private Bytes increase, etc.). – nantito Dec 18 '12 at 21:45
  • Private Bytes generally covers pages that your app has touched. Working Set covers all assets (private bytes, shared DLLs, memory-mapped files, etc.) that have been recently read and/or written. PB is flat, so the pages your app directly touched don't seem to have been the root cause. Do the GC heap sizes echo the flat PB values? Are you loading a new DLL, COM object or something similar maybe? Stack dumps out of the OOM exception? I'd go with the Remote Debugger or an automatic crash dump next - http://blogs.msdn.com/b/dotnet/archive/2009/10/15/automatically-capturing-a-dump-when-a-process-crashes.aspx – stephbu Dec 18 '12 at 22:05
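One way to act on stephbu's auto crash dump suggestion (not necessarily what the linked post describes) is Windows Error Reporting's LocalDumps registry keys, available on Windows Vista / Server 2008 and later. A minimal sketch; the dump folder path is a placeholder and writing to HKLM requires administrative rights:

    using Microsoft.Win32;

    class EnableCrashDumps
    {
        static void Main()
        {
            // Tell Windows Error Reporting to write a dump whenever a process crashes.
            using (RegistryKey key = Registry.LocalMachine.CreateSubKey(
                @"SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps"))
            {
                key.SetValue("DumpFolder", @"C:\CrashDumps", RegistryValueKind.ExpandString); // placeholder path
                key.SetValue("DumpCount", 5, RegistryValueKind.DWord);                        // keep at most 5 dumps
                key.SetValue("DumpType", 2, RegistryValueKind.DWord);                         // 2 = full dump
            }
        }
    }

An unhandled OutOfMemoryException that terminates the process goes through Windows Error Reporting, so a full dump captured this way could then be inspected offline with WinDbg/SOS.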

1 Answer


These kinds of problems are exceedingly difficult to diagnose. It is quite possible that what is happening is not the result of a single condition that triggers the behaviour, but a set of simultaneous conditions.

Here's what we know:

  1. No cumulative problem indicated: If the problem were cumulative, we would expect to see some sign of it over the 20-day period leading up to the event. This does not mean that the preceding period of operation can be ignored. It is possible that some of the conditions that trigger the behaviour are staged and start earlier on. This is something we cannot know with the information available.

  2. Heaps are stable: The Private Bytes measure tells us how much private memory has been committed, which need not all have been touched (contrary to what stephbu suggested). Bytes-in-all-Heaps tells us how much of that memory is currently allocated according to the memory manager (GC). Since both of these are stable, it would seem that the problem isn't necessarily a memory leak. The danger is that we only have 10 seconds of interesting data, and since the GC is usually fairly passive, it isn't clear how accurate those statistics would be (particularly with the wonky working set); a small in-process check is sketched after this list.

  3. Working set indicates thrashing: The Working Set tells us how much physical memory the OS wants to keep paged in to ensure reasonable performance. A growing working set indicates thrashing, and is normally associated with two things:

    • increased allocation rate

    • increased object longevity (often temporary)

    Increased object longevity is not indicated, because the heaps are not showing growth. An increased allocation rate is possible, but the objects involved would still have to be short-lived (since a leak is not indicated).
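Regarding point 2: because the GC counters are only refreshed when collections actually run, it can help to log the GC's own view from inside the process. A minimal illustrative sketch (not taken from the original post); call it from whatever periodic logging the application already has:

    using System;

    static class GcSnapshot
    {
        // Log how many collections have happened per generation, plus the GC's
        // estimate of currently allocated bytes (without forcing a collection).
        public static void Log()
        {
            Console.WriteLine("{0:u}  gen0={1}  gen1={2}  gen2={3}  gcHeap={4:N0} bytes",
                DateTime.UtcNow,
                GC.CollectionCount(0),
                GC.CollectionCount(1),
                GC.CollectionCount(2),
                GC.GetTotalMemory(false));
        }
    }

If the collection counts barely move during the 10-second window, the flat # Bytes in all Heaps figure says very little about what was actually being allocated at that point.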

These observations suggest to me that some rare event (or set of events) is triggering a condition in which there is:

  • a high allocation rate

  • of moderately large objects

  • which are not very long-lived

  • GC is thrashing as a result (the counter sketch below would help confirm this)
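If that is what is happening, the standard .NET CLR Memory counters should light up during the event. A sketch of the extra counters I would log (illustrative only; the instance name is a placeholder):

    using System;
    using System.Diagnostics;
    using System.Threading;

    class GcPressureMonitor
    {
        static void Main()
        {
            const string instance = "MyServerApp"; // placeholder; use the real Perfmon instance name

            // Counters that should spike if a burst of short-lived allocations is driving the GC,
            // plus machine-wide paging to confirm actual thrashing.
            var counters = new[]
            {
                new PerformanceCounter(".NET CLR Memory", "Allocated Bytes/sec", instance),
                new PerformanceCounter(".NET CLR Memory", "% Time in GC", instance),
                new PerformanceCounter(".NET CLR Memory", "Large Object Heap size", instance),
                new PerformanceCounter("Memory", "Pages/sec")
            };

            while (true)
            {
                foreach (var c in counters)
                    Console.Write("{0}={1:N0}  ", c.CounterName, c.NextValue());
                Console.WriteLine();
                Thread.Sleep(1000);
            }
        }
    }

A sustained spike in Allocated Bytes/sec and % Time in GC, together with Pages/sec climbing, would support the allocation-burst/thrashing hypothesis; flat values would point elsewhere.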

There are other reports of these conditions causing OutOfMemoryExceptions, and I'm not entirely certain why it happens. If you are running in a 32-bit environment, then one possible reason is fragmentation of the address space: this can happen when the GC cannot obtain a sufficiently large contiguous region of address space from the OS.
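As a first check on the fragmentation theory, it is worth confirming at runtime that the process really is 32-bit. A trivial sketch (Environment.Is64BitProcess needs .NET 4; IntPtr.Size works on older runtimes too):

    using System;

    class BitnessCheck
    {
        static void Main()
        {
            // 4 => 32-bit process (2-4 GB of address space), 8 => 64-bit process.
            Console.WriteLine("IntPtr.Size = {0} ({1}-bit process)", IntPtr.Size, IntPtr.Size * 8);
            Console.WriteLine("Is64BitProcess = {0}", Environment.Is64BitProcess); // .NET 4 or later
        }
    }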

Another possibility (which I cannot verify) is that the GC asks the OS not to page out parts of the heap it is working on. If the number of locked pages gets high, an out-of-memory condition might result. This idea is almost total speculation, as I do not know enough about the internals of Microsoft's GC implementation.

I don't have a better explanation right now, but I'd definitely welcome one if anyone can provide it.

Finally, you might want to verify that a reasonable GC latency mode is configured. If this were the problem, I think we would have seen an escalation of Bytes-in-all-Heaps -- so it's probably fine.
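For reference, the effective mode can be checked from inside the process via System.Runtime.GCSettings. A minimal sketch (the LatencyMode property needs .NET 3.5 or later; SustainedLowLatency is a .NET 4.5 value):

    using System;
    using System.Runtime;

    class GcModeCheck
    {
        static void Main()
        {
            // Batch, Interactive, LowLatency or (on .NET 4.5) SustainedLowLatency.
            Console.WriteLine("LatencyMode = {0}", GCSettings.LatencyMode);

            // True when the server flavour of the GC is in use.
            Console.WriteLine("IsServerGC  = {0}", GCSettings.IsServerGC);
        }
    }

Server GC itself is enabled with <gcServer enabled="true"/> under <runtime> in the application's configuration file.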

PS

Can you check which variable is indicated by the dashed line in the second chart? If it is processor use, then it is consistent with thrashing. As the need to page in content more frequently increases, disk IO should increase, and (at a certain point) percentage processor use should decline, because everything is waiting for the disk. This is just an extra detail -- even if processor use doesn't decline excessively, thrashing is still a possibility, because parts of the software might still exhibit good locality and be able to make progress.

Kevin A. Naudé
  • Thank you a lot for your detailed input. Indeed, the dashed line is CPU usage, which was really high for a few hours before the OOM (starting with the morning shift). CPU usage started declining rapidly just when the Working Set started increasing. Here's the graph for the day the crash occurred: http://i.imgur.com/JYLlM.jpg – nantito Dec 19 '12 at 14:51
  • Perhaps the last 3 minutes of execution make it more clear: http://i.imgur.com/w3kjT.jpg – nantito Dec 19 '12 at 14:55
  • @nantito Thanks. Those extra charts look to be consistent with what I expected. The sudden drop in CPU use is quite probably due to paging. – Kevin A. Naudé Dec 19 '12 at 15:16