I've got a real head-scratcher of a memory-leak-in-production-on-Azure-app-service-webjob.

[Image: memory usage]

This is a background worker process, taking work from a queue and on a schedule. About every 10 minutes, memory usage spikes, before (about 10 minutes later) returning to a normal baseline. Each spike is slightly higher than the last, until eventually a spike gets high enough (>80%, say), whereupon it takes longer and longer to return to baseline, and finally the process locks up.

I have very detailed logging in place. There are no huge queries or large numbers of database queries, and none of the processing operations takes longer than a few seconds. It's hard to get a very clear trace, since there are maybe 15-30 different operations which happen on a 10-min cycle.

Anyway. A while ago while it was at the "maxed-out" phase, I got a full memory dump, plugged it into WinDbg. There were millions of Entity Framework (6) entities of a certain type in memory, despite there only being a few thousand in the database.

I haven't been able to repro locally. So I added some code to the constructor of this entity type - it keeps a Dictionary<string,long> of the number of times a certain Environment.StackTrace is seen from the constructor. I'm waiting for the "maxing-out" to happen, but connecting remotely, it looks pretty standard/normal at the moment.
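A minimal sketch of what such a constructor counter might look like. `ConstructionTracker` and `MyEntity` are hypothetical names, and a `ConcurrentDictionary` is assumed here instead of a plain `Dictionary<string,long>` on the assumption that entities may be constructed from several threads at once:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: counts how often each call stack reaches a constructor.
public static class ConstructionTracker
{
    private static readonly ConcurrentDictionary<string, long> Counts =
        new ConcurrentDictionary<string, long>();

    // Call this from the entity constructor.
    public static void Record()
    {
        // Environment.StackTrace is expensive, so treat this as a temporary diagnostic.
        Counts.AddOrUpdate(Environment.StackTrace, 1L, (_, n) => n + 1);
    }

    // Returns the most frequently seen call stacks, highest count first.
    public static IReadOnlyList<KeyValuePair<string, long>> Top(int take)
    {
        return Counts.OrderByDescending(kv => kv.Value).Take(take).ToList();
    }
}

// Hypothetical entity standing in for the real EF type.
public class MyEntity
{
    public MyEntity()
    {
        ConstructionTracker.Record();
    }
}
```

Logging `ConstructionTracker.Top(10)` periodically should show which code paths are materialising the most instances.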

Given that these objects /may/ be increasing over time, that doesn't explain the increasing spikes and return to baseline, though. Does it?

I've also just captured a full memory dump during a "baseline" and then a "small spike" (only been running a few hours, as per the image). I have rudimentary WinDbg skills.

Anyway, my questions / causes of confusion:

  1. How can I determine the difference between two full memory dumps?
  2. Has anyone seen anything like this before?
  3. What could cause spikes in memory to GROW each time?
  4. If it IS a memory leak, why would it ever return to baseline between spikes?

I'd like to think there's no magic going on, but I simply can't find a thing that coincides with the spike:

  • The number of database records increases gradually, but is only a few thousand, and the memory issue re-sets if the process is restarted
  • No operation seems to take more than a few seconds, as per logging, despite spikes lasting ~10 mins
Kieren Johnstone

1 Answer

How can I determine the difference between two full memory dumps?

This is quite hard to do in WinDbg. It's much easier with a memory analysis tool such as JetBrains dotMemory, which can import raw dumps, provided they were captured in a format it supports.
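If a profiler is not an option, a rough comparison is still possible by running the SOS command `!dumpheap -stat` against each dump, saving the output to text files (for example with `.logopen`), and diffing the per-type instance counts. The following is only a sketch under those assumptions; `HeapStatDiff` is a hypothetical helper, and the parser assumes the usual `MT Count TotalSize Class Name` column layout:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for comparing two saved "!dumpheap -stat" outputs.
public static class HeapStatDiff
{
    // Matches rows of the form: <MT (hex)> <Count> <TotalSize> <Class Name>
    private static readonly Regex Row =
        new Regex(@"^\s*[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(.+?)\s*$");

    // Sums instance counts per class name; header and total lines do not match and are skipped.
    public static Dictionary<string, long> Parse(IEnumerable<string> lines)
    {
        var counts = new Dictionary<string, long>();
        foreach (var line in lines)
        {
            var m = Row.Match(line);
            if (!m.Success) continue;
            var name = m.Groups[3].Value;
            long existing;
            counts.TryGetValue(name, out existing); // stays 0 if the name is new
            counts[name] = existing + long.Parse(m.Groups[1].Value);
        }
        return counts;
    }

    // Returns (class name, change in instance count), largest growth first.
    public static List<KeyValuePair<string, long>> Diff(
        Dictionary<string, long> before, Dictionary<string, long> after)
    {
        return after.Keys.Union(before.Keys)
            .Select(name =>
            {
                long b, a;
                before.TryGetValue(name, out b);
                after.TryGetValue(name, out a);
                return new KeyValuePair<string, long>(name, a - b);
            })
            .Where(kv => kv.Value != 0)
            .OrderByDescending(kv => kv.Value)
            .ToList();
    }
}
```

Feeding it `File.ReadLines(...)` for each saved output and printing the first few entries of `Diff` should point at the types that grew the most between the two dumps.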

Has anyone seen anything like this before?

Yes.

What could cause spikes in memory to GROW each time? There were millions of Entity Framework (6) entities of a certain type in memory, despite there only being a few thousand in the database.

If you have an O(n²) loop like

foreach(...)
    foreach(...)
        CreateAnObject();

then 1,000 rows in the database may create 1,000,000 objects. If you add just one more row to the database, the same query creates 2,001 more objects the next time it runs (1001² − 1000² = 2001).

If it IS a memory leak, why would it ever return to baseline between spikes?

I would not call the spike behavior a memory leak. It looks quite ok. However, you need to consider that at some point in time, RAM is no longer sufficient and swapping to hard disk occurs. Your application then becomes much slower. Perhaps you can change the algorithm.

However, note that the baseline is not constant:

[Image: Baseline]

So you indeed have a memory leak, but it's not related to the spikes. Instead of comparing a spike to a non-spike, I would compare two baselines.

The number of database records increases gradually, but is only a few thousand, and the memory issue re-sets if the process is restarted

That might be resolved if the baseline leak is fixed.

No operation seems to take more than a few seconds, as per logging, despite spikes lasting ~10 mins

No operation? What counts as an operation for you? A method call? How did you ensure that you measure all method calls? Next time, you might want to add a CPU% graph as well.
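One way to tie individual operations to the memory graph is to record the managed heap size alongside the elapsed time of each one. This is a sketch only; `OperationProbe` and its `Result` type are hypothetical, and `GC.GetTotalMemory` reports the managed heap only, so native allocations will not show up:

```csharp
using System;
using System.Diagnostics;

// Hypothetical wrapper: times one unit of work and records the managed-heap change.
public static class OperationProbe
{
    public sealed class Result
    {
        public string Name;
        public TimeSpan Elapsed;
        public long HeapBytesBefore;
        public long HeapBytesAfter;

        public long HeapDelta
        {
            get { return HeapBytesAfter - HeapBytesBefore; }
        }
    }

    public static Result Measure(string name, Action operation)
    {
        // Passing false avoids forcing a collection, so the measurement
        // observes the heap without disturbing it.
        long before = GC.GetTotalMemory(false);
        var stopwatch = Stopwatch.StartNew();

        operation();

        stopwatch.Stop();
        long after = GC.GetTotalMemory(false);

        return new Result
        {
            Name = name,
            Elapsed = stopwatch.Elapsed,
            HeapBytesBefore = before,
            HeapBytesAfter = after
        };
    }
}
```

Logging `Name`, `Elapsed` and `HeapDelta` for every operation should make it easier to see which of them, if any, line up with the start of a spike. A negative delta simply means a garbage collection ran during the operation.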

Thomas Weller
  • Ah, I have ANTS but not dotMemory, thanks I'll give it a go. I see what you're saying - perhaps there's an underlying pool of entities (in some pinned EF context maybe), and each time it queries the same set, the pool is compared against, and added to. I have looked at some `!gcroot` output for some of the entities hanging around, but it's hard to understand. I'll give dotMemory a go.. – Kieren Johnstone Nov 24 '18 at 12:00
  • Re: an operation, anything that the worker process does is wrapped up in an 'event', which I time the runtime of. CPU usage roughly-speaking spikes in tandem with the memory spikes – Kieren Johnstone Nov 24 '18 at 12:02
  • @KierenJohnstone: do you unregister the event handler correctly? Otherwise it might fire multiple times, calculating the same results over and over again – Thomas Weller Nov 24 '18 at 13:11
  • It's not an actual event handler in the .NET sense, it deserialises an object called an Event, the type of which has a DI-registered handler, if you see what I mean. dotMemory gives me a message that it can't find a mscordacwks.dll with the same precise version on the server (v4.7.3163.00, x64). Older than the one in my microsoft.net\framework64\v4.0.30319 folder, weird. Edit: grabbed from the server itself, here we go.. – Kieren Johnstone Nov 24 '18 at 13:14
  • aaand as with ANTS memory profiler, it's a flaky, crashy, buggy endeavour. Load one dump = ok. Load a second? Goes back to the dashboard. Try again? Takes 10 mins, same thing. Sigh.. – Kieren Johnstone Nov 24 '18 at 13:46
  • Ok, so there are a load of objects kept hanging around, but there are lots of EF contexts, so it's not like one DB context is growing with entities. Any ideas what might be causing the spikes? – Kieren Johnstone Nov 24 '18 at 16:46