I've got a real head-scratcher of a memory-leak-in-production-on-Azure-app-service-webjob.
This is a background worker process that picks up work both from a queue and on a schedule. About every 10 minutes, memory usage spikes, before (about 10 minutes later) returning to a normal baseline. Each spike is slightly higher than the last. Eventually the spike gets high enough (say >80%) that it takes longer and longer to return to baseline, and finally the process locks up.
I have very detailed logging in place. There are no huge database queries and no large numbers of them, and none of the processing operations takes longer than a few seconds. It's hard to get a very clear trace, since there are maybe 15-30 different operations which happen on a 10-minute cycle.
Anyway. A while ago, while it was in the "maxed-out" phase, I got a full memory dump and loaded it into WinDbg. There were millions of Entity Framework (6) entities of a certain type in memory, despite there only being a few thousand in the database.
I haven't been able to repro locally. So I added some code to the constructor of this entity type: it keeps a `Dictionary<string,long>` of the number of times a certain `Environment.StackTrace` is seen from the constructor. I'm waiting for the "maxing-out" to happen, but connecting remotely, it looks pretty standard/normal at the moment.
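
A minimal sketch of what this kind of constructor instrumentation could look like, for anyone who wants to try the same thing. It is not the exact production code: `TrackedEntity` and `TopConstructionSites` are hypothetical names, and a `ConcurrentDictionary` is swapped in for the plain `Dictionary` on the assumption that entities may be constructed on several threads at once.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the EF6 entity type that showed up in the dump.
public class TrackedEntity
{
    // Shared by all instances. ConcurrentDictionary is an assumption made here
    // so that concurrent constructions do not corrupt the counts.
    private static readonly ConcurrentDictionary<string, long> ConstructionCounts =
        new ConcurrentDictionary<string, long>();

    public TrackedEntity()
    {
        // Environment.StackTrace is expensive to capture, so this is only
        // suitable as a temporary diagnostic.
        string trace = Environment.StackTrace;

        // Insert 1 for a call stack seen for the first time, otherwise increment.
        ConstructionCounts.AddOrUpdate(trace, 1L, (_, count) => count + 1);
    }

    // Returns the most frequently seen call stacks, highest count first.
    public static KeyValuePair<string, long>[] TopConstructionSites(int take)
    {
        return ConstructionCounts
            .OrderByDescending(kv => kv.Value)
            .Take(take)
            .ToArray();
    }
}
```

Logging the result of `TrackedEntity.TopConstructionSites(10)` at the end of each 10-minute cycle would show which call paths create the most instances. Note that EF also invokes the constructor when it materializes query results, so those stacks will appear in the counts too.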
Even if these objects /are/ accumulating over time, that wouldn't explain the increasing spikes and the return to baseline, though. Would it?
I've also just captured a full memory dump during a "baseline" and then a "small spike" (only been running a few hours, as per the image). I have rudimentary WinDbg skills.
Anyway, my questions / causes of confusion:
- How can I determine the difference between two full memory dumps?
- Has anyone seen anything like this before?
- What could cause spikes in memory to GROW each time?
- If it IS a memory leak, why would it ever return to baseline between spikes?
I'd like to think there's no magic going on, but I simply can't find a thing that coincides with the spike:
- The number of database records increases gradually, but is only a few thousand, and the memory issue resets if the process is restarted
- No operation seems to take more than a few seconds, as per logging, despite spikes lasting ~10 mins