0

I have a .net 4.5 ASP.NET WebAPI application. Deployed in IIS using 1 worker on an 8gig VM with 4 CPUs.

I made changes to it recently (upgraded ServiceStack.Interfaces, ServiceStack.Common, ServiceStack.Redis and a bunch of dependencies) and started noticing that the IIS app pool this app is deployed on recycles about once an hour (give or take a few minutes).

There is nothing in my application logs that show any kind of issues. I collect metrics using telegraf and I do NOT see memory metrics increase at all, as far as all the metrics I look at everything looks absolutely normal and then the app pool recycles.

I looked at the event viewer and filtered the logs by WAS source and see event with ID 5011. Which basically means the IIS worker crash as I understand.

So then I used the DebugDiag and ran it on my local box with the app deployed on my box (I can reproduce the issue locally). It ran for a while and finally got the same event in the event viewer. Looked at the crash analysis logs from DebugDiag and all I see if a bunch of exceptions logged but nothing concrete right before the crash.

At this point I'm not entirely sure what else I can to figure out what's causing the crash so hoping there are more suggestions on what I can do to get more transparency.

What I think is happening is, there is some incompatibility with one of my dependencies and some of the upgraded packages which cause an exception to be thrown which is not handled by anything and crashes the IIS worker.

My application is working perfectly fine, as far as all API endpoints functions wit no issues, memory is NOT increasing, CPU is fine. So as far as I can tell there are no issues upto the crash.

Wondering if anyone knows any tricks to find whats causing the crash and/or handle it, prevent this exception from escaping and crashing the worker.

DKhanaf
  • 365
  • 1
  • 4
  • 18
  • 1
    You should try to get a crash dump with debug diag and then start analysis from this dump. See https://blogs.msdn.microsoft.com/chaun/2013/11/12/steps-to-catch-a-simple-crash-dump-of-a-crashing-process/ – Cyril Durand Dec 09 '19 at 22:59
  • 1
    could you share the event id exception snapshot? In the event viewer, there should have application error along with the 5011 warning. So please check the error message in the event viewer application log.If the error message is related to the clr.dll, please check the crash/ memory leak with windbg and diagnostic:https://stackoverflow.com/questions/18335678/how-to-debug-w3wp-clr-dll-error.This link provides the solution for Event ID 5011:https://learn.microsoft.com/en-us/previous-versions/windows/it-pro/windows-server-2008-R2-and-2008/cc735271(v=ws.10)?redirectedfrom=MSDN – Jalpa Panchal Dec 10 '19 at 07:23
  • for IIS Application Pool Crash and Debug Diag:https://blogs.msdn.microsoft.com/parvez/2016/08/06/iis-application-pool-crash-and-debug-diag/ – Jalpa Panchal Dec 10 '19 at 07:23
  • @JalpaPanchal Ah, so looked up this error from the event log 6D000780 and it looks like this is due to rapid fail protection. Basically it means that there is an exception happening so often that the service shuts itself off. I used the debugdiag but that generated way too much data. I then just subscribed to fist chance exceptions from my app directly and wrote it out to a file and see System.Net.Sockets.SocketException happening quite a bit. Unfortunately not sure where in the code, there is no useful stacktrace in those logs. But that gives me a new thread to pull on. – DKhanaf Dec 11 '19 at 08:24
  • Actually I think I was not setting up DebugDiag right I had a low limit on the stacktrace, just upped it so it should capture the larger stacktrace for each exception. – DKhanaf Dec 11 '19 at 08:56
  • did you capture the full user dump? – Jalpa Panchal Dec 11 '19 at 09:12
  • I can't really capture the user dump because this issue I can't force reproduce, the app has to run for a while and then it happens. If I setup DebugDiag to capture user dump it generates so much data I run out of space on my box way before the issue presents itself. – DKhanaf Dec 12 '19 at 22:43
  • But looking a the first chance exceptions logs it seems like the ServiceStack.Redis driver I added, which has a built in Sentinel Manager is throwing a lot of exceptions and its possible this is what's causing IIS rapid fail protection to trigger. I see a Redis exception then followed by stacktrace that has ServiceStack.Redis.RedisPubSubServer.RunLoop() and every time the exception is thrown the stack trace has more and more of those RunLoop statements until at some point the stack trace has like 50 of those statements and the app crashes. Sounds like some sort of recursive loop to me. – DKhanaf Dec 12 '19 at 22:47

1 Answers1

0

I was able to narrow down with some confidence that the issue lies somewhere within the ServiceStack.Redis RedisPubSubServer. What is the actual issue, I don't know as that would take a lot more time to dig and I've wasted too much time already.

However, piggybacking on some existing code I had (from before ServiceStack supported sentinel) I created a new implementation of the redis client wrapper for the which I call LazySentinelServiceStackClientWrapper; instead of using the built-in sentinel manager, it relies on a custom sentinel provider which I created LazySentinelApiSentinelProvider this implementation attempts to interrogate the available sentinel hosts in random order for master and slave nodes and then I construct a pool using the retrieved read/write and readonly hosts and this pool is used to run the redis operations. The pool is refreshed whenever an error occurs (after a failover). Opposed to the builtin sentinel manager that comes with ServiceStack.Redis which instantiates Redis pubsub server and listens for messages from sentinel whenever configuration changes such as fail-overs occur and updates the managed redis connection pool.

I installed my version of this redis client wrapper into my application has seen no app pool recycle events since (other than the scheduled ones).

enter image description here

Above is the log of app pool recycle events before I disabled the ServiceStack.Redis sentinel manager.

And here's the log of app pool recycle events after installing my new lazy sentinel manager

enter image description here

The first spike is me recycling the app manually and second one is the scheduled 1am recycle. So clearly the issue is solved.

What is the actual reason why the sentinel manager via redis pub sub server is causing IIS rapid fail protection to fire and recycle the app pool I do not know. Maybe someone with much more redis experience and/or IIS experience can attest to that. Also I did not test this in .net core and only tested for a .net 4.5.1 application deployed in IIS but on many different machines including local development machine and beefy production machines.

Finally one last note, that first image which shows all the recycle events, that's on my CI machine which is barely taking any traffic, maybe 1 request every few minutes. So this means the issue is not some memory leak or some resource exhaustion. Whatever the issue is, it happens regardless of traffic, CPU load, memory load, it just happens periodically.

Needless to say I will not be using the builtin sentinel manager at least for now.

DKhanaf
  • 365
  • 1
  • 4
  • 18