6

We are running ASP.NET Web API on 3 servers behind HAProxy. HAProxy distributes requests randomly among these 3 instances.

These instances connect to MongoDB, Redis, and some Windows services.

Normally, w3wp.exe uses about 30% CPU on each API server.

From time to time (a few times an hour), one of the API servers starts using a high amount of CPU. At the same time, response times start to increase. They keep rising until HAProxy sees 10000 ms response times and routes requests to the other two servers. All of this happens within 10-20 seconds. After a while, the server returns to its normal state and starts taking requests again. A few minutes later, another server does exactly the same thing, and the cycle repeats.

We are using New Relic, but since the application is a Web API application, we do not get any useful information from it. We monitor all our servers (Redis, MongoDB, and Windows services) for CPU usage, memory usage, network traffic, and I/O, but we do not see any significant load during these outages.

How can we detect the cause behind this application behavior?

Serhat Ozgel
  • 1
    Did you resolve this issue? I am having a similar issue with Web API. For certain calls, the user gets no response and w3wp.exe shows high memory usage. Can you share how you investigated your issue? – Tae-Sung Shin Feb 19 '15 at 18:39

3 Answers

1

A good option is to take a minidump of w3wp.exe while the CPU is high, using something like Process Explorer, and then inspect it with WinDbg or a similar debugger to see what the threads are doing. I have a blog post that walks through the process:

http://www.haneycodes.net/but-it-didnt-happen-in-dev-or-qa/

Haney
0

As DavidH has said, getting a memory dump is a really important step. If you want, I can offer help to read the dump.

Another useful tool is CPU Analyser, which is free: http://samsaffron.com/archive/2009/11/11/Diagnosing+runaway+CPU+in+a+Net+production+application

Another option is to use PerfView.

Yet another option is to use JetBrains dotTrace and attach to the w3wp.exe process.

Aliostad
0

One thing shared between .NET and Java EE is the garbage collector. If your application uses large amounts of memory, the periods of high CPU could be the garbage collector kicking in. I had this problem with .NET 3.5 on IIS 7, running an application that consistently used over a gigabyte per process. A blocking collection basically stops everything while the GC recovers memory for your application.

You can tune the garbage collector and even call it from your code when it makes sense. There are a lot of little strategies you can use.

Another GC problem comes up if you are doing a lot of string processing, for example parsing character strings that arrive through a RESTful web service. This creates many small, short-lived allocations and memory fragmentation, and can cause the GC to spend a lot more time and CPU recovering memory.

It's easy to check whether this is what is going on. Use Task Manager to watch the memory usage and CPU of the process, and compare the memory used when the CPU goes up with the memory used after it goes down again.
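If watching Task Manager by hand is too coarse, the same comparison can be logged from inside the process. The following is only a minimal sketch, not something from the setup described in the question: `GcCpuSampler`, `CpuPercent` and `Run` are hypothetical names, and it assumes you can start a background thread in the application and read its output. It relies only on `GC.CollectionCount`, `GC.GetTotalMemory` and `Process.TotalProcessorTime` from the .NET base class library.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper (not from the original post): logs CPU usage next to
// GC collection counts so the two can be lined up over time.
public static class GcCpuSampler
{
    // CPU used during one interval, as a percentage of all available cores.
    public static double CpuPercent(TimeSpan cpuDelta, TimeSpan wallDelta, int cores)
    {
        if (wallDelta.Ticks <= 0 || cores <= 0) return 0.0;
        return 100.0 * cpuDelta.Ticks / (wallDelta.Ticks * (double)cores);
    }

    public static void Run(int samples, TimeSpan interval)
    {
        Process self = Process.GetCurrentProcess();
        TimeSpan lastCpu = self.TotalProcessorTime;
        DateTime lastWall = DateTime.UtcNow;
        int lastGen0 = GC.CollectionCount(0);
        int lastGen1 = GC.CollectionCount(1);
        int lastGen2 = GC.CollectionCount(2);

        for (int i = 0; i < samples; i++)
        {
            Thread.Sleep(interval);
            self.Refresh(); // discard cached process info before re-reading

            TimeSpan cpu = self.TotalProcessorTime;
            DateTime wall = DateTime.UtcNow;
            int gen0 = GC.CollectionCount(0);
            int gen1 = GC.CollectionCount(1);
            int gen2 = GC.CollectionCount(2);

            // Collections per generation since the previous sample, plus the
            // current managed heap size (without forcing a collection).
            Console.WriteLine(
                "{0:o} cpu={1:F1}% gen0=+{2} gen1=+{3} gen2=+{4} heap={5:N0} bytes",
                wall,
                CpuPercent(cpu - lastCpu, wall - lastWall, Environment.ProcessorCount),
                gen0 - lastGen0, gen1 - lastGen1, gen2 - lastGen2,
                GC.GetTotalMemory(false));

            lastCpu = cpu; lastWall = wall;
            lastGen0 = gen0; lastGen1 = gen1; lastGen2 = gen2;
        }
    }
}
```

For example, `GcCpuSampler.Run(600, TimeSpan.FromSeconds(1))` on a background thread would log ten minutes of one-second samples. Inside IIS you would replace `Console.WriteLine` with whatever logger the application already uses. If the CPU spikes line up with bursts of gen2 collections, the GC is a likely suspect; if the counts stay flat during a spike, look elsewhere (for instance, at the thread stacks in a dump).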

Arjan Tijms
Jack D Menendez
  • The application uses very little memory since it does not store any session and the only thing taking up memory space is locally created instances. – Serhat Ozgel Jun 21 '13 at 05:55
  • The way memory works in .NET is that memory is not recovered until the GC algorithm determines that 'now' would be a good time. The frequency of GC is determined by the settings you use and the rate at which your application allocates and releases memory. Storing stuff in session variables does not really matter. Parsing a lot of strings does matter, because you are creating little chunks of memory that have to be reclaimed and recombined; parsing strings will drive up the amount of time the GC runs, causing high CPU utilization and an unresponsive system. – Jack D Menendez Jun 24 '13 at 16:31