
Today, I wanted to simulate waiting for a long-running blocking process (5 to 30 seconds) from within an AsyncController in an MVC3 web role. To begin, though, I started with just 1 second to get things going. Yes, the wisdom of this is questionable, since the blocking operation cannot currently be run asynchronously on an I/O Completion Port to an external service, but I wanted to see what the performance limit is for this particular situation.

In my web role, I deployed 6 small instances. The only controller was an AsyncController, with two simple methods intended to simulate a 1000ms blocking operation.

The MVC3 web role controller was simply this:

using System.Threading;
using System.Threading.Tasks;
using System.Web.Mvc;

public class MessageController : AsyncController
{
    public void ProcessMessageAsync(string id)
    {
        // Register one outstanding operation, then hand the simulated work off to a task.
        AsyncManager.OutstandingOperations.Increment();
        Task.Factory.StartNew(() => DoSlowWork());
    }

    public ActionResult ProcessMessageCompleted()
    {
        return View("Message");
    }

    private void DoSlowWork()
    {
        // Simulate a 1000 ms blocking operation, then signal that it has finished.
        Thread.Sleep(1000);
        AsyncManager.OutstandingOperations.Decrement();
    }
}

Next, I applied stress to the web role from Amazon EC2. Using 12 servers, I ramped the load up slowly and got close to 550 requests/second. Any attempt to push beyond this was met with apparent thread starvation and subsequent errors. I assume we were hitting the CLR thread limit, which I understand to be 100 threads per CPU. Allowing for some overhead from the AsyncController, an average of 550/6 ≈ 92 requests per second per server for a 1000ms blocking operation seems consistent with that conclusion.
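
As a quick sanity check on that thread-limit assumption, a diagnostic along these lines (just a sketch, not part of the load test itself; the class and method names are illustrative) can log what the thread pool on an instance actually reports:

using System.Diagnostics;
using System.Threading;

public static class ThreadPoolDiagnostics
{
    public static void LogLimits()
    {
        int minWorker, minIo, maxWorker, maxIo, availWorker, availIo;
        ThreadPool.GetMinThreads(out minWorker, out minIo);
        ThreadPool.GetMaxThreads(out maxWorker, out maxIo);
        ThreadPool.GetAvailableThreads(out availWorker, out availIo);

        // Worker threads run queued Tasks; completion port threads service async I/O.
        Trace.WriteLine(String.Format(
            "Worker: min={0} max={1} available={2} | IOCP: min={3} max={4} available={5}",
            minWorker, maxWorker, availWorker, minIo, maxIo, availIo));
    }
}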

Is this for real? I have seen other people report similar numbers, reaching 60 to 80 requests per second per instance with this type of load. The load on this system will consist mainly of longer-running operations, so the 92 requests per second at 1000ms will drop sharply once the 5000ms tasks come online.

Short of routing the requests for the blocking I/O through multiple separate web role front ends to fan this load out to more cores, is there any way to get higher than this apparent limit of 90 or so requests per second at 1000ms block time? Have I made some kind of obvious error here?

Pittsburgh DBA
  • Are you saying that the duration of the blocking process is independent of the current load? Because if it isn't, that's most likely going to be your bottleneck sooner than the number of threads. – svick Sep 23 '12 at 08:51
  • The blocking processes will average 5 to 30 seconds. Some will exceed the value of the load balancer timeout value (which seems to be 4 minutes). This is going to be fun. It looks like I will need a few hundred servers to have any kind of throughput. – Pittsburgh DBA Sep 23 '12 at 15:42
  • Does the real operation actually need to consume a thread for the full time it's operating? Normally you'd try to be async "all the way down" so you're not using/wasting a thread during the operation. Using Task.Delay would be a better choice if the operation doesn't need to consume a thread of the webapp as it runs – James Manning Sep 23 '12 at 15:57
  • In most cases, the real operation will use BeginReceive() on a TopicClient or SubscriptionClient from the ServiceBus namespace. That should free up the thread. I am mainly interested in the capacity of these Azure web role instances with a blocking load. When we have this kind of load, it looks like we may need to fan out to an "array" of web roles to get more cores in on the action. In those cases, it is just a question of who gets blocked, because somebody will on that kind of load. We would much prefer that the front-end API has tons of free capacity, and the bees in the back-end can wait. – Pittsburgh DBA Sep 23 '12 at 16:02
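
To make the BeginReceive() approach from the last comment concrete, here is a rough sketch of how a non-blocking receive could plug into the same AsyncController pattern (the client setup, parameter names, and view wiring are illustrative assumptions, not code from the question):

using System.Web.Mvc;
using Microsoft.ServiceBus.Messaging;

public class MessageController : AsyncController
{
    // Assumed to be created once per role instance; connection details omitted.
    private static readonly SubscriptionClient Client =
        SubscriptionClient.CreateFromConnectionString("<connection string>", "<topic>", "<subscription>");

    public void ProcessMessageAsync(string id)
    {
        AsyncManager.OutstandingOperations.Increment();

        // The wait happens inside the Service Bus client on an I/O callback,
        // so no request or pool thread is parked while a message is pending.
        Client.BeginReceive(ar =>
        {
            BrokeredMessage message = Client.EndReceive(ar);   // may be null on timeout
            AsyncManager.Parameters["messageId"] = message != null ? message.MessageId : null;
            if (message != null) message.Complete();           // settle the peek-locked message
            AsyncManager.OutstandingOperations.Decrement();
        }, null);
    }

    public ActionResult ProcessMessageCompleted(string messageId)
    {
        ViewBag.MessageId = messageId;
        return View("Message");
    }
}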

1 Answer


I'm sorry I have to say this, but you have been misled by all the blogs claiming that simply using Task.Factory.StartNew is the solution to all your problems. Well, it's not.

Load test with Task.Factory.StartNew

Take a look at the following load test I ran on your code (I changed the sleep to 10 seconds instead of 1 second to make it even worse). The test simulates 200 constant users making a total of 2,500 requests. Look at how many requests failed due to thread starvation:

[Screenshot: load test results with Task.Factory.StartNew, showing a large number of failed requests]

As you can see, even though you're using an AsyncController with a Task, thread starvation still happens. Could it be caused by the long-running process?

Load test with TaskCreationOptions.LongRunning

Did you know you can specify whether a task is long-running? Take a look at this question: Strange Behavior When I Don't Use TaskCreationOptions.LongRunning

When you don't use the LongRunning flag, the task is scheduled on a threadpool thread, not its own (dedicated) thread. This is likely the cause of your behavioral change - when you're running without the LongRunning flag in place, you're probably getting threadpool starvation due to other threads in your process.

Let's see what happens if we change 1 line of code:

    public void ProcessMessageAsync(string id)
    {
        AsyncManager.OutstandingOperations.Increment();
        Task.Factory.StartNew(DoSlowWork, TaskCreationOptions.LongRunning);
    }

Take a look at the load test: what a difference!

[Screenshot: load test results with TaskCreationOptions.LongRunning, showing a dramatic improvement]

What just happened?

As you can see, the LongRunning option seems to make a big difference. Let's add some logging to see what happens internally:

    public void ProcessMessageAsync(string id)
    {
        Trace.WriteLine(String.Format("Before async call - ThreadID: {0} | IsBackground: {1} | IsThreadPoolThread: {2} | Priority: {3} | ThreadState: {4}", Thread.CurrentThread.ManagedThreadId, Thread.CurrentThread.IsBackground,
            Thread.CurrentThread.IsThreadPoolThread, Thread.CurrentThread.Priority, Thread.CurrentThread.ThreadState));
        Task.Factory.StartNew(DoSlowWork, TaskCreationOptions.LongRunning);
        AsyncManager.OutstandingOperations.Increment();
    }

    ...

    private void DoSlowWork()
    {
        Trace.WriteLine(String.Format("In async call - ThreadID: {0} | IsBackground: {1} | IsThreadPoolThread: {2} | Priority: {3} | ThreadState: {4}", Thread.CurrentThread.ManagedThreadId, Thread.CurrentThread.IsBackground,
               Thread.CurrentThread.IsThreadPoolThread, Thread.CurrentThread.Priority, Thread.CurrentThread.ThreadState)); 
        Thread.Sleep(10000);
        AsyncManager.OutstandingOperations.Decrement();
    }

Without LongRunning:

Before async call - ThreadID: 11 | IsBackground: True | IsThreadPoolThread: True | Priority: Normal | ThreadState: Background
Async call - ThreadID: 11 | IsBackground: True | IsThreadPoolThread: True | Priority: Normal | ThreadState: Background

With LongRunning:

Before async call - ThreadID: 48 | IsBackground: True | IsThreadPoolThread: True | Priority: Normal | ThreadState: Background
Async call - ThreadID: 48 | IsBackground: True | IsThreadPoolThread: False | Priority: Normal | ThreadState: Background

As you can see, without LongRunning you are actually using threads from the thread pool, which is what causes the starvation. While the LongRunning option works great in this case, you should always evaluate whether you really need it.
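
As a side note: if the simulated operation doesn't actually need to hold a thread for the full duration (as with the 1000 ms sleep here), a timer-based variant avoids both the pool thread and the dedicated LongRunning thread. This is only a sketch of that idea, assuming .NET 4.5 is available for Task.Delay; on .NET 4.0 a System.Threading.Timer would play the same role:

    public void ProcessMessageAsync(string id)
    {
        AsyncManager.OutstandingOperations.Increment();

        // Task.Delay is backed by a timer, so no thread sleeps for the 1000 ms;
        // the continuation only borrows a pool thread long enough to decrement.
        Task.Delay(1000).ContinueWith(_ => AsyncManager.OutstandingOperations.Decrement());
    }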

Note: Since you're using Windows Azure, keep in mind that the load balancer will time out the connection after a few minutes of inactivity.
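
For calls that can legitimately run longer than that, one common workaround (just a sketch; WorkQueue and WorkStatusStore are hypothetical helpers, not part of this answer) is to return 202 Accepted immediately and let the client poll a status URL instead of holding the connection open:

    [HttpPost]
    public ActionResult Submit(string id)
    {
        WorkQueue.Enqueue(id);                          // hand the long-running job off
        Response.StatusCode = 202;                      // Accepted
        return Json(new { statusUrl = Url.Action("Status", new { id }) });
    }

    [HttpGet]
    public ActionResult Status(string id)
    {
        object result = WorkStatusStore.TryGet(id);     // hypothetical lookup
        if (result == null)
            return new HttpStatusCodeResult(202);       // still running; poll again later
        return Json(result, JsonRequestBehavior.AllowGet);
    }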

Sandrino Di Mattia
  • This makes me smile! I was thinking of trying the LongRunning option, but various blogs sort of poo-pooed the solution as some kind of fringe case last resort. My mistake: I should have tested that, also, instead of glossing over it. Many thanks! I will try this. Do you know if this will circumvent the normal thread injection speed obstacle that the thread pool has? At what point will the system have so many threads that context switching becomes too expensive? – Pittsburgh DBA Sep 23 '12 at 15:44
  • Also, yes, this limit on the Azure load balancer is not cool. I suppose nobody can implement a B2B solution where the web call takes longer than 4 minutes? I know I can return a URL for the response that the client can periodically try to retrieve, or I can invoke a callback API on their platform to signal the readiness of the response, but these are legacy clients and they can invoke a GET/PUT/POST/DELETE. The rest of it is way out of scope. This 4 minute thing is no good. Do you know of any way to keep the connection alive longer than this? – Pittsburgh DBA Sep 23 '12 at 15:54
  • This put it through the roof, so to speak. We had a baseline of 550/sec across 6 instances. With this, it went to 1400/sec before having issues. I reduced the blocking load duration to 500ms and it was able to accommodate up to 2100/sec before problems manifested. This is much better. Once we are doing async I/O in the controller, I expect this to go even higher. Thank you! – Pittsburgh DBA Sep 23 '12 at 17:50