
I have about 5 million items to update. I don't really care about the response (A response would be nice to have so I can log it, but I don't want a response if that will cost me time.) Having said that, is this code optimized to run as fast as possible? If there are 5 million items, would I run the risk of getting any task cancelled or timeout errors? I get about 1 or 2 responses back every second.

var tasks = items.Select(async item =>
{
    await Update(CreateUrl(item));
}).ToList();

if (tasks.Any())
{
    await Task.WhenAll(tasks);
}                

private async Task<HttpResponseMessage> Update(string url)
{
    var client = new HttpClient();
    var response = await client.GetAsync(url).ConfigureAwait(false);
    //log response.
    return response;
}

UPDATE: I am actually getting TaskCanceledExceptions. Did my system run out of threads? What could I do to avoid this?

Prabhu
  • I'm thinking that if your plan involves issuing 5 million HTTP requests, you're probably doing something wrong. Is there no separate bulk update API or endpoint available to you? – Damien_The_Unbeliever Aug 27 '14 at 06:29
  • @Damien_The_Unbeliever Unfortunately, no. It's a 3rd party API and they don't have that feature. – Prabhu Aug 27 '14 at 06:33
  • It doesn't seem you will get much benefit from doing this in so many threads. The bottleneck is the network (and the server on the other side); it's not like you are performing a CPU-intensive task that would benefit from parallelization. Especially if it takes a second to process a single request (which might have to do with the mini-DoS attack you are performing on that server). – vgru Aug 27 '14 at 06:37
  • @Groo Is there no way to complete the task right after sending the request, without waiting for the response? – Prabhu Aug 27 '14 at 06:41
  • And if you're getting 2 responses a second, have you considered that 5 million updates are going to take ~29 days to complete? – Damien_The_Unbeliever Aug 27 '14 at 06:43
  • @Damien_The_Unbeliever patience is a virtue. – CSharpie Aug 27 '14 at 06:47
  • @Damien_The_Unbeliever Yeah, I don't mind the # of days it's going to take. As long as my program is done sending the updates, I'm good. They have a rate limit of 500 requests per second so not sure why it's taking a second or two for each response. I do have a throttle that controls the # of requests going out per second (not shown here for simplicity). The # of tasks created is the part I am not controlling. – Prabhu Aug 27 '14 at 06:48
  • @Prabhu: a HTTP request must be accompanied by a HTTP response. If you close the socket just after calling `SendAsync`, you don't have a clue if server will get the data properly. Have you tried simply sending them sequentially? – vgru Aug 27 '14 at 06:57
  • @Groo So should I be using a SemaphoreSlim to limit the # of threads being created? I've tried sequentially, but that's definitely going slower than what I currently have. – Prabhu Aug 27 '14 at 07:02
  • If you don't specify the [`MaxDegreeOfParallelism`](http://stackoverflow.com/questions/9538452/what-does-maxdegreeofparallelism-do) property, the number of concurrently running tasks will be limited by the number of logical cores. On the other hand, you can also change the max number of threads in the `ThreadPool` ([`SetMaxThreads`](http://msdn.microsoft.com/en-us/library/system.threading.threadpool.setmaxthreads(v=vs.110).aspx)), but IIRC .NET is rather conservative about creating new threadpool threads anyway (it certainly won't create thousands of threads per second). – vgru Aug 27 '14 at 07:23
  • The problem with your tasks is that the `HttpClient` basically suspends each task until the response is received, meaning that CLR is free to start running other tasks (having nothing better to do), and then run awaited continuations when they're ready to run. – vgru Aug 27 '14 at 07:34
  • @Groo so is there no way around this except wait for the response before sending more requests? I might as well just send 1-2 requests a second right? – Prabhu Aug 27 '14 at 07:36
  • @Prabhu: well, you could use a semaphore as you've written to avoid creating too many requests. Try to play with different limits for the semaphore, test (i.e. profile), and if you manage to find a sweet spot, use it; otherwise the bottleneck is likely on the other side. – vgru Aug 27 '14 at 07:39
  • @Prabhu The OS limits the number of concurrent requests you can execute via your network device driver. What you're experiencing is *probably* requests timing out while waiting inside the queue. You can throttle work using a `SemaphoreSlim`, or look into [`TPL Dataflow`](http://msdn.microsoft.com/en-us/library/hh228603(v=vs.110).aspx) – Yuval Itzchakov Aug 27 '14 at 07:44
  • Actually, you know what... I had `HttpClient` declared as a class variable, so the same client object was being used to send all requests (I have it wrong in the question). Once I made the `HttpClient` local (as it is currently written in the question), I am seeing a lot more responses come back every second. – Prabhu Aug 27 '14 at 09:55
  • @Prabhu Note that `HttpClient` is designed to be used globally. Creating that many instances per second means the GC will also have to collect each instance after the request is sent, which *might* hurt performance. – Yuval Itzchakov Aug 27 '14 at 16:16
  • @YuvalItzchakov Interesting. The difference in response rate is just huge though. If I make 500 requests a second, with a local HttpClient, I am getting back responses almost instantly (500 a second), but with the global, it's almost like it's sequential. – Prabhu Aug 27 '14 at 16:47
  • @YuvalItzchakov I did get some of these errors though at times, any idea what they might be: Message: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host. Inner Exception: System.Net.Sockets.SocketException – Prabhu Aug 27 '14 at 16:50
  • @YuvalItzchakov and this one too: Message: An error occurred while sending the request. Inner Exception: System.Net.WebException: The underlying connection was closed: An unexpected error occurred on a receive. ---> System.IO.IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host. ---> System.Net.Sockets.SocketException – Prabhu Aug 27 '14 at 16:50

3 Answers


Your method will kick off all the tasks at the same time, which may not be what you want. There wouldn't be any threads involved, because with async I/O operations there is no thread, but there may be limits on the number of concurrent connections.

There may be better tools for this, but if you want to use async/await, one option is Stephen Toub's ForEachAsync, as documented in this article. It lets you control how many simultaneous operations to execute, so you don't overrun your connection limit.

Here it is from the article:

public static class Extensions
{
    public static async Task ExecuteInPartition<T>(IEnumerator<T> partition, Func<T, Task> body)
    {
        using (partition)
            while (partition.MoveNext())
                await body(partition.Current);
    }

    public static Task ForEachAsync<T>(this IEnumerable<T> source, int dop, Func<T, Task> body)
    {
        // Requires System.Linq and System.Collections.Concurrent (for Partitioner).
        return Task.WhenAll(
            from partition in Partitioner.Create(source).GetPartitions(dop)
            select ExecuteInPartition(partition, body));
    }
}

Usage:

public async Task UpdateAll()
{
    // Allow for 100 concurrent updates
    await items.ForEachAsync(100, item => Update(CreateUrl(item)));
}
NeddySpaghetti
  • I'm confused. You reference a great article that says making async I/O requests requires no extra threads (other than the current one and an I/O completion port), and then you suggest using multiple threads (your solution uses `Task.Run`) to do the same I/O-bound work? – Yuval Itzchakov Aug 27 '14 at 16:21
  • Unsure about the approach here but that "there is no thread" article is BOSS. – JasonCoder Aug 27 '14 at 16:52
  • @YuvalItzchakov the author said he added it to enable extra parallelism, but I can see your point. I've updated the code. – NeddySpaghetti Aug 28 '14 at 08:58

A much better approach would be to use TPL Dataflow's ActionBlock with MaxDegreeOfParallelism and a single HttpClient:

Task UpdateAll(IEnumerable<Item> items)
{
    var block = new ActionBlock<Item>(
        item => UpdateAsync(CreateUrl(item)), 
        new ExecutionDataflowBlockOptions {MaxDegreeOfParallelism = 1000});

    foreach (var item in items)
    {
        block.Post(item);
    }

    block.Complete();
    return block.Completion;
}

private readonly HttpClient _client = new HttpClient();

async Task UpdateAsync(string url)
{
    var response = await _client.GetAsync(url).ConfigureAwait(false);
    Console.WriteLine(response.StatusCode);
}
  • A single HttpClient can be used concurrently for multiple requests, so it's much better to create and dispose a single instance instead of 5 million.
  • There are numerous problems with firing so many requests at the same time: the machine's network stack, the target web site, timeouts, and so forth. The ActionBlock caps that number with MaxDegreeOfParallelism (which you should test and optimize for your specific case). It's important to note that TPL may choose a lower degree of parallelism when it deems that appropriate.
  • When you have a single async call at the end of an async method or lambda expression, it's better for performance to remove the redundant async-await and just return the task (i.e. return block.Completion;).
  • Complete notifies the ActionBlock not to accept any more items, but to finish processing the items it already has. When it's done, the Completion task will complete, so you can await it.
i3arnon

I suspect you are suffering from outgoing connection management preventing large numbers of simultaneous connections to the same domain. The answers given in this extensive Q+A might give you some avenues to investigate.

What is limiting the # of simultaneous connections my ASP.NET application can make to a web service?
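On the full .NET Framework, the per-host cap discussed in that question is controlled by ServicePointManager.DefaultConnectionLimit (which defaults to just 2 for client apps), so one avenue is to raise it before issuing any requests. A minimal sketch; the value of 100 is illustrative, not a recommendation:

```csharp
using System.Net;

// Raise the per-host limit on outgoing HTTP connections before
// creating any HttpClient instances. Tune this against the remote
// API's documented rate limit rather than guessing.
ServicePointManager.DefaultConnectionLimit = 100;
```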

In terms of your code structure, I'd personally try to use a dynamic pool of connections. You know that you can't actually get 5 million connections simultaneously, so attempting it will just fail; you may as well work with a reasonable, configured limit of (for instance) 20 connections and use them as a pool. That way you can tune the number up or down.
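One way to sketch such a capped pool, reusing the Update and CreateUrl methods from the question, is to gate the requests with a SemaphoreSlim so at most N are in flight at once (the count of 20 is just the example limit above):

```csharp
// Hypothetical throttled runner: at most maxConcurrency requests in flight.
private static async Task UpdateAllThrottled(IEnumerable<Item> items, int maxConcurrency = 20)
{
    using (var gate = new SemaphoreSlim(maxConcurrency))
    {
        var tasks = items.Select(async item =>
        {
            // Wait for a free slot before issuing the request.
            await gate.WaitAsync().ConfigureAwait(false);
            try
            {
                await Update(CreateUrl(item)).ConfigureAwait(false);
            }
            finally
            {
                gate.Release(); // free the slot even if the request faults
            }
        }).ToList();

        await Task.WhenAll(tasks).ConfigureAwait(false);
    }
}
```

Note this still materializes one task per item up front; it only caps how many requests run concurrently, with maxConcurrency as the knob to tune.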

Alternatively, you could investigate HTTP pipelining (which I've not used), which is intended specifically for the job you are doing (batching up HTTP requests): http://en.wikipedia.org/wiki/HTTP_pipelining

PhillipH