
I'm trying to figure out the correct way to parallelize HTTP requests using Task and async/await. I'm using the HttpClient class, which already has async methods for retrieving data. If I just call it in a foreach loop and await the response, only one request gets sent at a time (which makes sense, because during the await, control returns to our event loop, not to the next iteration of the foreach loop).

My wrapper around HttpClient looks like this:

public sealed class RestClient
{
    private readonly HttpClient client;

    public RestClient(string baseUrl)
    {
        var baseUri = new Uri(baseUrl);

        client = new HttpClient
        {
            BaseAddress = baseUri
        };
    }

    public async Task<Stream> GetResponseStreamAsync(string uri)
    {
        var resp = await GetResponseAsync(uri);
        return await resp.Content.ReadAsStreamAsync();
    }

    public async Task<HttpResponseMessage> GetResponseAsync(string uri)
    {
        var resp = await client.GetAsync(uri);
        if (!resp.IsSuccessStatusCode)
        {
            // ...
        }

        return resp;
    }

    public async Task<T> GetResponseObjectAsync<T>(string uri)
    {
        using (var responseStream = await GetResponseStreamAsync(uri))
        using (var sr = new StreamReader(responseStream))
        using (var jr = new JsonTextReader(sr))
        {
            var serializer = new JsonSerializer {NullValueHandling = NullValueHandling.Ignore};
            return serializer.Deserialize<T>(jr);
        }
    }

    public async Task<string> GetResponseString(string uri)
    {
        using (var resp = await GetResponseStreamAsync(uri))
        using (var sr = new StreamReader(resp))
        {
            return await sr.ReadToEndAsync();
        }
    }
}

And the code invoked by our event loop is:

public async void DoWork(Action<bool> onComplete)
{
    try
    {
        var restClient = new RestClient("https://example.com");

    var ids = (await restClient.GetResponseObjectAsync<IdListResponse>("/ids")).Ids;

        Log.Info("Downloading {0:D} items", ids.Count);
        using (var fs = new FileStream(@"C:\test.json", FileMode.Create, FileAccess.Write, FileShare.Read))
        using (var sw = new StreamWriter(fs))
        {
            sw.Write("[");

            var first = true;
            var numCompleted = 0;
            foreach (var id in ids)
            {
                Log.Info("Downloading item {0:D}, completed {1:D}", id, numCompleted);
                numCompleted += 1;
                try
                {
                    var str = await restClient.GetResponseString($"/info/{id}");
                    if (!first)
                    {
                        sw.Write(",");
                    }

                    sw.Write(str);

                    first = false;
                }
                catch (HttpException e)
                {
                    if (e.StatusCode == HttpStatusCode.Forbidden)
                    {
                        Log.Warn(e.ResponseMessage);
                    }
                    else
                    {
                        throw;
                    }
                }
            }

            sw.Write("]");
        }

        onComplete(true);
    }
    catch (Exception e)
    {
        Log.Error(e);
        onComplete(false);
    }
}

I've tried a handful of different approaches involving Parallel.ForEach, Linq.AsParallel, and wrapping the entire contents of the loop in a Task.
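For example, one of those attempts looked roughly like this (a sketch, not my exact code). The problem is that `Parallel.ForEach` doesn't understand async lambdas; the lambda compiles to `async void`, so `ForEach` returns as soon as every body hits its first `await`, before any download has actually finished:

```csharp
// Anti-pattern: Parallel.ForEach treats the async lambda as async void.
// The loop "completes" when each body reaches its first await, so the
// surrounding using blocks can dispose sw while requests are in flight.
Parallel.ForEach(ids, async id =>
{
    var str = await restClient.GetResponseString($"/info/{id}");
    sw.Write(str); // may run after sw has been disposed
});
```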

Austin Wagner
  • Possible duplicate of [Nesting await in Parallel.ForEach](https://stackoverflow.com/questions/11564506/nesting-await-in-parallel-foreach) – Michael Freidgeim Nov 30 '18 at 18:27

1 Answer


The basic idea is to keep track of all the asynchronous tasks and await them all at once. The simplest way to do this is to extract the body of your foreach into a separate asynchronous method and do something like this:

var tasks = ids.Select(i => DoWorkAsync(i));
await Task.WhenAll(tasks);

This way, the individual tasks are issued separately (still in sequence, but without waiting for the I/O to complete), and you await them all at the same time.
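Applied to the code in the question, that might look something like the sketch below; `DownloadItemAsync` is a name I'm introducing for illustration, and the 403 handling mirrors the catch block from the original loop:

```csharp
// Extracted loop body: returns the downloaded string, or null for a
// logged-and-skipped 403, matching the original catch block.
private static async Task<string> DownloadItemAsync(RestClient restClient, int id)
{
    try
    {
        return await restClient.GetResponseString($"/info/{id}");
    }
    catch (HttpException e) when (e.StatusCode == HttpStatusCode.Forbidden)
    {
        Log.Warn(e.ResponseMessage);
        return null;
    }
}

// Issue all the requests up front, then await them together.
var tasks = ids.Select(id => DownloadItemAsync(restClient, id)).ToList();
var results = await Task.WhenAll(tasks);

// Write the results out sequentially once everything has arrived.
sw.Write("[");
sw.Write(string.Join(",", results.Where(r => r != null)));
sw.Write("]");
```

Note that buffering all results before writing trades memory for simplicity; if the responses are large, you may prefer to stream them as they complete.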

Do note that you will also need to do some configuration - by default, .NET throttles HTTP to only two simultaneous connections to the same server.
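On .NET Framework that limit is controlled globally via `ServicePointManager`; on .NET Core and later it lives on the handler instead. A sketch, with 8 as an assumed limit:

```csharp
// .NET Framework: global per-host connection limit (default is 2).
ServicePointManager.DefaultConnectionLimit = 8;

// .NET Core / modern .NET: set the limit on the handler instead
// (the default there is effectively unlimited).
var client = new HttpClient(new HttpClientHandler
{
    MaxConnectionsPerServer = 8
});
```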

Luaan
  • See the accepted answer here: http://stackoverflow.com/questions/19102966/parallel-foreach-vs-task-run-and-task-whenall – Jim L Feb 09 '17 at 15:36
    @AustinWagner By default, yes. HTTP throttling is part of the HTTP specification, so technically disabling (or relaxing) it is violating the specification. That said, we're living in different times - multiple concurrent requests aren't as bad as when HTTP was first designed. In any case, if you expect to get (significantly) rate limited, you might want to implement your own throttling as well anyway - otherwise you're just wasting a bunch of memory doing things in parallel when you could stream them instead - assuming you don't need all the responses at the same time, of course. – Luaan Feb 09 '17 at 15:42
    This worked, but with two caveats: 1. The connection limit @Luaan mentioned (worked around with `ServicePointManager.DefaultConnectionLimit = 8`) 2. The `HttpClient` timeout seems to start from when the request is queued, not when it's sent to the server. Though not optimal, I worked around this by just setting a long timeout. With this solution in place I now have an ~50 minute download process completing in ~5 minutes. – Austin Wagner Feb 09 '17 at 17:12
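The custom throttling Luaan suggests is commonly done with a `SemaphoreSlim`. It also sidesteps the timeout caveat from the comment above: `GetResponseString` isn't called until a slot is acquired, so `HttpClient`'s timeout clock only starts once the request can actually proceed. A sketch with an assumed limit of 8 concurrent requests:

```csharp
// Allow at most 8 requests in flight at any one time.
var throttle = new SemaphoreSlim(8);

var tasks = ids.Select(async id =>
{
    await throttle.WaitAsync();
    try
    {
        // The HTTP call (and its timeout) only starts here,
        // once a concurrency slot is free.
        return await restClient.GetResponseString($"/info/{id}");
    }
    finally
    {
        throttle.Release();
    }
});

var results = await Task.WhenAll(tasks);
```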