
I have two versions of my program that submit ~3000 HTTP GET requests to a web server.

The first version is based on what I read here. That solution makes sense to me because making web requests is I/O-bound work, and using async/await along with Task.WhenAll or Task.WaitAll means you can submit 100 requests at once, then wait for them all to finish before submitting the next 100, so you don't bog down the web server. I was surprised to see that this version completed all of the work in ~12 minutes - way slower than I expected.

The second version submits all 3000 HTTP GET requests inside a Parallel.ForEach loop. I use .Result to wait for each request to finish before the rest of the logic within that iteration of the loop can execute. I thought this would be a far less efficient solution, since using threads to perform tasks in parallel is usually better suited to CPU-bound work, but I was surprised to see that this version completed all of the work within ~3 minutes!

My question is: why is the Parallel.ForEach version faster? This came as an extra surprise because when I applied the same two techniques against a different API/web server, version 1 of my code was actually faster than version 2 by about 6 minutes - which is what I expected. Could the performance of the two versions have something to do with how the web server handles the traffic?

You can see a simplified version of my code below:

private async Task<ObjectDetails> TryDeserializeResponse(HttpResponseMessage response)
{
    try
    {
        using (Stream stream = await response.Content.ReadAsStreamAsync())
        using (StreamReader readStream = new StreamReader(stream, Encoding.UTF8))
        using (JsonTextReader jsonTextReader = new JsonTextReader(readStream))
        {
            JsonSerializer serializer = new JsonSerializer();
            ObjectDetails objectDetails = serializer.Deserialize<ObjectDetails>(
                jsonTextReader);
            return objectDetails;
        }
    }
    catch (Exception e)
    {
        // Log exception
        return null;
    }
}

private async Task<HttpResponseMessage> TryGetResponse(string urlStr)
{
    try
    {
        HttpResponseMessage response = await httpClient.GetAsync(urlStr)
            .ConfigureAwait(false);
        if (response.StatusCode != HttpStatusCode.OK)
        {
            throw new WebException("Response code is "
                + response.StatusCode.ToString() + "... not 200 OK.");
        }
        return response;
    }
    catch (Exception e)
    {
        // Log exception
        return null;
    }
}

private async Task<ObjectDetails> GetObjectDetailsAsync(string baseUrl, int id)
{
    string urlStr = baseUrl + @"objects/id/" + id + "/details";

    HttpResponseMessage response = await TryGetResponse(urlStr);
    if (response == null)
    {
        // TryGetResponse already logged the failure
        return null;
    }

    ObjectDetails objectDetails = await TryDeserializeResponse(response);

    return objectDetails;
}

// With ~3000 objects to retrieve, this code will create 100 API calls
// in parallel, wait for all 100 to finish, and then repeat that process
// ~30 times. In other words, there will be ~30 batches of 100 parallel
// API calls.
private Dictionary<int, Task<ObjectDetails>> GetAllObjectDetailsInBatches(
    string baseUrl, Dictionary<int, MyObject> incompleteObjects)
{
    int batchSize = 100;
    int numberOfBatches = (int)Math.Ceiling(
        (double)incompleteObjects.Count / batchSize);
    Dictionary<int, Task<ObjectDetails>> objectTaskDict
        = new Dictionary<int, Task<ObjectDetails>>(incompleteObjects.Count);

    var orderedIncompleteObjects = incompleteObjects.OrderBy(pair => pair.Key);

    for (int i = 0; i < numberOfBatches; i++)
    {
        var batchOfObjects = orderedIncompleteObjects.Skip(i * batchSize)
            .Take(batchSize);
        // ToList() materializes the query; re-enumerating a deferred
        // Select would start a second set of requests.
        var batchObjectsTaskList = batchOfObjects.Select(
            pair => GetObjectDetailsAsync(baseUrl, pair.Key)).ToList();
        Task.WaitAll(batchObjectsTaskList.ToArray());
        foreach (var objTask in batchObjectsTaskList)
            objectTaskDict.Add(objTask.Result.id, objTask);
    }

    return objectTaskDict;
}

public void GetObjectsVersion1()
{
    string baseUrl = @"https://mywebserver.com/api/";

    // GetIncompleteObjects is not shown, but it is not relevant to
    // the question
    Dictionary<int, MyObject> incompleteObjects = GetIncompleteObjects();

    Dictionary<int, Task<ObjectDetails>> objectTaskDict
        = GetAllObjectDetailsInBatches(baseUrl, incompleteObjects);

    foreach (KeyValuePair<int, MyObject> pair in incompleteObjects)
    {
        ObjectDetails objectDetails = objectTaskDict[pair.Key].Result;

        // Code here that copies fields from objectDetails to pair.Value
        // (the incompleteObject)

        AllObjects.Add(pair.Value);
    };
}

public void GetObjectsVersion2()
{
    string baseUrl = @"https://mywebserver.com/api/";

    // GetIncompleteObjects is not shown, but it is not relevant to
    // the question
    Dictionary<int, MyObject> incompleteObjects = GetIncompleteObjects();

    Parallel.ForEach(incompleteObjects, pair =>
    {
        ObjectDetails objectDetails = GetObjectDetailsAsync(
            baseUrl, pair.Key).Result;

        // Code here that copies fields from objectDetails to pair.Value
        // (the incompleteObject)

        AllObjects.Add(pair.Value);
    });
}
davekats
  • At some point you did not use ConfigureAwait(false) (see `GetObjectDetailsAsync`) and that will have a great impact on performance because the code is waiting for synchronization. – Sir Rufo Sep 25 '19 at 17:06
  • Also it would be interesting on which context this code is running. WinForms/WPF/asp.net/Console/...? – Sir Rufo Sep 25 '19 at 17:09
  • @SirRufo Ah yes I will add ConfigureAwait(false) there to see how it affects performance. It is a Console app (.Net Framework 4.6.1) – davekats Sep 25 '19 at 17:52
  • @SirRufo Adding ConfigureAwait(false) in GetObjectDetailsAsync didn't seem to affect performance. – davekats Sep 25 '19 at 21:32
  • Yes, a console application has no synchronization context at all, so there are no synchronizations that can impact performance. That's the reason I wanted to know what kind of application the code is running in – Sir Rufo Sep 26 '19 at 05:37
  • The async version of your code is blocking a lot. Is this intentional? Your `GetAllObjectDetailsInBatches` should be an `async` method. You should not `Task.WaitAll` in there but rather `await Task.WhenAll`. Your `TryDeserializeResponse` should be `async` method and reading from stream should be `await response.Content.ReadAsStreamAsync()` without the `.Result`. Get those changes in and see what the numbers say after that. – JohanP Sep 26 '19 at 10:49
  • @JohanP I actually did intentionally make `GetAllObjectDetailsInBatches` synchronous and use `Task.WaitAll` because there isn't any other work that can be done while the function is running (This is a console app with no UI). But good catch on `TryDeserializeResponse`. That was a copy/paste error. I made an edit to add the updated version. – davekats Sep 26 '19 at 18:48
  • @davekats there is plenty of work to be done: all 3k responses will resume on thread-pool threads and you're blocking on all of them. Your pool starts off with as few threads as possible and new ones get injected at ~2 a second, so your concurrency and throughput are very low and it seems you are experiencing thread-pool starvation. – JohanP Sep 26 '19 at 21:29
  • @JohanP I'm having trouble understanding how the main thread blocking on `Task.WaitAll` affects the work being done on the threadpool threads. – davekats Sep 30 '19 at 15:29
  • I was more talking about the .Result on ReadStreamAsync but I see you have changed that. – JohanP Sep 30 '19 at 21:21
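A quick way to test the thread-pool starvation theory raised in the comments is to raise the pool's minimum worker-thread count before kicking off the work; if blocked pool threads are the bottleneck, the blocking version should ramp up much faster. This is only a diagnostic sketch, not a fix, and the value 100 is an arbitrary assumption chosen to match the batch size:

```csharp
using System;
using System.Threading;

class MinThreadsProbe
{
    static void Main()
    {
        // Read the current minimums: on .NET Framework the default worker
        // minimum is the processor count, and threads beyond it are
        // injected slowly (roughly a couple per second).
        ThreadPool.GetMinThreads(out int workers, out int ioThreads);
        Console.WriteLine($"Default min worker threads: {workers}");

        // Ask the pool to create up to 100 workers on demand, skipping the
        // slow injection ramp.
        ThreadPool.SetMinThreads(100, ioThreads);
    }
}
```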

3 Answers


https://learn.microsoft.com/en-us/dotnet/api/system.threading.tasks.parallel.foreach?view=netframework-4.8

Basically, Parallel.ForEach allows iterations to run in parallel, so you are not constraining the iteration to run serially. On a host that is not thread-constrained, this will tend to improve throughput.
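As a minimal illustration (the names and numbers here are made up, not from the question): Parallel.ForEach partitions the source across worker threads, and ParallelOptions.MaxDegreeOfParallelism caps how many iterations run at once:

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

class ParallelForEachDemo
{
    static void Main()
    {
        var ids = Enumerable.Range(1, 20).ToList();
        var processed = new ConcurrentBag<int>();

        // Run iterations concurrently, but never more than 4 at a time.
        Parallel.ForEach(
            ids,
            new ParallelOptions { MaxDegreeOfParallelism = 4 },
            id => processed.Add(id * 2));

        Console.WriteLine(processed.Count); // 20
    }
}
```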

Pharabus

In short:

  • Parallel.ForEach() is most useful for CPU-bound tasks.
  • Task.WaitAll() is more useful for IO-bound tasks.

So in your case, you are getting information from web servers, which is IO. If the async methods are implemented correctly, they won't block any thread (they will use IO completion ports to wait on). That way the threads can do other work.

By running the async method GetObjectDetailsAsync(baseUrl, pair.Key).Result synchronously, you block a thread, so the thread pool will be flooded with waiting threads.

So I think the Task solution is the better fit.
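To make the Task-based version fully non-blocking, the batch loop itself can be async and await Task.WhenAll instead of blocking on Task.WaitAll. A simplified sketch under stated assumptions: fetchAsync stands in for the asker's GetObjectDetailsAsync, and the helper name and signature are illustrative, not from the question:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

static class BatchFetcher
{
    public static async Task<List<string>> GetAllInBatchesAsync(
        IReadOnlyList<int> ids, Func<int, Task<string>> fetchAsync, int batchSize)
    {
        var results = new List<string>(ids.Count);
        for (int i = 0; i < ids.Count; i += batchSize)
        {
            // ToList() materializes the batch so each task starts exactly once.
            var batchTasks = ids.Skip(i).Take(batchSize)
                .Select(fetchAsync).ToList();
            // Await instead of Task.WaitAll: the calling thread is freed
            // while the batch is in flight.
            results.AddRange(await Task.WhenAll(batchTasks).ConfigureAwait(false));
        }
        return results;
    }
}
```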

Jeroen van Langen
  • Thanks Jeroen, I agree with you but my question isn't which solution is the better fit. I know that the Task solution is the better fit, but what I am seeing is that the Task solution is actually slower - an unexpected result. So my question is why, in this case, is the Parallel solution faster? – davekats Sep 25 '19 at 21:22

A possible reason why Parallel.ForEach may run faster is because it creates the side-effect of throttling. Initially x threads are processing the first x elements (where x is the number of available cores), and progressively more threads may be added depending on internal heuristics. Throttling IO operations is a good thing because it protects the network and the server that handles the requests from becoming overburdened. Your alternative improvised method of throttling, by making requests in batches of 100, is far from ideal for many reasons, one of them being that 100 concurrent requests are a lot of requests! Another one is that a single long-running operation may delay the completion of the batch until long after the completion of the other 99 operations.

Note that Parallel.ForEach is also not ideal for parallelizing IO operations. It just happened to perform better than the alternative, wasting memory all along. For better approaches look here: How to limit the amount of concurrent async I/O operations?
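The linked approach can be sketched with SemaphoreSlim: at most maxConcurrency operations are in flight, and a new one starts the moment any other completes, with no waiting for a whole batch. The helper name and signature are illustrative, not taken from the linked answer:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class ThrottledRunner
{
    public static async Task<TResult[]> RunThrottledAsync<TItem, TResult>(
        IEnumerable<TItem> items,
        Func<TItem, Task<TResult>> operation,
        int maxConcurrency)
    {
        using (var semaphore = new SemaphoreSlim(maxConcurrency))
        {
            var tasks = items.Select(async item =>
            {
                // Each task waits for a free slot before starting its work,
                // so no more than maxConcurrency operations run at once.
                await semaphore.WaitAsync().ConfigureAwait(false);
                try { return await operation(item).ConfigureAwait(false); }
                finally { semaphore.Release(); }
            });
            // Task.WhenAll preserves the order of the input sequence.
            return await Task.WhenAll(tasks).ConfigureAwait(false);
        }
    }
}
```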

Theodor Zoulias
  • Thank you for the great answer! So the difference between the batch solution and the solution using SemaphoreSlim is that the SemaphoreSlim solution will always have 20 requests running at the same time - starting a new request as soon as one of the 20 completes, whereas the batch method will wait to start the next batch until all requests are complete in the current batch. I like it. – davekats Sep 25 '19 at 21:29
  • @davekats yeap, you've got the concept. :-) – Theodor Zoulias Sep 25 '19 at 21:34