
I need to fetch content from about 3000 URLs. I'm using HttpClient: I create a Task for each URL, add the tasks to a list, and then await Task.WhenAll. Something like this:

    var tasks = new List<Task<string>>();
    foreach (var url in urls) {
        var task = Task.Run(() => httpClient.GetStringAsync(url));
        tasks.Add(task);
    }

    var t = Task.WhenAll(tasks);

However, many tasks end up in the Faulted or Canceled state. I thought the problem might be with the specific URLs, but it is not: I can fetch those URLs in parallel with curl without any problem.

I tried HttpClientHandler and WinHttpHandler with various timeouts, etc. Several hundred URLs always end with an error. Then I tried fetching the URLs in batches of 10, and that works: no errors, but it is very slow. Curl fetches 3000 URLs in parallel very fast. Then I tried to get httpbin.org 3000 times to verify that the issue is not with my particular URLs:

    var handler = new HttpClientHandler() { MaxConnectionsPerServer = 5000 };
    var httpClient = new HttpClient(handler);

    var tasks = new List<Task<HttpResponseMessage>>();
    foreach (var _ in Enumerable.Range(1, 3000)) {
        var task = Task.Run(() => httpClient.GetAsync("http://httpbin.org"));
        tasks.Add(task);
    }

    var t = Task.WhenAll(tasks);
    try { await t.ConfigureAwait(false); } catch { }

    int ok = 0, faulted = 0, cancelled = 0;

    foreach (var task in tasks) {
        switch (task.Status) {
            case TaskStatus.RanToCompletion: ok++; break;
            case TaskStatus.Faulted: faulted++; break;
            case TaskStatus.Canceled: cancelled++; break;

        }
    }

    Console.WriteLine($"RanToCompletion: {ok} Faulted: {faulted} Canceled: {cancelled}");

Again, several hundred tasks always end in error.

So, what is the issue here? Why can't I fetch those URLs with async?

I'm using .NET Core, so the suggestion to use ServicePointManager (from "Trying to run multiple HTTP requests in parallel, but being limited by Windows (registry)") is not applicable.

Also, the URLs I need to fetch point to different hosts. The code with httpbin is just a test to show that the problem is not that my URLs are invalid.

Alexei Levenkov
jira
  • Why do you wrap `httpClient.GetStringAsync(url)` in `Task.Run`? It already gives you a `Task`. Also, starting _all_ of those requests nearly at the same time, I'd actually expect some to fault / time out. I'd try using `Parallel.ForEach` to have a little more control over parallelism. – Fildor May 01 '20 at 17:51
  • I guess `Task.WhenAll()` will fail when any of the tasks throws an exception. Try wrapping the `httpClient.GetAsync()` in the try block – zafar May 01 '20 at 17:56
  • Is it due to simplification or are you in fact ignoring the actual results? – Fildor May 01 '20 at 17:57
  • I just stripped the code to the minimum. In the actual app I'd want to do some processing of the fetched content. – jira May 01 '20 at 18:00
  • Also note, bombarding one URL with 3000 requests at a time (or at least at _very_ short intervals) _can_ get you flood-banned or at least throttled. – Fildor May 01 '20 at 18:00
  • Yeah, but there was no problem when I run those 3000 requests with curl. And it was very fast – jira May 01 '20 at 18:01
  • Did you try to run the requests in sequence (just for the sake of measurement)? And can you add how you tested with curl? – Fildor May 01 '20 at 18:03
  • I tried in batches of ten. It worked, but was painfully slow. I did not actually measure it, but it must have taken like 15 minutes. Curl was <1min for sure. – jira May 01 '20 at 18:05
  • Remove `Task.Run` from your code. Just use `var task = httpClient.Get...`. – Alexander Petrov May 01 '20 at 21:47
  • @Fildor the `Parallel` class [is not async friendly](https://stackoverflow.com/questions/15136542/parallel-foreach-with-asynchronous-lambda). For limiting the concurrency of async operations look [here](https://stackoverflow.com/questions/10806951/how-to-limit-the-amount-of-concurrent-async-i-o-operations). The easiest way is by using a `SemaphoreSlim`. – Theodor Zoulias May 02 '20 at 05:18
  • @TheodorZoulias correct. I'd not mix parallel and TAP. That wasn't clear. – Fildor May 02 '20 at 05:36
  • As a side note, bombarding httpbin.org with 3000 requests, and posting code that invites everyone to do the same, could result in increased hosting costs for the poor site, and could be seen as a mild form of [DDoS attack](https://en.wikipedia.org/wiki/Denial-of-service_attack). So personally I am not going to attempt to verify the OP's experiment. – Theodor Zoulias May 02 '20 at 05:39
  • In the "real" scenario are the 3000 URLs hitting the same host? If they are (or maybe even if they're not), initiating 3000 requests at the same time is rarely advisable. Consider throttling to something like 50 or 100 at a time. [Here](https://stackoverflow.com/q/22492383/62600) you'll find excellent examples of doing this with both SemaphoreSlim and TPL Dataflow. – Todd Menier May 02 '20 at 14:56
  • @ToddMenier No, various hosts. Some can repeat but about 10 times maximum. – jira May 02 '20 at 15:03

1 Answer

As Fildor said in the comments, httpClient.GetStringAsync already returns a Task, so you don't need to wrap it in Task.Run.

I ran the code below in a console app. It took 50 seconds to complete. In your comment you wrote that curl performs 3000 requests in less than a minute, so the two are comparable.

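    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.Linq;
    using System.Net.Http;
    using System.Threading.Tasks;

    var httpClient = new HttpClient();
    var tasks = new List<Task<string>>();
    var sw = Stopwatch.StartNew();

    // Start all requests directly; GetStringAsync already returns a Task<string>.
    for (int i = 0; i < 3000; i++)
    {
        var task = httpClient.GetStringAsync("http://httpbin.org");
        tasks.Add(task);
    }

    // Blocks until every task finishes; throws AggregateException if any task faulted or was canceled.
    Task.WaitAll(tasks.ToArray());
    sw.Stop();

    Console.WriteLine(sw.Elapsed);
    // IsCompleted is also true for faulted or canceled tasks, so check the status instead.
    Console.WriteLine(tasks.All(t => t.Status == TaskStatus.RanToCompletion));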

Also, all requests were completed successfully.

In your code, you are waiting for tasks started using Task.Run. But you need to wait for the completion of tasks started by calling httpClient.Get...
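If some requests still fault when everything is started at once, the comments on the question suggest capping concurrency with SemaphoreSlim (50 to 100 at a time). Here is a minimal sketch of that idea, not a tuned implementation. The `ThrottledFetch` class, the `RunAsync` method, and the injected `fetch` delegate are hypothetical names introduced only for this example; failures are captured per URL so you can inspect them afterwards.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for illustration; not part of HttpClient or the question's code.
public static class ThrottledFetch
{
    // Calls fetch(url) for every URL with at most maxConcurrency calls in flight.
    // Exceptions are captured per URL instead of being thrown from WhenAll.
    public static async Task<(string Url, string Body, Exception Error)[]> RunAsync(
        IEnumerable<string> urls, Func<string, Task<string>> fetch, int maxConcurrency)
    {
        using var gate = new SemaphoreSlim(maxConcurrency);

        async Task<(string Url, string Body, Exception Error)> FetchOne(string url)
        {
            // Wait for a free slot without blocking a thread.
            await gate.WaitAsync().ConfigureAwait(false);
            try
            {
                return (url, await fetch(url).ConfigureAwait(false), null);
            }
            catch (Exception ex)
            {
                return (url, null, ex);
            }
            finally
            {
                // Always free the slot, even when the request failed.
                gate.Release();
            }
        }

        // WhenAll enumerates the sequence, which starts every FetchOne call;
        // each call then waits at the gate until a slot is available.
        return await Task.WhenAll(urls.Select(url => FetchOne(url))).ConfigureAwait(false);
    }
}

// Possible usage (assumes existing `urls` and `httpClient` variables; 50 is an arbitrary cap):
// var results = await ThrottledFetch.RunAsync(urls, url => httpClient.GetStringAsync(url), 50);
// foreach (var r in results.Where(r => r.Error != null))
//     Console.WriteLine($"{r.Url}: {r.Error.Message}");
```

Passing the fetch call in as a delegate keeps the throttling separate from HttpClient, so the limit can be tried out with a fake fetch before it is pointed at real hosts.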

Alexander Petrov
  • I don't expect the `Task.Run` to be the problem. This method when used with an async delegate creates a thin wrapper (a proxy) of the created `Task`. It offers no benefit in this case, but shouldn't be harmful either (beyond hurting a bit the readability of the code). – Theodor Zoulias May 02 '20 at 05:29
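
The proxy behaviour described in the last comment can be checked with a small sketch. `FakeFetchAsync` is a made-up stand-in for `httpClient.GetStringAsync`, and `TaskRunProxyDemo` is a hypothetical class name; the only point is that the task returned by `Task.Run` with an async delegate carries the inner task's result or failure.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical demo class; FakeFetchAsync stands in for an I/O-bound call.
public static class TaskRunProxyDemo
{
    public static async Task<string> FakeFetchAsync(string url)
    {
        await Task.Delay(50).ConfigureAwait(false);
        if (url.Length == 0) throw new ArgumentException("empty url");
        return "body of " + url;
    }

    public static async Task DemoAsync()
    {
        // Both are Task<string>: Task.Run unwraps the task returned by the delegate.
        Task<string> direct = FakeFetchAsync("a");
        Task<string> wrapped = Task.Run(() => FakeFetchAsync("a"));
        Console.WriteLine(await direct == await wrapped); // prints True

        // A failure inside the delegate surfaces on the proxy task as well.
        Task<string> failing = Task.Run(() => FakeFetchAsync(""));
        try { await failing; } catch (ArgumentException) { }
        Console.WriteLine(failing.Status); // prints Faulted
    }
}
```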