
Consider the following code:

var options = new ParallelOptions();
var urls = GetListOfUrls();

Parallel.ForEach(urls, options, url => {
    try {
        using (HttpClient client = new HttpClient()) {
            client.Timeout = TimeSpan.FromMinutes(30);
            Task task = client.GetAsync(url);
            task.Wait();
        }
    } catch (Exception exception) {
        Console.WriteLine(exception.Message);
    }

});

It spins up ~670 threads on a VM with 8 cores. Is this normal? My understanding was that the rule of thumb for TPL was 25 threads per core, which would put it into the 200 threads range.

P.S. In the code above, GetListOfUrls() returns a million URLs.


AngryHacker
  • [You're using HttpClient wrong](https://aspnetmonsters.com/2016/08/2016-08-27-httpclientwrong/) – H H May 29 '19 at 20:37
  • I don't know your "rule of thumb", do you have a source? The aim is 1 thread / core. – H H May 29 '19 at 20:39
  • The max nr of threads is in the thousands, and with slow I/O you will get there with 1 thread / 500 ms. – H H May 29 '19 at 20:40
  • In my experience, this is quite normal. If your actions block, then TPL will spin up more threads, and many scenarios fall into this "advanced usage scenario" for MaxDegreeOfParallelism: "When the thread pool's heuristics is unable to determine the right number of threads to use and could end up injecting too many threads. For example, in long-running loop body iterations, the thread pool might not be able to tell the difference between reasonable progress or livelock or deadlock, and might not be able to reclaim threads that were added to improve performance." – Mike Zboray May 29 '19 at 20:41
  • @HenkHolterman Thanks for the HttpClient link. Wow. The fix seems like such an anti-pattern. – AngryHacker May 29 '19 at 20:43
  • Mixing Parallel.ForEach with async code; this code is for demonstration of the problem only, I assume? Otherwise use Task.WhenAll, or something else if you want to control/limit concurrency: https://blogs.msdn.microsoft.com/fkaduk/2018/09/02/multiple-ways-how-to-limit-parallel-tasks-processing/. Using .Wait() on a task is an anti-pattern, as it makes the code synchronous. – Peter Bons May 29 '19 at 20:44
  • I agree with @PeterBons, Parallel and async don't mix well. This code is not totally invalid, but not optimal either. The .Wait() is the fix here, normally an anti-pattern. – H H May 29 '19 at 20:46
  • The "using httpclient wrong" article is great at the analysis, not so much at the solution. Find it on docs.microsoft, and look for HttpClientFactory. – H H May 29 '19 at 20:47
  • @PeterBons this code is just a test to beat up a server. The `urls` list has a million entries in it. Would Task.WhenAll be able to handle that size? – AngryHacker May 29 '19 at 20:49
  • @HenkHolterman I like that Microsoft has published articles on how to properly use that class, but the class is just fundamentally broken by design. It's akin to designing a car with left and right turn signals and then later coming along with an article saying "You weren't supposed to turn right with that car, here's how to do it instead". The class was implemented with IDisposable, we all know what to do with classes like that, and nobody reads documentation. Pit of success and all that. – Lasse V. Karlsen May 29 '19 at 20:51
  • `HttpClient` wraps its WebRequest calls in a separate thread ([link](https://stackoverflow.com/questions/43237250/will-httpclient-async-methods-run-in-new-threads)). So you are actually creating two threads per request. You can reduce the number by using a single HttpClient instance (so your requests share proxy and DNS cache). Also, I suggest you change your solution to use `async` instead of TPL. – John Wu May 29 '19 at 20:53
  • Yes, see **Solution #2 using SemaphoreSlim** of the link I posted – Peter Bons May 29 '19 at 20:54
  • I'm wondering if we might just be overdue for a nuget package that does http requests properly **out of the box** by now. – Lasse V. Karlsen May 29 '19 at 20:54
  • @AngryHacker, the docs will point you to this [blog](https://devblogs.microsoft.com/pfxteam/processing-tasks-as-they-complete/) in the end. – H H May 29 '19 at 20:54
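Since the comments point at MaxDegreeOfParallelism as the relevant knob, here is a minimal sketch (an illustration, not code from the question) of capping the worker count in the original loop so the thread pool's injection heuristics never come into play:

```csharp
// Hypothetical tweak to the question's loop: cap the degree of parallelism
// explicitly instead of letting the pool inject threads for blocked workers.
var options = new ParallelOptions { MaxDegreeOfParallelism = 8 };

Parallel.ForEach(urls, options, url =>
{
    // ... same blocking body as in the question ...
});
```

With blocking bodies, the pool otherwise keeps adding threads because each worker looks stalled rather than busy, which is exactly the ~670-thread behavior observed.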

2 Answers


Your code example is not written with async in mind. You do not need Parallel.ForEach at all. HttpClient.GetAsync is already async, so there is no point wrapping it in CPU-bound tasks.

private readonly HttpClient _httpClient = new HttpClient();

var tasks = new List<Task>();
foreach (var url in urls)
{
    var task = DoWork(url);
    tasks.Add(task);
}

try
{
    await Task.WhenAll(tasks);
}
catch
{
    // Task.WhenAll rethrows the first fault; swallow it here
    // so every task's exception can be inspected below.
}

foreach (var task in tasks)
{
    if (task.Exception != null)
        Console.WriteLine(task.Exception.Message);
}

public async Task DoWork(string url)
{
    var response = await _httpClient.GetAsync(url);
    // do something with the response
}

While Parallel.ForEach() is a more efficient version of looping and using Task.Run(), it should really only be used for CPU-bound work (Task.Run Etiquette and Proper Usage). Calling a URL is not CPU-bound work; it's I/O work (more technically referred to as I/O completion port work).

YOU'RE USING HTTPCLIENT WRONG AND IT IS DESTABILIZING YOUR SOFTWARE

Thanks for the HttpClient link. Wow. The fix seems like such an anti-pattern.

While it may seem like an anti-pattern, that's because the solution provided actually is an anti-pattern. HttpClient uses external resources to do its job, so it should be disposed (it implements IDisposable), but at the same time it should be used as a singleton. That poses a problem, because there is no clean way to dispose of a singleton held as a static property/field on a class. However, since the class was written by Microsoft, we need not insist on disposing every object we create when the documentation states otherwise.
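One common way to reconcile this in practice, sketched here as an assumption rather than anything stated in the answer, is a single static instance that simply lives for the lifetime of the process:

```csharp
// A single, process-lifetime HttpClient; intentionally never disposed.
// The OS reclaims its sockets when the process exits, which is why the
// "never dispose the singleton" guidance is considered acceptable.
public static class Http
{
    public static readonly HttpClient Client = new HttpClient();
}
```

All callers then share `Http.Client`, which also gives the connection-pool and DNS-cache reuse mentioned in the comments.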

Since you point out that neither Parallel.ForEach nor Task.Run is suited for HttpClient work because it's I/O bound, what would you recommend?

Async/Await

I've added the million portion to the question

So you need to limit the number of parallel tasks:

var maximumNumberOfParallelOperations = 1;
var tasks = new List<Task>();
foreach (var url in urls)
{
    var task = DoWork(url);
    tasks.Add(task);
    while (tasks.Count(t => !t.IsCompleted) >= maximumNumberOfParallelOperations)
    {
        await Task.WhenAny(tasks);
    }
}
await Task.WhenAll(tasks);
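The SemaphoreSlim approach that Peter Bons's comment points to ("Solution #2") expresses the same throttle without repeatedly scanning the task list; a sketch, with the count of 100 being an arbitrary assumption:

```csharp
// Throttle with a semaphore: at most 100 DoWork calls in flight at once.
var throttler = new SemaphoreSlim(initialCount: 100);

var tasks = urls.Select(async url =>
{
    await throttler.WaitAsync();
    try
    {
        await DoWork(url);
    }
    finally
    {
        throttler.Release();
    }
}).ToList();

await Task.WhenAll(tasks);
```

Each pending call waits for a slot instead of starting immediately, so even a million URLs never have more than 100 requests outstanding.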
Erik Philips
  • This too won't scale to a million URLs. – H H May 29 '19 at 21:25
  • @HenkHolterman he needs to reflect that **in the question itself**. Comments do not add to the question criteria. – Erik Philips May 29 '19 at 21:27
  • @ErikPhilips The reason the million URLs weren't mentioned was because my question concerned the number of created threads rather than incorrect usage of HttpClient. – AngryHacker May 29 '19 at 21:30
  • 1
    I've added the million portion to the question. – AngryHacker May 29 '19 at 21:33
  • @ErikPhilips Since you point out that neither Parallel.ForEach nor Task.Run is suited for HttpClient work because it's I/O bound, what would you recommend? – AngryHacker May 29 '19 at 21:34
  • @AngryHacker so my answer is valid until you mention the number of requests. As it stands this will reduce your threads to 1 (until the any call is completed). If you do not want them processed in parallel, I recommend using `ConfigureAwait()` (in a UI application), otherwise you have a different issue. – Erik Philips May 29 '19 at 21:34
  • @AngryHacker updated to limit the number of parallel calls – Erik Philips May 29 '19 at 21:53
  • @ErikPhilips The approach of using a shared httpClient has problems of its own. It never releases the connection to the server. For instance, let's say I am hitting URLs on http://foo.com?productId={x} with different values of x. And let's say that foo.com backend has a server farm of 10 boxes - all behind a load balancer. Once the first .GetAsync goes out, a TCP connection is made to box 3, for example. All subsequent calls will also go to box 3 because it reuses the TCP connection. – AngryHacker May 29 '19 at 22:56
  • @AngryHacker that is a different question. You should ask that as a new question as it has nothing to do with reducing Threads as your original question asked. Link it here if you ask it, there is an answer. – Erik Philips May 30 '19 at 00:26
  • @ErikPhilips Agreed. This question went completely off the rails. – AngryHacker May 30 '19 at 01:01
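The connection-stickiness concern raised in the comment above (every request riding the same TCP connection to one box behind the load balancer) has a direct remedy in .NET Core 2.1+; a sketch, assuming SocketsHttpHandler is available on the target framework:

```csharp
// Recycle pooled connections periodically so DNS changes and
// load-balancer rotation are picked up despite sharing one client.
var handler = new SocketsHttpHandler
{
    PooledConnectionLifetime = TimeSpan.FromMinutes(2)
};
var client = new HttpClient(handler);
```

Expired connections are replaced with fresh ones, so subsequent requests can land on different backend servers without giving up the shared-client benefits.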

Parallel is overkill here, since the HttpClient methods are async. I would suggest calling them asynchronously instead. Otherwise you will create a high number of threads that are doing nothing but waiting.

Also, use a single instance of the HttpClient. This will increase cache hits and reduce the need for the HttpClient to spin up threads to access proxy or DNS information.

var client = new HttpClient();
var tasks = urls.Select( url => client.GetAsync(url) ).ToList();
await Task.WhenAll(tasks);
var results = tasks.Select( task => task.Result ).ToList();
John Wu
  • This is documented as [only suitable for a small number of tasks](https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/concepts/async/start-multiple-async-tasks-and-process-them-as-they-complete#complete-example) – H H May 29 '19 at 21:01
  • I took a look and I believe you are misapplying the article's advice. The author is concerned with how quickly post-processing can begin. OP performs no post-processing so the concern does not apply. – John Wu May 29 '19 at 21:10
  • I've never had such a requirement. The TCP/IP stack would run out of ephemeral ports long before then. – John Wu May 29 '19 at 21:24
  • Exactly. But the OP does have a million URLs. – H H May 29 '19 at 21:25
  • 1
    Should probably refactor to use a queue of some kind, then. I agree that `WhenAll` is not great for that use case. – John Wu May 29 '19 at 21:33
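One simple way to avoid materializing a million in-flight requests, in the spirit of the queue refactor suggested in the last comment, is to drain the URL list in bounded chunks; a sketch, with the batch size being an arbitrary assumption:

```csharp
var client = new HttpClient();
const int batchSize = 100;

// Process the URLs 100 at a time; each batch completes fully
// before the next one starts, bounding sockets and memory.
for (int i = 0; i < urls.Count; i += batchSize)
{
    var batch = urls.Skip(i).Take(batchSize)
                    .Select(url => client.GetAsync(url))
                    .ToList();
    await Task.WhenAll(batch);
}
```

This trades a little throughput (stragglers hold up each batch) for a hard cap on concurrency, which also sidesteps the ephemeral-port exhaustion mentioned earlier.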