
Background

I have to download about 16k documents and the same number of HTML pages from the internet. This number will increase in the future. Currently I am just using Parallel.ForEach to download and work on the data in parallel. This, however, does not seem to fully utilize my resources, so I am planning to bring async/await into play to have as many downloads running asynchronously as possible, although I will probably have to limit that.
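
Roughly what the current code looks like (a minimal sketch of the approach described above; the class and method names are just placeholders):

```csharp
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

class Downloader
{
    // A single shared HttpClient, reused for every request.
    private static readonly HttpClient Client = new HttpClient();

    static void DownloadAll(IEnumerable<string> urls)
    {
        Parallel.ForEach(urls, url =>
        {
            // Each iteration blocks a thread-pool thread while its download runs.
            string content = Client.GetStringAsync(url).GetAwaiter().GetResult();
            Process(content);
        });
    }

    static void Process(string content)
    {
        // Work on the downloaded data here.
    }
}
```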

Actual Question

How many open connections can a single HttpClient have? What other factors will I have to keep in mind when creating that many connections? I am aware that I should reuse the same HttpClient and I have also read this answer, but I have doubts that I can really have several billion connections open at once.

Uwe Keim
Jerome Reinländer
  • Another limiting factor to consider is the optimal number of running threads. Microsoft recommends limiting to 25 threads per virtual processor core. Otherwise you start hitting diminishing returns on the performance gains. – Kevin Jul 11 '18 at 12:51
  • 1
    @Kevin - async/await means you don't need a thread / connection. So that argument doesn't hold. – bommelding Jul 11 '18 at 12:59
  • @Franck - it is not about cores/threads at all. [NodeJS can handle 1000s](https://www.quora.com/How-many-connections-can-a-node-js-server-handle) on a single thread. .NET is catching up with that. – bommelding Jul 11 '18 at 13:00
  • @bommelding the argument still holds as far as regulating how many concurrent threads you have going (if you are programmatically spinning up threads), however you are correct the there is theoretically no limit to how many async calls you can have pending, and the .Net Framework is pretty good at limiting the concurrency and handling thread processor affinity for you in those circumstances. – Kevin Jul 11 '18 at 13:02
  • @Kevin - your argument still only holds against the Parallel.ForEach() approach, and the OP knows that. There is no reason to assume any parallelism in the new approach. – bommelding Jul 11 '18 at 13:05
  • @bommelding No matter what you use: `Parallel.ForEach`, `async`/`await`, or [insert any threading method here], if you only have 1 core on the computer, only one of those threads gets CPU cycles at a time. You create context switching that eats up extra cycles, that's it. – Franck Jul 11 '18 at 13:08
  • 1
    @Franck : `async await` is not a threading method. The question is about 1000s of connections, on 1 or 2 threads. – bommelding Jul 11 '18 at 13:11
  • @JeromeReinländer - you will have to try (test) on your target system. Lots of network layers could throttle this. But it will almost certainly do better than Parallel.ForEach(). – bommelding Jul 11 '18 at 13:18
  • @JeromeReinländer It depends on the platform. Please specify your OS and whether you're on full framework or .NET Core – Todd Menier Jul 11 '18 at 14:15

1 Answer


First, good call on switching from Parallel.ForEach to async/await. By breaking from the constraints of threads, you'll be able to increase concurrency by orders of magnitude.

> I have doubts that I can really have several billion connections open at once.

Let's say you could. Do you think the job would complete any faster than if you had, say, 1000 open at once? The limitation you're going to bump up against first is bandwidth (or possibly the server refusing requests), not concurrent connections. So I would suggest the max number of connections you can possibly have open at once isn't even relevant if your goal is to complete the job as fast as possible.

That said, there are default limits imposed by .NET. Assuming you're on the full framework or .NET Core 2.x, the limit can be changed programmatically via ServicePointManager.DefaultConnectionLimit, which has a default value of just 2. Set it to something much bigger.
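
For example (a minimal sketch; the value 100 and the class/method names are just placeholders to tune for your workload):

```csharp
using System.Net;

static class HttpSetup
{
    // Call once at startup, before the first request goes out.
    public static void RaiseConnectionLimit()
    {
        // Default is 2 per host on the full framework; raise it for bulk downloads.
        ServicePointManager.DefaultConnectionLimit = 100; // illustrative value, adjust as needed
    }
}
```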

Next I would suggest setting up your code to perform the downloads concurrently up to some limit, using either SemaphoreSlim or TPL Dataflow. Both approaches are well covered in answers to this question. Then start experimenting until you come up with an optimal number. Hard to say what that is. Maybe start with 50. If it goes well, increase it to 100 and see if the overall job completes any faster. If you start getting socket exceptions or errors returned from the server, dial it down.
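
A rough sketch of the SemaphoreSlim variant (the class name, method name, and starting limit are illustrative, not a definitive implementation):

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

class ThrottledDownloader
{
    private static readonly HttpClient Client = new HttpClient();

    // Downloads all URLs, never running more than maxConcurrency requests at once.
    static async Task DownloadAllAsync(IEnumerable<string> urls, int maxConcurrency)
    {
        using (var throttle = new SemaphoreSlim(maxConcurrency))
        {
            var tasks = urls.Select(async url =>
            {
                await throttle.WaitAsync();
                try
                {
                    string content = await Client.GetStringAsync(url);
                    Process(content);
                }
                finally
                {
                    throttle.Release();
                }
            });

            await Task.WhenAll(tasks); // materializes and awaits all downloads
        }
    }

    static void Process(string content)
    {
        // Work on the downloaded data here.
    }
}
```

Kick it off with something like `await DownloadAllAsync(urls, 50);` and adjust the second argument as you experiment.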

Todd Menier
  • 3
    A quick note to whoever reads this in the future: be careful when switching to `async`. Apparently this can be **massively** faster and might get you blocked by external endpoints. – Jerome Reinländer Jul 25 '18 at 12:38
  • Although this answer provides a lot of insight into part of the question, it still fails to answer the main question. – mr5 Jan 19 '20 at 16:23
  • @ToddMenier I don't have a better answer but I think somebody already got the specific numbers: https://learn.microsoft.com/en-us/archive/blogs/timomta/controlling-the-number-of-outgoing-connections-from-httpclient-net-core-or-full-framework – mr5 Jan 19 '20 at 19:24
  • 4
    @mr5 That link covers the default connection limit, which I covered as well, and doesn't at all answer the question of how many connections are actually possible. You can remove the limit, but you're still bound by other factors, as I already explained. There isn't a hard number that applies in every scenario, and I'm sorry if you find that fact unsatisfactory. – Todd Menier Jan 20 '20 at 14:16
  • 1
    Yeah... watch out if you are using HttpClient with Azure Functions. The maximum limit of active connections is 600 currently and the limit of total connections is 1200. Also, .NET (newer .NET Core 3) is not using the ServicePointManager anymore. – El Mac Sep 16 '21 at 16:55
  • I switched my download function to be `async`, and it runs about `310` times faster now. – Jonathan Barraone Nov 26 '22 at 18:30