I need to scrape data from a website. I have over 1,000 links to access. Previously I divided the links 10 per thread and started 100 threads, each pulling 10. After a few test runs, 100 threads turned out to be the best count for minimizing the time needed to retrieve the content of all the links.

I realized that .NET 4.0 offers better support for multi-threading out of the box, but it sizes the work based on how many cores you have, which in my case does not spawn enough threads. I guess what I am asking is: what is the best way to optimize pulling the 1,000 links? Should I use Parallel.ForEach and let the Parallel extensions control the number of threads that get spawned, or find a way to tell it how many threads to start and divide the work myself?

I have not worked with Parallel before, so my approach may be wrong.

D Stanley
Zoinky
  • Pulling links from the web is not a CPU bound task, so adding lots of threads will probably not help you much. Additionally, spawning 100 threads is a bad idea on most current hardware. Take a look at async for this. – Brian Rasmussen Feb 08 '13 at 16:52
  • @BrianRasmussen: For heavily network IO bound tasks, that's not necessarily true. As long as the thread pool is not exhausted, allowing more concurrent requests is probably a good thing. If you have 100 threads and an average response time of 1 second, that's only at most 100 context switches per second on a single core system, or 25 on a quad core system. Of course those are all assumed numbers, but looks like the OP has tried a variety of parameters and settled on those being best for his use case and hardware. – Eric J. Feb 08 '13 at 16:54
  • @EricJ. That's why I say "probably". Anyway, I would still go with an async solution before spinning up 100 threads. – Brian Rasmussen Feb 08 '13 at 16:56
  • @BrianRasmussen: Async would be a very good alternative. Maybe post an alternative answer? – Eric J. Feb 08 '13 at 16:57
  • @BrianRasmussen Yeah, but if you don't care about the wasted 100 MB of memory and you can't use C# 5.0, then creating 100 threads is much better than doing it asynchronously. – svick Feb 08 '13 at 18:21

4 Answers

You can set the MaxDegreeOfParallelism property on the ParallelOptions passed to Parallel.ForEach to cap the number of threads that will be spawned.

Here's the code snippet:

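    // Cap the loop at five concurrent workers (an upper bound, not a guarantee).
    ParallelOptions opt = new ParallelOptions();
    opt.MaxDegreeOfParallelism = 5;

    // Constants.RootFolder and MyMethod are defined elsewhere in the answerer's project.
    Parallel.ForEach(Directory.GetDirectories(Constants.RootFolder), opt, MyMethod);
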
whihathac
  • Note that this only controls the *maximum* number of threads - the system is still able to use fewer threads if it decides so. `MaxDegreeOfParallelism` is not a guarantee, only an upper bound. And if you do not set a value here, the default is based on the number of cores, system load, etc. – GalacticCowboy Apr 25 '16 at 20:15
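
To make the "upper bound, not a guarantee" point concrete, here is a minimal sketch that counts how many loop bodies run at the same time. The class and method names are hypothetical, and `Thread.Sleep` is only a stand-in for a blocking web request; the observed peak depends on the machine and the thread pool, so the only thing the sketch relies on is that it never exceeds the configured limit.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class DegreeOfParallelismDemo
{
    // Runs a simulated I/O-bound job over `itemCount` items and returns the
    // highest number of loop bodies observed running at the same time.
    public static int MeasurePeakConcurrency(int itemCount, int maxDegree)
    {
        int running = 0;
        int peak = 0;
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxDegree };

        Parallel.ForEach(Enumerable.Range(0, itemCount), options, item =>
        {
            int now = Interlocked.Increment(ref running);

            // Raise the recorded peak unless another body already stored a higher one.
            int seen;
            while (now > (seen = Volatile.Read(ref peak)) &&
                   Interlocked.CompareExchange(ref peak, now, seen) != seen)
            {
            }

            Thread.Sleep(20); // stand-in for a blocking web request (assumption)
            Interlocked.Decrement(ref running);
        });

        return peak;
    }

    public static void Main()
    {
        int peak = MeasurePeakConcurrency(itemCount: 200, maxDegree: 5);
        Console.WriteLine("Peak concurrent bodies (never more than 5): " + peak);
    }
}
```

Running the same measurement with a larger `maxDegree` is one way to see how many workers the scheduler is actually willing to use on a given machine.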

In general, Parallel.ForEach() is quite good at optimizing the number of threads. It accounts for the number of cores in the system, but also takes into account what the threads are doing (CPU bound, IO bound, how long the method runs, etc.).

You can cap the maximum degree of parallelism, but ParallelOptions offers no setting to force a minimum number of threads to be used.

Make sure your benchmarks are correct and can be compared in a fair manner (e.g. same websites, allow for a warm-up period before you start measuring, and do many runs since response time variance can be quite high scraping websites). If after careful measurement your own threading code is still faster, you can conclude that you have optimized for your particular case better than .NET and stick with your own code.
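
As a rough illustration of that advice, the sketch below runs a candidate a few times unmeasured as a warm-up, then times several runs and reports the median. The harness, its names and the `Thread.Sleep` placeholder are assumptions for illustration; the real scraping routine would be passed in as the `Action`.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;

public static class ScrapeBenchmark
{
    // Runs `action` `warmupRuns` times unmeasured, then `measuredRuns` times
    // measured, and returns the elapsed milliseconds of each measured run.
    public static double[] Measure(Action action, int warmupRuns, int measuredRuns)
    {
        for (int i = 0; i < warmupRuns; i++)
        {
            action();
        }

        var timings = new double[measuredRuns];
        for (int i = 0; i < measuredRuns; i++)
        {
            Stopwatch watch = Stopwatch.StartNew();
            action();
            watch.Stop();
            timings[i] = watch.Elapsed.TotalMilliseconds;
        }
        return timings;
    }

    // The median is less sensitive than the mean to one unusually slow response.
    // Expects at least one value.
    public static double Median(double[] values)
    {
        double[] sorted = values.OrderBy(v => v).ToArray();
        int mid = sorted.Length / 2;
        return sorted.Length % 2 == 1
            ? sorted[mid]
            : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    public static void Main()
    {
        // Hypothetical placeholder; replace with the scraping routine under test.
        Action candidate = () => Thread.Sleep(10);

        double[] timings = Measure(candidate, warmupRuns: 2, measuredRuns: 9);
        Console.WriteLine("Median: {0:F1} ms, min: {1:F1} ms, max: {2:F1} ms",
            Median(timings), timings.Min(), timings.Max());
    }
}
```

Measuring the hand-rolled threads and the Parallel.ForEach version with the same harness, against the same set of links, keeps the comparison fair.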

Lloyd
Eric J.

Something worth checking out is the TPL Dataflow library.

DataFlow on MSDN.

See "Nesting await in Parallel.ForEach", which explains:

The whole idea behind Parallel.ForEach() is that you have a set of threads and each processes part of the collection. As you noticed, this doesn't work with async-await, where you want to release the thread for the duration of the async call.

Also, the walkthrough Creating a Dataflow Pipeline specifically sets up and processes multiple web page downloads. TPL Dataflow really was designed for that scenario.
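
A minimal sketch of how that could look with a single `ActionBlock<T>`, assuming C# 5 or later and a reference to the System.Threading.Tasks.Dataflow package. This is not the code from the MSDN walkthrough; the class name, the `fetchAsync` parameter and the example URLs are hypothetical, and the fetcher in `Main` is faked with `Task.Delay` so the sketch runs without network access. In real use, something like `HttpClient.GetStringAsync` could be passed instead.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class DataflowScraper
{
    // Sends every URL through `fetchAsync`, with at most `maxConcurrency`
    // fetches in flight at a time, and returns a url -> content map.
    public static async Task<IDictionary<string, string>> FetchAllAsync(
        IEnumerable<string> urls,
        Func<string, Task<string>> fetchAsync,
        int maxConcurrency)
    {
        var results = new ConcurrentDictionary<string, string>();

        var block = new ActionBlock<string>(
            async url =>
            {
                string content = await fetchAsync(url);
                results[url] = content;
            },
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = maxConcurrency });

        foreach (string url in urls)
        {
            block.Post(url);
        }

        block.Complete();       // signal that no more input will arrive
        await block.Completion; // rethrows if a fetch faulted the block
        return results;
    }

    public static void Main()
    {
        // Hypothetical URLs and a fake fetcher standing in for a real HTTP call.
        List<string> urls = Enumerable.Range(1, 20)
            .Select(i => "http://example.com/page/" + i)
            .ToList();

        Func<string, Task<string>> fakeFetch = async url =>
        {
            await Task.Delay(50);
            return "content of " + url;
        };

        IDictionary<string, string> pages =
            FetchAllAsync(urls, fakeFetch, maxConcurrency: 5).GetAwaiter().GetResult();
        Console.WriteLine("Fetched " + pages.Count + " pages");
    }
}
```

Because the delegate is asynchronous, no thread is held while a request is waiting, so `MaxDegreeOfParallelism` here limits concurrent requests rather than threads.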

Community
Cameron MacFarland
  • That's a good pattern. I added a summary from the linked answer so that your answer stands better on its own. – Eric J. Feb 08 '13 at 17:06
  • There are also [ways to write a `ForEachAsync`](http://blogs.msdn.com/b/pfxteam/archive/2012/03/05/10278165.aspx), but IMO Dataflow is a great fit for this problem. +1. – Stephen Cleary Feb 08 '13 at 17:17
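
For comparison, one simple way to throttle asynchronous work without Dataflow is a `SemaphoreSlim` gate around each call. This is a sketch of that general idea, not the implementation from the linked post; it assumes C# 5 or later, and the names (`ThrottledAsync`, `fetchAsync`) and the fake fetcher in `Main` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledAsync
{
    // Starts a fetch for every URL but lets at most `maxConcurrency` of them
    // run at once. Results come back in the same order as the input.
    public static async Task<string[]> FetchAllAsync(
        IEnumerable<string> urls,
        Func<string, Task<string>> fetchAsync,
        int maxConcurrency)
    {
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            IEnumerable<Task<string>> tasks = urls.Select(async url =>
            {
                await gate.WaitAsync(); // wait for a free slot without blocking a thread
                try
                {
                    return await fetchAsync(url);
                }
                finally
                {
                    gate.Release();
                }
            });

            // WhenAll completes only after every task has finished and released the gate.
            return await Task.WhenAll(tasks);
        }
    }

    public static void Main()
    {
        // Hypothetical URLs and a fake fetcher standing in for a real HTTP call.
        string[] urls = Enumerable.Range(1, 20)
            .Select(i => "http://example.com/page/" + i)
            .ToArray();

        Func<string, Task<string>> fakeFetch = async url =>
        {
            await Task.Delay(50);
            return "content of " + url;
        };

        string[] pages = FetchAllAsync(urls, fakeFetch, maxConcurrency: 5)
            .GetAwaiter().GetResult();
        Console.WriteLine("Fetched " + pages.Length + " pages");
    }
}
```

With 1,000 links this creates 1,000 lightweight tasks up front, which is usually acceptable; Dataflow becomes more attractive once the work grows into a multi-stage pipeline.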

It's hard to say without seeing your code and how the collection is defined. I've found that Parallel.Invoke is the most flexible. Have you tried MSDN? It sounds like you are looking for the Parallel.For Method (Int32, Int32, Action<Int32, ParallelLoopState>) overload.

Chris Noffke