I'm building a 'web parser' of sorts: it targets just one website, but it needs to parse many different pages at the same time.

There may be 300,000 pages that I need to parse relatively quickly. I'm only grabbing a tiny amount of information from each page, and a page takes at most ~3 seconds on my network, but done sequentially that is 900,000 seconds, or roughly 10 days, which is terrible performance. I would like to reduce this to a couple of hours at most; I'm flexible about how the time scales with the number of requests, but it still needs to be 'fast'. I also know that I can't fire all 300,000 requests at once, or the website will block all of my requests, so there will have to be a delay of a few seconds between requests.

I currently process everything in a single foreach loop, with no multithreading whatsoever. I know I could take advantage of it, but I'm not sure which path to take: thread pools, or some other threading system or design.

Basically, I'm looking for someone to point me in the right direction: a threading system or structure that will cut down the time it takes to parse that many pages on my end.

Thanks

sl133
  • http://msdn.microsoft.com/en-us/library/dd537609.aspx – Chris Pfohl May 14 '13 at 17:44
  • Do you know all 300,000 links to start with, or are they discovered as you go along? – Jon Skeet May 14 '13 at 17:44
  • 3
    Look at the higher-level .NET multithreading concepts like `async` and the Task Parallel Library. – Robert Harvey May 14 '13 at 17:45
  • @jon - I know all 300,000 links to start with, so it's not really a crawler, just a parser. – sl133 May 14 '13 at 17:46
  • Sounds like `Parallel.ForEach` is your friend then. You should find out what traffic your website is prepared to handle. – Jon Skeet May 14 '13 at 17:47
  • In Parallel.ForEach, does it just run, let's say, 150,000 on one thread and 150,000 on another? It would seem more efficient to take advantage of 8 threads and split the work further, so each thread has less to parse and finishes faster. – sl133 May 14 '13 at 17:50
  • @sl133 I suggest you try using it and see whether or not it results in effective scheduling. It's generally quite good. If you find that you have a problem then consider playing with the configuration parameters, and if that doesn't help then consider using a more customizable method. Don't worry too much about it until you find that you have a problem though. – Servy May 14 '13 at 17:54 (see the sketch after these comments)
  • Thanks, I'll try it on a smaller set of pages, compare it with the non-parallel version, and see how much it improves performance. – sl133 May 14 '13 at 17:57
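
To make the comment thread concrete, here is a minimal sketch of the Parallel.ForEach approach being suggested. It is not from the discussion itself: allLinks, ParsePage, the cap of 8 worker threads, and the politeness delay are all assumptions made for illustration.

using System;
using System.Collections.Generic;
using System.Net;
using System.Threading;
using System.Threading.Tasks;

class ParserSketch
{
    static void Run(List<string> allLinks)          // allLinks: your 300,000 URLs (hypothetical)
    {
        // Cap how many pages are processed at once so the site isn't flooded.
        var options = new ParallelOptions { MaxDegreeOfParallelism = 8 };

        Parallel.ForEach(allLinks, options, link =>
        {
            using (var client = new WebClient())
            {
                string html = client.DownloadString(link);
                ParsePage(html);                    // your parsing logic (hypothetical)
            }

            // Crude politeness delay per worker; tune or replace with a real rate limiter.
            Thread.Sleep(TimeSpan.FromSeconds(1));
        });
    }

    static void ParsePage(string html) { /* extract the data you need */ }
}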

2 Answers

Check out the answer to this question; it sounds like Parallel.ForEach is what you want.

There are various other methods of achieving what you want to do in a multi-threaded fashion. To give yourself an idea of how this works:

  1. Download LINQPad. (It should be a prerequisite for any C# developer, imho!)
  2. In the "Samples" pane, choose "Download/import more samples..." and ensure you have "Asynchronous Functions in C#" downloaded.
  3. Work through the samples, seeing how they fit together.

In fact, here is one of the asynchronous examples that works with Uris:

// The await keyword is really useful when you want to run something in a loop. For instance:

string[] uris =
{
    "http://linqpad.net",
    "http://linqpad.net/downloadglyph.png",
    "http://linqpad.net/linqpadscreen.png",
    "http://linqpad.net/linqpadmed.png",
};

// Try doing the following without the await keyword!

int totalLength = 0;
foreach (string uri in uris)
{
    // Dispose each WebClient once its download completes.
    using (var client = new WebClient())
    {
        string html = await client.DownloadStringTaskAsync (new Uri (uri));
        totalLength += html.Length;
    }
}
totalLength.Dump();

// The continuation is not just 'totalLength += html.Length', but the rest of the loop! (And that final
// call to 'totalLength.Dump()' at the end.)

// Logically, execution EXITS THE METHOD and RETURNS TO THE CALLER upon reaching the await statement. Rather
// like a 'yield return' (in fact, the compiler uses the same state-machine engine to rewrite asynchronous
// functions as it does iterators).
//
// When the task completes, the continuation kicks off and execution jumps back into the middle of the 
// loop - right where it left off!
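
The loop above still downloads one page at a time; each await completes before the next download starts. For the question's 300,000 pages you would want several downloads in flight at once, but capped so the site doesn't block you. One common pattern is a SemaphoreSlim gate combined with Task.WhenAll. The sketch below is an illustration under that assumption, not part of the LINQPad sample; TotalLengthAsync and maxConcurrent are invented names.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Threading;
using System.Threading.Tasks;

static class ThrottledDownloader
{
    static async Task<int> TotalLengthAsync(IEnumerable<string> uris, int maxConcurrent)
    {
        // Raise the per-host HTTP connection limit (it defaults to 2 for console apps).
        ServicePointManager.DefaultConnectionLimit = maxConcurrent;

        var gate = new SemaphoreSlim(maxConcurrent);   // at most maxConcurrent downloads at once

        var tasks = uris.Select(async uri =>
        {
            await gate.WaitAsync();                    // wait for a free slot
            try
            {
                using (var client = new WebClient())
                {
                    string html = await client.DownloadStringTaskAsync(new Uri(uri));
                    return html.Length;
                }
            }
            finally
            {
                gate.Release();                        // hand the slot to the next page
            }
        }).ToList();

        int[] lengths = await Task.WhenAll(tasks);     // all pages, downloaded concurrently
        return lengths.Sum();
    }
}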
Moo-Juice

As others have mentioned, parallel processing can get you started. With .NET 4.0 or later it's baked right in and very easy to use: if you have a foreach loop and want to parallelize it, add a using directive for System.Threading.Tasks and switch to Parallel.ForEach. Keep in mind that while there is no hard limit on the number of threads generated, there are practical limitations. The built-in .NET library shields you from the details and manages the thread pool for you automatically.

If you decide to alter the parameters, keep in mind the following:

Each thread uses about 1 MB of stack space, which means every thousand threads will consume around 1 GB of memory. In practice, though, creating 1,000 threads will most likely make processing slower than using just one, because the usual guideline is one thread per core/processor. So on a quad-core/quad-processor machine, a good thread count to start with is 16, and you can tune it up or down depending on exactly what your program is doing, if you want to override how the library schedules things or roll your own. I don't recommend doing that, though, as the library does a good job of handling those internals for you automatically.
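
As a rough illustration of that tuning, here is a sketch that times a sample batch at a given degree of parallelism, so you can compare values empirically before committing to one. TimeRun, sampleLinks, and ParsePage are hypothetical placeholders, not code from this answer.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;

class TuningSketch
{
    static void TimeRun(IList<string> sampleLinks, int degree)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = degree };
        var sw = Stopwatch.StartNew();

        Parallel.ForEach(sampleLinks, options, link => ParsePage(link));

        sw.Stop();
        Console.WriteLine("{0} threads: {1} ms", degree, sw.ElapsedMilliseconds);
    }

    static void ParsePage(string link) { /* download and parse one page */ }
}

Running TimeRun on the same sample with, say, 4, 8, and 16 lets you compare wall-clock times and pick a degree of parallelism that suits your machine and network.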

I would also mention that one machine probably will not meet your needs, so your application should scale horizontally: it should be able to add new machines and parse more pages than a single machine could by itself (multithreading or otherwise).

Jon Raynor