4

I've been thinking about making my web scraper multithreaded, not with plain threads (e.g. Thread scrape = new Thread(Function);) but with something like a thread pool, where there can be a very large number of threads.

My scraper works by using a for loop to scrape pages.

for (int i = (int)pagesMin.Value; i <= (int)pagesMax.Value; i++)

So how could I multithread the function (that contains the loop) with something like a thread pool? I've never used thread pools before, and the examples I've seen have been quite confusing or obscure to me.


I've modified my loop into this:

int min = (int)pagesMin.Value;
int max = (int)pagesMax.Value;
ParallelOptions pOptions = new ParallelOptions();
pOptions.MaxDegreeOfParallelism = Properties.Settings.Default.Threads;
Parallel.For(min, max, pOptions, i =>
{
    //Scraping
});

Would that work or have I got something wrong?

Dharman
Prime

5 Answers

5

The problem with using pool threads is that they spend most of their time waiting for a response from the Web site. And the problem with using Parallel.ForEach is that it limits your parallelism.

I got the best performance by using asynchronous Web requests. I used a Semaphore to limit the number of concurrent requests, and the callback function did the scraping.

The main thread creates the Semaphore, like this:

Semaphore _requestsSemaphore = new Semaphore(20, 20);

The 20 was derived by trial and error. It turns out that the limiting factor is DNS resolution and, on average, it takes about 50 ms. At least, it did in my environment. 20 concurrent requests was the absolute maximum. 15 is probably more reasonable.

The main thread essentially loops, like this:

while (true)
{
    _requestsSemaphore.WaitOne();
    string urlToCrawl = DequeueUrl();  // however you do that
    var request = (HttpWebRequest)WebRequest.Create(urlToCrawl);
    // set request properties as appropriate
    // and then do an asynchronous request
    request.BeginGetResponse(ResponseCallback, request);
}

The ResponseCallback method, which will be called on a pool thread, does the processing, disposes of the response, and then releases the semaphore so that another request can be made.

void ResponseCallback(IAsyncResult ir)
{
    try
    {
        var request = (HttpWebRequest)ir.AsyncState;
        // you'll want exception handling here
        using (var response = (HttpWebResponse)request.EndGetResponse(ir))
        {
            // process the response here.
        }
    }
    finally
    {
        // release the semaphore so that another request can be made
        _requestsSemaphore.Release();
    }
}

The limiting factor, as I said, is DNS resolution. It turns out that DNS resolution is done on the calling thread (the main thread in this case). See Is this really asynchronous? for more information.

This is simple to implement and works quite well. It's possible to get even more than 20 concurrent requests, but doing so takes quite a bit of effort, in my experience. I had to do a lot of DNS caching and ... well, it was difficult.

You can probably simplify the above by using Task and the new async stuff in C# 5.0 (.NET 4.5). I'm not familiar enough with those to say how, though.
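For reference, here is a rough sketch of the same throttling idea expressed with Task-based async (this assumes .NET 4.5's HttpClient; the method and field names are illustrative, not from the answer above). SemaphoreSlim supports WaitAsync, so no thread blocks while waiting for a slot:

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

class AsyncScraper
{
    // Same throttle as above: at most 20 requests in flight.
    private static readonly SemaphoreSlim _requestsSemaphore = new SemaphoreSlim(20, 20);
    private static readonly HttpClient _client = new HttpClient();

    static async Task ScrapeAsync(string url)
    {
        await _requestsSemaphore.WaitAsync();
        try
        {
            string html = await _client.GetStringAsync(url);
            // process the response here
        }
        finally
        {
            // release the slot so that another request can be made
            _requestsSemaphore.Release();
        }
    }
}
```

You could then start one task per URL and `await Task.WhenAll(...)` on them; the semaphore, not the task count, bounds concurrency.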

Jim Mischel
3

It's better to go with the TPL, namely Parallel.ForEach using an overload with a Partitioner. It manages workload automatically.

FYI: you should understand that more threads doesn't mean faster. I'd advise you to run some tests comparing an unparameterized Parallel.ForEach with a user-defined one.

Update

    public void ParallelScraper(int fromInclusive, int toExclusive,
                                Action<int> scrape, int desiredThreadsCount)
    {
        int chunkSize = (toExclusive - fromInclusive +
            desiredThreadsCount - 1) / desiredThreadsCount;
        ParallelOptions pOptions = new ParallelOptions
        {
            MaxDegreeOfParallelism = desiredThreadsCount
        };

        Parallel.ForEach(Partitioner.Create(fromInclusive, toExclusive, chunkSize),
            rng =>
            {
                for (int i = rng.Item1; i < rng.Item2; i++)
                    scrape(i);
            });
    }

Note: you could do better with async in your situation.

Oscar Mederos
  • I'm not quite understanding why I should use `Parallel.ForEach` and not `Parallel.For` – Prime Apr 20 '13 at 01:08
  • `Parallel.ForEach` gives you the ability to choose the partition size automatically or set it up manually. Adding parallel options to this, you can force a desired level of parallelism (number of threads) if needed. Generally, the TPL manages the number of threads effectively (scaling up to the number of cores, if I'm not mistaken), but for some IO-bound tasks you may want to use more threads than cores. – Ivan Nechipayko Apr 20 '13 at 02:04
2

Since your web scraper works like a for loop, you could have a look at Parallel.ForEach(), which is similar to a foreach loop: it iterates over enumerable data and uses multiple threads to invoke the loop body.

For more details, see Parallel loops

Update:

Parallel.For() is very similar to Parallel.ForEach(); which one you use depends on context, just as with for versus foreach loops.
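A minimal sketch of the difference, using the page-range loop from the question (Scrape is a placeholder for the scraping body; note that Parallel.For takes an exclusive upper bound, so max + 1 preserves the original i <= max semantics):

```csharp
using System.Linq;
using System.Threading.Tasks;

// Parallel.For mirrors a for loop over an index range
// (the upper bound is exclusive, hence max + 1).
Parallel.For(min, max + 1, i => Scrape(i));

// Parallel.ForEach mirrors a foreach loop over any IEnumerable<T>.
Parallel.ForEach(Enumerable.Range(min, max - min + 1), i => Scrape(i));
```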

cat916
0

This is a perfect scenario for TPL Dataflow's ActionBlock. You can easily configure it to limit concurrency. Here is one of the examples from the documentation:

var downloader = new ActionBlock<string>(async url =>
{
    byte [] imageData = await DownloadAsync(url);
    Process(imageData);
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 5 });

downloader.Post("http://msdn.com/concurrency");
downloader.Post("http://blogs.msdn.com/pfxteam");

You can read about ActionBlock (including the referenced example) by downloading Introduction to TPL Dataflow.
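Adapted to the page-number loop from the question, the block could be fed and drained like this (a sketch only; ScrapePageAsync is a hypothetical placeholder for your scraping logic):

```csharp
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

var scraper = new ActionBlock<int>(async page =>
{
    await ScrapePageAsync(page);  // placeholder for your scraping logic
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 5 });

for (int i = min; i <= max; i++)
    scraper.Post(i);              // queue each page number

scraper.Complete();               // signal that no more pages will be posted
await scraper.Completion;         // wait for all posted pages to finish
```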

David Peden
  • I don't think I have Dataflow functions, I'm using Visual Studio 2010 so I only have up to .NET 4.0 not 4.5 – Prime Apr 20 '13 at 03:53
  • I added the .NET 4.0 tag to your post since that is relevant information. See this post if you're still interested in this option: http://stackoverflow.com/questions/15338907/where-can-i-find-a-tpl-dataflow-version-for-4-0. – David Peden Apr 20 '13 at 04:13
0

During the tests for our "Crawler-Lib Framework" I found that parallel, TPL, or threading attempts won't get you the throughput you want. You get stuck at 300-500 requests per second on a local machine. If you want to execute thousands of requests in parallel, you must execute them with the async pattern and process the results in parallel. Our Crawler-Lib Engine (a workflow-enabled request processor) does this at about 10,000 - 20,000 requests/second on a local machine. If you want a fast scraper, don't try to use the TPL. Instead, use the async pattern (Begin... End...) and start all your requests in one thread.

If many of your requests tend to time out, let's say after 30 seconds, the situation is even worse. In this case the TPL-based solutions will get an ugly throughput of 5, or even 1, requests per second. The async pattern gives you at least 100-300 requests per second. The Crawler-Lib Engine handles this well and gets the maximum possible number of requests. Let's say your TCP/IP stack is configured to allow 60,000 outbound connections (65,535 is the maximum, because every connection needs an outbound port); then you will get a throughput of 60,000 connections / 30-second timeout = 2,000 requests/second.

Thomas Maierhofer