Best practices for parallelizing a web crawler in .NET 4.0

Question

I need to download a lot of pages through proxies. What is the best practice for building a multi-threaded web crawler?

Is Parallel.For/ForEach good enough, or is it better suited to heavy CPU tasks?

What do you think of the following code?

var multyProxy = new MultyProxy();
multyProxy.LoadProxyList();

Task[] taskArray = new Task[1000];

for (int i = 0; i < taskArray.Length; i++)
{
    taskArray[i] = new Task(
        (obj) =>
        {
            multyProxy.GetPage((string)obj);
        },
        (object)"http://google.com");
    taskArray[i].Start();
}

Task.WaitAll(taskArray);

It performs horribly: it's very slow and I don't know why.

This code also performs badly.

System.Threading.Tasks.Parallel.For(
    0, 1000,
    new System.Threading.Tasks.ParallelOptions() { MaxDegreeOfParallelism = 30 },
    loop =>
    {
        multyProxy.GetPage("http://google.com");
    });

Well, I think I am doing something wrong.

When I start my program, it uses only 2%-4% of the network bandwidth.

Martin Ernst · Accepted Answer · 2012-06-18T12:55:06.010

You are basically using up CPU-bound threads for I/O-bound tasks; i.e., even though you're parallelizing your operations, each one still ties up a ThreadPool thread, which is mainly intended for CPU-bound operations.

Basically, you need to use an async pattern for downloading the data so that it uses I/O completion ports instead; if you're using WebRequest, that means the BeginGetResponse() and EndGetResponse() methods.

I would suggest looking at Reactive Extensions to do this, e.g.:
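If you would rather not take a dependency on Rx, the same Begin/End pair can be wrapped in a Task on .NET 4.0. The following is only a minimal sketch of that idea, not the answer's own method; the class name AsyncDownloader and the method name DownloadAsync are made up for illustration. Note that the body is still read with a blocking ReadToEnd inside the continuation, so only the wait for the response headers is truly asynchronous here.

```csharp
using System;
using System.IO;
using System.Net;
using System.Threading.Tasks;

// Hypothetical helper, not part of the original answer.
static class AsyncDownloader
{
    // Starts the request via BeginGetResponse/EndGetResponse so that no
    // thread is blocked while waiting for the server to respond.
    public static Task<string> DownloadAsync(string url)
    {
        WebRequest req = WebRequest.Create(url);

        return Task<WebResponse>.Factory
            .FromAsync(req.BeginGetResponse, req.EndGetResponse, null)
            .ContinueWith(t =>
            {
                // t.Result rethrows (wrapped in AggregateException) if the request failed.
                using (WebResponse rsp = t.Result)
                using (var reader = new StreamReader(rsp.GetResponseStream()))
                {
                    // Synchronous read of the body; runs on a pool thread.
                    return reader.ReadToEnd();
                }
            });
    }
}
```

With a list of URLs in hand, something like Task.WaitAll(urls.Select(u => AsyncDownloader.DownloadAsync(u)).ToArray()) would start all downloads and wait for them to finish.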

IEnumerable<string> urls = ... get your urls here...;
var results = from url in urls.ToObservable()
             let req = WebRequest.Create(url)
             from rsp in Observable.FromAsyncPattern<WebResponse>(
                  req.BeginGetResponse, req.EndGetResponse)()
             select ExtractResponse(rsp);

Here ExtractResponse probably just uses StreamReader.ReadToEnd to get the response body as a string, if that's what you're after.

You can also look at the .Retry operator, which makes it easy to retry a few times if you get connection issues and the like.
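ExtractResponse is not defined in the answer, so the following is only a guess at what it might look like, based on the StreamReader.ReadToEnd hint. The containing class name ResponseReader is an assumption, and the sketch relies on StreamReader's default encoding (UTF-8 with byte-order-mark detection) rather than the charset the server declares.

```csharp
using System.IO;
using System.Net;

// Hypothetical container class; only the method name comes from the answer.
static class ResponseReader
{
    // Reads the whole response body as text and disposes the response,
    // which releases the underlying connection for reuse.
    public static string ExtractResponse(WebResponse rsp)
    {
        using (rsp)
        using (Stream stream = rsp.GetResponseStream())
        using (var reader = new StreamReader(stream))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Disposing the response matters in a crawler: responses that are never closed keep their connections occupied, and later requests to the same host can stall once the connection limit is reached.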
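One way Retry could be applied is sketched below, assuming the Rx package that provides the System.Reactive.Linq namespace is referenced. The class name RetryingFetch and the parameter name attempts are made up for illustration. The request is created inside Observable.Defer because a WebRequest cannot be reused after it has failed, so each retry needs a fresh one.

```csharp
using System;
using System.Net;
using System.Reactive.Linq;

// Hypothetical helper, not part of the original answer.
static class RetryingFetch
{
    // Returns an observable that issues the request when subscribed to and
    // resubscribes on error; as far as I know, the count passed to Retry is
    // the total number of attempts.
    public static IObservable<WebResponse> GetResponse(string url, int attempts)
    {
        return Observable.Defer<WebResponse>(() =>
        {
            // Runs once per subscription, so every retry gets a new request.
            WebRequest req = WebRequest.Create(url);
            return Observable.FromAsyncPattern<WebResponse>(
                req.BeginGetResponse, req.EndGetResponse)();
        }).Retry(attempts);
    }
}
```

Without Defer, Retry would resubscribe to the result of a request that has already failed and would see the same error again instead of triggering a new download.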

Thanks. I am very new to Rx, so how can I download, for example, 100 pages? Do I need to create 100 Observables and subscribe to each of them? - Neir0, May 21 '12 at 22:30
If you already have the full list of URLs, you can create an IObservable from the list using .ToObservable; take a look at http://rxwiki.wikidot.com/101samples#toc11 - Martin Ernst, May 22 '12 at 07:41
I used the same code but I am getting the error "Type inference failed in the call to 'SelectMany'" at from rsp in Observable.FromAsyncPattern( - Vipul, Jun 15 '12 at 18:20

score 1 · Answer 2 · edited May 22 '12 at 08:44

Add this at the beginning of your main method:

System.Net.ServicePointManager.DefaultConnectionLimit = 100;

That way you will not be limited to a tiny number of concurrent connections (the default for client applications is 2 per host).

edited May 22 '12 at 08:44 · Simon
answered May 21 '12 at 15:51 · Lakis
score 0 · Answer 3 · answered May 21 '12 at 15:50

This might help you when you use a lot of connections (add to app.config or web.config):

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <system.net>
    <connectionManagement>
      <add address="*" maxconnection="50"/>
    </connectionManagement>
  </system.net>
</configuration>

Replace 50 with the number of concurrent connections you want.

Read more about it at http://msdn.microsoft.com/en-us/library/fb6y0fyc.aspx
