8

My application requires that I download a large number of webpages into memory for further parsing and processing. What is the fastest way to do it? My current method (shown below) seems to be too slow and occasionally results in timeouts.

for (int i = 1; i<=pages; i++)
{
    string page_specific_link = baseurl + "&page=" + i.ToString();

    try
    {    
        WebClient client = new WebClient();
        var pagesource = client.DownloadString(page_specific_link);
        client.Dispose();
        sourcelist.Add(pagesource);
    }
    catch (Exception)
    {
    }
}
casperOne
paradox
  • 4
    You need a T1 connection – H H Sep 19 '11 at 17:41
  • 2
    Since many answers are suggesting parallel fetching, I want to warn you against sending too many concurrent requests; you may get banned if the web-site is not friendly. Also there will be a limit to how much each additional thread helps and beyond a point it will cause degradation. – Miserable Variable Sep 19 '11 at 17:56
  • @Hemal Pandya: A valid concern, but not *that* much of a concern; the `WebClient` class will ultimately use the `HttpWebRequest`/`HttpWebResponse` classes, which use the `ServicePointManager` class. The `ServicePointManager` by default will limit most downloads to two at a time for a particular domain (as per the recommendation in the HTTP 1.1 specification). – casperOne Sep 19 '11 at 18:29
  • @casperOne I didn't know about `ServicePointManager`, I was just comparing it to issuing a bunch of `wget ... &` on the command line. I didn't know about the HTTP 1.1 recommendation but it seems too few in this day and age. OP will probably want to override it IMHO. – Miserable Variable Sep 19 '11 at 18:38

7 Answers

6

The way you approach this problem is going to depend very much on how many pages you want to download, and how many sites you're referencing.

I'll use a good round number like 1,000. If you want to download that many pages from a single site, it's going to take a lot longer than if you want to download 1,000 pages that are spread out across dozens or hundreds of sites. The reason is that if you hit a single site with a whole bunch of concurrent requests, you'll probably end up getting blocked.

So you have to implement a type of "politeness policy" that issues a delay between multiple requests on a single site. The length of that delay depends on a number of things. If the site's robots.txt file has a crawl-delay entry, you should respect that. If they don't want you accessing more than one page per minute, then that's as fast as you should crawl. If there's no crawl-delay, you should base your delay on how long it takes the site to respond. For example, if you can download a page from the site in 500 milliseconds, you set your delay to X. If it takes a full second, set your delay to 2X. You can probably cap your delay at 60 seconds (unless crawl-delay is longer), and I would recommend that you set a minimum delay of 5 to 10 seconds.
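As a rough sketch of that delay calculation (the scale factor corresponds to the X above and is left up to you; parsing crawl-delay out of robots.txt is not shown here):

// Rough sketch of the delay rule described above.
// 'scale' corresponds to the X in the text; robotsCrawlDelay comes from
// parsing the site's robots.txt (not shown).
static TimeSpan ComputeDelay(TimeSpan lastResponseTime, TimeSpan? robotsCrawlDelay, double scale)
{
    if (robotsCrawlDelay.HasValue)
        return robotsCrawlDelay.Value;                    // always respect crawl-delay

    // Delay proportional to how long the site took to respond...
    var delay = TimeSpan.FromMilliseconds(lastResponseTime.TotalMilliseconds * scale);

    // ...clamped to the minimum/maximum suggested above.
    var min = TimeSpan.FromSeconds(10);
    var max = TimeSpan.FromSeconds(60);
    return delay < min ? min : (delay > max ? max : delay);
}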

I wouldn't recommend using Parallel.ForEach for this. My testing has shown that it doesn't do a good job. Sometimes it over-taxes the connection and often it doesn't allow enough concurrent connections. I would instead create a queue of WebClient instances and then write something like:

// Create queue of WebClient instances
BlockingCollection<WebClient> ClientQueue = new BlockingCollection<WebClient>();
// Initialize queue with some number of WebClient instances

// now process urls
foreach (var url in urls_to_download)
{
    var worker = ClientQueue.Take();
    worker.DownloadStringAsync(url, ...);
}

When you initialize the WebClient instances that go into the queue, set their DownloadStringCompleted event handlers to point to a completed event handler. That handler should save the string to a file (or perhaps you should just use DownloadFileAsync), and then add the client back to the ClientQueue.
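A minimal sketch of that initialization, assuming the url is passed along as the userToken so the handler knows where to save the result (MakeFileName is a hypothetical stand-in for whatever naming scheme you use, and File.WriteAllText is just one way to persist the page):

// Sketch: fill the queue with a handful of reusable WebClient instances
// whose completed handler saves the page and re-queues the client.
for (int i = 0; i < 10; i++)
{
    var client = new WebClient();
    client.DownloadStringCompleted += (sender, e) =>
    {
        var wc = (WebClient)sender;
        if (e.Error == null && !e.Cancelled)
        {
            var url = (string)e.UserState;                  // passed below as the userToken
            File.WriteAllText(MakeFileName(url), e.Result); // MakeFileName: your naming scheme
        }
        ClientQueue.Add(wc);                                // hand the client back for the next url
    };
    ClientQueue.Add(client);
}

// The download loop then passes the url along as the user token:
foreach (var url in urls_to_download)
{
    var worker = ClientQueue.Take();
    worker.DownloadStringAsync(new Uri(url), url);
}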

In my testing, I've been able to support 10 to 15 concurrent connections with this method. Any more than that and I run into problems with DNS resolution (`DownloadStringAsync` doesn't do the DNS resolution asynchronously). You can get more connections, but doing so is a lot of work.
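Note that, as the comments under the question point out, ServicePointManager caps concurrent connections to a single host at two by default, so to actually get 10 to 15 connections to one domain you would raise that limit; a one-line sketch (15 is just an example value):

// Allow more simultaneous connections per host than the HTTP 1.1 default of 2.
System.Net.ServicePointManager.DefaultConnectionLimit = 15;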

That's the approach I've taken in the past, and it's worked very well for downloading thousands of pages quickly. It's definitely not the approach I took with my high performance Web crawler, though.

I should also note that there is a huge difference in resource usage between these two blocks of code:

WebClient MyWebClient = new WebClient();
foreach (var url in urls_to_download)
{
    MyWebClient.DownloadString(url);
}

---------------

foreach (var url in urls_to_download)
{
    WebClient MyWebClient = new WebClient();
    MyWebClient.DownloadString(url);
}

The first allocates a single WebClient instance that is used for all requests. The second allocates one WebClient for each request. The difference is huge. WebClient uses a lot of system resources, and allocating thousands of them in a relatively short time is going to impact performance. Believe me ... I've run into this. You're better off allocating just 10 or 20 WebClients (as many as you need for concurrent processing), rather than allocating one per request.

Jim Mischel
  • I've read somewhere that manually resolving the DNS for the site and using it for DownloadStringAsync helps performance. Ever tried that, Jim? – paradox Sep 19 '11 at 17:41
  • @paradox: Yes, you resolve the DNS ahead of time so that it's likely to be in your machine's DNS resolver cache. I do something quite similar to that in my crawler, and I can get upwards of 100 connections per second by doing it. It's kind of a pain to do for a simple downloading application, though. Note, though, that for a single request, doing the DNS and then making the request isn't going to execute any faster than just issuing the request. Resolving the DNS ahead of time only makes things faster if you can do it while other pages are being downloaded. – Jim Mischel Sep 19 '11 at 18:03
  • what about the parallel foreach done this way ? https://stackoverflow.com/questions/46284818/parallel-request-to-scrape-multiple-pages-of-a-website – sofsntp Sep 26 '17 at 08:43
  • @sofsntp: it works, although the `Sleep` loop is dissatisfying. He's basically limiting the number of threads in much the same way I am. He's just using more code to do it. – Jim Mischel Sep 26 '17 at 12:58
  • @sofsntp: If you're having trouble, post a question, including a small sample application that illustrates the error. I can't really help you without seeing code. – Jim Mischel Sep 26 '17 at 17:26
  • For this I recommend using Fillmore's solution on throttling : https://joelfillmore.wordpress.com/2011/04/01/throttling-web-api-calls/#comment-204 – PinoyDev Jan 29 '18 at 18:47
4

Why not just use a web crawling framework? It can handle all of that stuff for you (multithreading, HTTP requests, parsing links, scheduling, politeness, etc.).

Abot (https://code.google.com/p/abot/) handles all that stuff for you and is written in C#.
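For a rough feel of what that looks like, here is a minimal sketch; the type, event, and property names below (PoliteWebCrawler, PageCrawlCompletedAsync, CrawledPage) are from memory of the Abot 1.x API, so check the project page for the exact signatures in your version:

using Abot.Crawler;   // from the Abot NuGet package (names assumed from the 1.x API)
using Abot.Poco;

var crawler = new PoliteWebCrawler();

crawler.PageCrawlCompletedAsync += (sender, e) =>
{
    CrawledPage page = e.CrawledPage;
    if (page.WebException == null)
    {
        // the downloaded content is available on the CrawledPage for parsing
        Console.WriteLine("Crawled: " + page.Uri);
    }
};

CrawlResult result = crawler.Crawl(new Uri("http://example.com/"));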

sjdirect
  • 2
    I've been using Abot for a few months now, and have found it highly extensible and very well written. It's also well managed, so there are pretty regular updates to the code base. You have the option to tweak how your crawler appears as a client, to respect robots, and inject your own handlers with the ability to extend the other built-in classes. – jamesbar2 Oct 08 '14 at 00:42
2

In addition to @David's perfectly valid answer, I want to add a slightly cleaner "version" of his approach.

var pages = new List<string> { "http://bing.com", "http://stackoverflow.com" };
var sources = new BlockingCollection<string>();

Parallel.ForEach(pages, x =>
{
    using(var client = new WebClient())
    {
        var pagesource = client.DownloadString(x);
        sources.Add(pagesource);
    }
});
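If you go this route, you may also want to cap the degree of parallelism so you don't hammer a single site (see the warnings in the comments under the question); a sketch, where the limit of 4 is arbitrary:

var options = new ParallelOptions { MaxDegreeOfParallelism = 4 }; // arbitrary cap

Parallel.ForEach(pages, options, x =>
{
    using (var client = new WebClient())
    {
        sources.Add(client.DownloadString(x));
    }
});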

Yet another approach that uses async:

static IEnumerable<string> GetSources(List<string> pages)
{
    var sources = new BlockingCollection<string>();
    var latch = new CountdownEvent(pages.Count);

    foreach (var p in pages)
    {
        var wc = new WebClient();

        wc.DownloadStringCompleted += (x, e) =>
        {
            sources.Add(e.Result);
            latch.Signal();

            // Dispose here, once the download has completed; a using block
            // around the client would dispose it before the async call finishes.
            wc.Dispose();
        };

        wc.DownloadStringAsync(new Uri(p));
    }

    latch.Wait();

    return sources;
}
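Usage would then be along the lines of:

var pages = new List<string> { "http://bing.com", "http://stackoverflow.com" };

foreach (var source in GetSources(pages))
{
    // each item is the raw source of one page
    Console.WriteLine(source.Length);
}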
ebb
1

You should use parallel programming for this purpose.

There are a lot of ways to achieve what you want; the easiest would be something like this:

var pageList = new List<string>();

for (int i = 1; i <= pages; i++)
{
    pageList.Add(baseurl + "&page=" + i.ToString());
}

// pageList is a list of urls
Parallel.ForEach<string>(pageList, (page) =>
{
    try
    {
        using (var client = new WebClient())
        {
            var pagesource = client.DownloadString(page);

            lock (sourcelist)
            {
                sourcelist.Add(pagesource);
            }
        }
    }
    catch (Exception) { }
});
apaderno
David
  • 1
    It's also wrong as it's writing to `sourcelist` without synchronizing access to it. There's a good chance that list is going to be corrupted as a result. – casperOne Sep 19 '11 at 16:56
  • `foreach` does not run in parallel even if you use `AsParallel`. you have to use `Parallel.ForEach`. – Daniel Sep 19 '11 at 17:05
  • If you are using the latest Parallel code you might want to use the Concurrent Collections too: http://msdn.microsoft.com/en-us/library/system.collections.concurrent.aspx instead lock()s – Ian Mercer Sep 19 '11 at 17:16
0

I am using an active threads count and an arbitrary limit:

private static volatile int activeThreads = 0;

public static void RecordData()
{
    var groupSize = 10;
    var source = db.ListOfUrls; // Thousands of urls
    var iterations = source.Length / groupSize;
    for (int i = 0; i < iterations; i++)
    {
        var subList = source.Skip(groupSize * i).Take(groupSize);
        Parallel.ForEach(subList, (item) => RecordUri(item));
        // I want to wait here before processing further data, to avoid overload
        while (activeThreads > 30) Thread.Sleep(100);
    }
}

private static async Task RecordUri(Uri uri)
{
    using (WebClient wc = new WebClient())
    {
        Interlocked.Increment(ref activeThreads);
        wc.DownloadStringCompleted += (sender, e) => Interlocked.Decrement(ref activeThreads);

        var jsonData = await wc.DownloadStringTaskAsync(uri);
        var root = JsonConvert.DeserializeObject<RootObject>(jsonData);
        RecordData(root);
    }
}
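A variant of the same idea that throttles with a SemaphoreSlim instead of a counter and a Sleep loop could look like this (a sketch; the limit of 10 is arbitrary, and RootObject/RecordData are the same as above):

private static readonly SemaphoreSlim throttle = new SemaphoreSlim(10); // at most 10 downloads in flight

public static async Task RecordAllAsync(IEnumerable<Uri> uris)
{
    var tasks = uris.Select(async uri =>
    {
        await throttle.WaitAsync();
        try
        {
            using (var wc = new WebClient())
            {
                var jsonData = await wc.DownloadStringTaskAsync(uri);
                var root = JsonConvert.DeserializeObject<RootObject>(jsonData);
                RecordData(root); // same processing as above
            }
        }
        finally
        {
            throttle.Release();
        }
    });

    await Task.WhenAll(tasks);
}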
sofsntp
0

I had a similar case, and this is how I solved it:

using System;
using System.Threading;
using System.Collections.Generic;
using System.Net;
using System.IO;

namespace WebClientApp
{
    class MainClassApp
    {
        private static int requests = 0;
        private static object requests_lock = new object();

        public static void Main()
        {
            List<string> urls = new List<string> { "http://www.google.com", "http://www.slashdot.org" };
            foreach (var url in urls)
            {
                ThreadPool.QueueUserWorkItem(GetUrl, url);
            }

            int cur_req = 0;

            while (cur_req < urls.Count)
            {
                lock (requests_lock)
                {
                    cur_req = requests;
                }

                Thread.Sleep(1000);
            }

            Console.WriteLine("Done");
        }

        private static void GetUrl(Object the_url)
        {
            string url = (string)the_url;

            using (WebClient client = new WebClient())
            using (Stream data = client.OpenRead(url))
            using (StreamReader reader = new StreamReader(data))
            {
                string html = reader.ReadToEnd();

                // Do something with html
                Console.WriteLine(html);

                lock (requests_lock)
                {
                    // Maybe you could add the HTML to SourceList here
                    requests++;
                }
            }
        }
    }
}

You should think about using parallelism because the slow speed comes from your software waiting for I/O: while one thread is waiting for I/O, another one can get started.

Rosmarine Popcorn
0

While the other answers are perfectly valid, all of them (at the time of this writing) are neglecting something very important: calls to the web are IO-bound; having a thread wait on an operation like this is going to strain system resources and have an impact on performance.

What you really want to do is take advantage of the async methods on the WebClient class (as some have pointed out) as well as the Task Parallel Library's ability to handle the Event-Based Asynchronous Pattern.

First, you would get the urls that you want to download:

IEnumerable<Uri> urls = Enumerable.Range(1, pages).Select(i => new Uri(
    baseurl + "&page=" + i.ToString(CultureInfo.InvariantCulture)));

Then, you would create a new WebClient instance for each url, using the TaskCompletionSource<T> class to handle the calls asynchronously (this won't burn a thread):

IEnumerable<Task<Tuple<Uri, string>>> tasks = urls.Select(url => {
    // Create the task completion source.
    var tcs = new TaskCompletionSource<Tuple<Uri, string>>();

    // The web client.
    var wc = new WebClient();

    // Attach to the DownloadStringCompleted event.
    wc.DownloadStringCompleted += (s, e) => {
        // Dispose of the client when done.
        using (wc)
        {
            // If there is an error, set it.
            if (e.Error != null) 
            {
                tcs.SetException(e.Error);
            }
            // Otherwise, set cancelled if cancelled.
            else if (e.Cancelled) 
            {
                tcs.SetCanceled();
            }
            else 
            {
                // Set the result.
                tcs.SetResult(new Tuple<Uri, string>(url, e.Result));
            }
        }
    };

    // Start the process asynchronously, don't burn a thread.
    wc.DownloadStringAsync(url);

    // Return the task.
    return tcs.Task;
});

Now you have an IEnumerable<T> which you can convert to an array and wait on all of the results using Task.WaitAll:

// Materialize the tasks.
Task<Tuple<Uri, string>>[] materializedTasks = tasks.ToArray();

// Wait for all to complete.
Task.WaitAll(materializedTasks);

Then, you can just use the Result property on the Task<T> instances to get the pair of the url and the content:

// Cycle through each of the results.
foreach (Tuple<Uri, string> pair in materializedTasks.Select(t => t.Result))
{
    // pair.Item1 will contain the Uri.
    // pair.Item2 will contain the content.
}

Note that the above code has the caveat of having no error handling.

If you wanted to get even more throughput, instead of waiting for the entire list to be finished, you could process the content of a single page after it's done downloading; Task<T> is meant to be used like a pipeline: when you've completed your unit of work, have it continue to the next one instead of waiting for all of the items to be done (if they can be done in an asynchronous manner).
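For example, instead of the Task.WaitAll above, you could hang a continuation off each download so parsing starts as soon as each page arrives (a sketch; ProcessPage is a hypothetical stand-in for whatever parsing you do):

// Attach a continuation to each download so the parsing of a page starts as
// soon as that page has arrived, instead of waiting for the whole batch.
Task[] pipeline = tasks
    .Select(t => t.ContinueWith(completed =>
    {
        // completed.Result is the (Uri, content) pair produced above.
        Tuple<Uri, string> pair = completed.Result;
        ProcessPage(pair.Item1, pair.Item2);   // hypothetical parsing step
    }))
    .ToArray();

// Wait for the downloads *and* the per-page processing to finish.
Task.WaitAll(pipeline);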

casperOne
  • Passing along a (rejected) suggested edit: *DownloadStringAsync dosen't take an overload for "string" - only for "Uri".* – user7116 Sep 19 '11 at 21:16
  • @sixlettervariables: Thanks for the suggestion; modified it to use `Uri` the entire way through. – casperOne Sep 19 '11 at 21:49
  • This looks like pseduocode. You are missing `>` in several places. Ex: here => `IEnumerable> tasks` The code won't compile and certain types are wrong. – Shiva Apr 10 '17 at 02:02
  • @Shiva Feel free to edit to correct it. Also, eyeballing, that's the only place I see a mismanaged set of angle brackets. – casperOne Apr 10 '17 at 18:51