
I am developing an app where I need to download a bunch of web pages, preferably as fast as possible. The way I do that right now is to run hundreds of threads, each with its own System.Net.HttpWebRequest. This sort of works, but I am not getting the performance I would like. I currently have a beefy 600+ Mb/s connection to work with, and it is utilized at most 10%, even at peaks. I guess my strategy is flawed, but I am unable to find any other good way of doing this.

Also: if the use of HttpWebRequest is not a good way to download web pages, please say so :) The code has been semi-auto-converted from Java.

Thanks :)

Update:

public String getPage(String link){
    myURL = new System.Uri(link);
    myHttpConn = (System.Net.HttpWebRequest)System.Net.WebRequest.Create(myURL);
    myStreamReader = new System.IO.StreamReader(
        new System.IO.StreamReader(myHttpConn.GetResponse().GetResponseStream(),
            System.Text.Encoding.Default).BaseStream,
        new System.IO.StreamReader(myHttpConn.GetResponse().GetResponseStream(),
            System.Text.Encoding.Default).CurrentEncoding);

    System.Text.StringBuilder buffer = new System.Text.StringBuilder();

    // myLineBuff is a String field
    while ((myLineBuff = myStreamReader.ReadLine()) != null)
    {
        buffer.Append(myLineBuff);
    }
    return buffer.ToString();
}
Automatico
  • Give us a description of your current strategy. Perhaps even with code ;) – Stormenet May 19 '11 at 16:55
  • Hundreds of threads is rarely good – Dyppl May 19 '11 at 16:56
  • Using 100's of threads probably won't help, as no PC I've ever heard of has that many logical cores. You should create a number of threads equal to the number of logical cores on the PC, and bump up their priority. Also, I wonder how much overhead you're spending on making a new System.Net.HttpWebRequest for each one? Can these not be reused somehow? How are you storing these pages? – MAW74656 May 19 '11 at 16:59
  • Have you tried utilizing gzip compression / deflate and webpage caching? Link about gzip / deflate: http://stackoverflow.com/questions/1574168/deflate-compression-browser-compatibility-and-advantages-over-gzip Link about caching (lots of examples on Google): http://www.dotnetperls.com/cache-aspnet – Brian Dishaw May 19 '11 at 17:00
  • I am aware that 100's of threads usually isn't good, but it was the only way I could imagine that would let me download several pages at once, or more specifically, wait for server responses from several servers at once (since this is what actually consumes the most time). @Brian: gzip, as I understand, is something that has to be done at the host of the pages. I have no control over the hosts, or am I mistaken? – Automatico May 19 '11 at 17:11
  • The download speed depends on the limits of both ends and what's between, and also on how busy the other end is (how many other people it's also serving). – MRAB May 19 '11 at 17:13
  • @MRAB: I am aware. I am downloading from several servers, so this is not the bottleneck as I see it. – Automatico May 19 '11 at 17:16
  • My apologies, I misread your request. You are correct, it appears that you don't have control over the hosts. – Brian Dishaw May 19 '11 at 17:17
  • As to the number of threads, I wouldn't necessarily limit it to the number of logical cores. Sometimes there's a significant delay between the request and the response, and having several pending requests can help, especially if each download takes less time than the delay, but 100s of them is way too many. – MRAB May 19 '11 at 17:19
  • @Cort2z Your comment regarding several servers, but hundreds of downloads brings up a concern: Many HTTP servers limit the number of simultaneous connections to a given client. You may be running into a problem where the remote servers simply don't allow you more than x simultaneous downloads. – JYelton May 19 '11 at 17:28
  • @JYelton: Fascinating. I hadn't considered this. But still, I do believe that the number of servers I contact is greater than the number of threads I have running. But indeed, I should check how many of my threads are contacting the same server at once. This might be limiting me to some degree. – Automatico May 19 '11 at 17:34
  • Could you prioritize the slower connections based on link quality? For instance let those that are going to finish, finish quickly. Then have a housekeeping thread that looks at longer running threads and pushes them back to give those that haven't run yet a chance to run and finish. – Todd Richardson May 19 '11 at 17:48
  • @fauxtrot: I guess I could do that. But essentially this is done with the timeout ability in the HttpWebRequest. It will kill off the web pages that take too long. – Automatico May 19 '11 at 17:53
  • @Brian @Cort3z If the servers support gzip, does a default instance of webrequest include headers to indicate acceptance of gzip content? gzipping has to be accepted by the client before a standards compliant server will serve it. – Todd Richardson May 19 '11 at 17:53
  • @fauxtrot: [GZip in HttpWebRequest](http://www.west-wind.com/weblog/posts/2007/Jun/29/HttpWebRequest-and-GZip-Http-Responses) – Automatico May 19 '11 at 18:00
  • Try setting `myHttpConn.Proxy = null;`, as @JYelton suggests, before `myHttpConn.GetResponse()`. – Dour High Arch May 19 '11 at 20:15
  • @Dour High Arch: This does have a profound effect on my software. The effect is very strange. It starts out A LOT faster (2x), but then tapers off quickly to less than the usual performance. I will test this further. – Automatico May 19 '11 at 22:15
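
Following up on the gzip/deflate discussion in the comments above, here is a small sketch of how the client side can opt in with HttpWebRequest. To my understanding, setting AutomaticDecompression both sends the Accept-Encoding header and transparently decompresses the response; the URL below is just a placeholder.

// Sketch: ask servers for compressed responses and decompress them automatically.
var request = (System.Net.HttpWebRequest)System.Net.WebRequest.Create("http://example.com/");
request.AutomaticDecompression = System.Net.DecompressionMethods.GZip
                               | System.Net.DecompressionMethods.Deflate;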

3 Answers


One problem is that it appears you're issuing each request twice:

myStreamReader = new System.IO.StreamReader(
    new System.IO.StreamReader(
        myHttpConn.GetResponse().GetResponseStream(),
        System.Text.Encoding.Default).BaseStream,
    new System.IO.StreamReader(
        myHttpConn.GetResponse().GetResponseStream(),
        System.Text.Encoding.Default).CurrentEncoding);

It makes two calls to GetResponse. For reasons I fail to understand, you're also creating two stream readers. You can split that up and simplify it, and also do a better job of error handling...

var response = (HttpWebResponse)myHttpConn.GetResponse();
myStreamReader = new StreamReader(response.GetResponseStream(), Encoding.Default);

That should double your effective throughput.

Also, you probably want to make sure to dispose of the objects you're using. When you're downloading a lot of pages, you can quickly run out of resources if you don't clean up after yourself. In this case, you should call response.Close(). See http://msdn.microsoft.com/en-us/library/system.net.httpwebresponse.close.aspx
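
Putting both points together, here is a minimal sketch (assuming the usual System, System.Net, System.IO, and System.Text usings; the method and variable names are illustrative, not from the original code) that issues a single request and disposes everything via using blocks:

// Sketch only: one GetResponse call, deterministic cleanup via using blocks.
public string GetPage(string link)
{
    var request = (HttpWebRequest)WebRequest.Create(new Uri(link));
    using (var response = (HttpWebResponse)request.GetResponse())
    using (var stream = response.GetResponseStream())
    using (var reader = new StreamReader(stream, Encoding.Default))
    {
        return reader.ReadToEnd();
    }
}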

Jim Mischel
  • 131,090
  • 20
  • 188
  • 351
  • Yes! Thank you! That improved my code quite a bit! It's not quite a 2x improvement, but it is a +50% boost! thanks a lot :) I do, by the way, close the connection after I have used it:) (forgot to put it in here) – Automatico May 28 '11 at 22:33

I am adding this answer as another possibility which people may encounter when

  • downloading from multiple servers using multi-threaded apps
  • using Windows XP or Vista as the operating system

The tcpip.sys driver for these operating systems has a limit of 10 outbound connections per second. This is a rate limit, not a connection limit, so you can have hundreds of connections, but you cannot initiate more than 10/s. The limit was imposed by Microsoft to curtail the spread of certain types of virus/worm. Whether such methods are effective is outside the scope of this answer.

In a multi-threaded application that downloads from multitudes of servers, this limitation can manifest as a series of timeouts. Once the 10/s limit is reached, Windows queues all of the "half-open" (newly opened but not yet established) connections. In my application, for example, I had 20 threads ready to process connections, but I found that sometimes I would get timeouts from servers I knew were operating and reachable.

To verify that this is happening, check the operating system's event log, under System. The error is:

EventID 4226: TCP/IP has reached the security limit imposed on the number of concurrent TCP connect attempts.

There are many references to this error and plenty of patches and fixes to apply to remove the limit. However, because this problem is frequently encountered by P2P (Torrent) users, there is a great deal of malware disguised as this patch.

I have a requirement to collect data from over 1200 servers (that are actually data sensors) at 5-minute intervals. I initially developed the application (on WinXP) to reuse 20 threads repeatedly to crawl the list of servers and aggregate the data into a SQL database. Because the connections were initiated on a timer tick event, this error happened often: at the moment of invocation none of the connections were established yet, so the requests beyond the first 10 were immediately queued.

Note that this isn't necessarily a problem, because as connections are established, those queued are then processed. However, if non-queued connections are slow to establish, that time can negatively impact the timeout limits of the queued connections (in my experience). The result, looking at my application log file, was that I would see a batch of connections that timed out, followed by a majority of connections that were successful. Opening a web browser to test "timed out" connections was confusing, because the servers were available and quick to respond.

I decided to try hex-editing the tcpip.sys file, as suggested in a guide at speedguide.net. The checksum of my file differed from the guide's (I had SP3, not SP2) and the comments in the guide weren't necessarily helpful. However, I did find a patch that worked for SP3 and noticed an immediate difference after applying it.

From what I can find, Windows 7 does not have this limitation, and since moving the application to a Windows 7-based machine, the timeout problem has remained absent.

JYelton

I do this very same thing, but with thousands of sensors that provide XML and text content. Factors that will definitely affect performance are not limited to the speed and power of your bandwidth and computer; they also include the bandwidth and response time of each server you are contacting, the timeout delays, the size of each download, and the reliability of the remote internet connections.

As comments indicate, hundreds of threads is not necessarily a good idea. Currently I've found that running between 20 and 50 threads at a time seems optimal. In my technique, as each thread completes a download, it is given the next item from a queue.

I run a custom ThreaderEngine Class on a separate thread that is responsible for maintaining the queue of work items and assigning threads as needed. Essentially it is a while loop that iterates through an array of threads. As the threads finish, it grabs the next item from the queue and starts the thread again.
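
(The ThreaderEngine class itself isn't shown here; the following is only a rough sketch of the same idea, a fixed set of worker threads draining a shared queue, using BlockingCollection from .NET 4.0.)

// Rough sketch only: fixed worker threads consuming a shared queue of URLs.
var workItems = new BlockingCollection<string>(); // System.Collections.Concurrent
// ... enqueue URLs with workItems.Add(url), then call workItems.CompleteAdding() ...

const int workerCount = 20; // 20-50 worked well in my case
var workers = new Thread[workerCount];
for (int i = 0; i < workerCount; i++)
{
    workers[i] = new Thread(() =>
    {
        foreach (string url in workItems.GetConsumingEnumerable())
        {
            // download the page and store the result
        }
    });
    workers[i].IsBackground = true;
    workers[i].Start();
}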

Each of my threads is actually downloading several separate items, but the method call is the same (.NET 4.0):

public static string FileDownload(string _ip, int _port, string _file, int Timeout, int ReadWriteTimeout, NetworkCredential _cred = null)
{
    string uri = String.Format("http://{0}:{1}/{2}", _ip, _port, _file);
    string Data = String.Empty;
    try
    {
        HttpWebRequest Request = (HttpWebRequest)WebRequest.Create(uri);
        if (_cred != null) Request.Credentials = _cred;
        Request.Timeout = Timeout; // applies to .GetResponse()
        Request.ReadWriteTimeout = ReadWriteTimeout; // applies to .GetResponseStream()
        Request.Proxy = null;
        Request.CachePolicy = new System.Net.Cache.RequestCachePolicy(System.Net.Cache.RequestCacheLevel.NoCacheNoStore);
        using (HttpWebResponse Response = (HttpWebResponse)Request.GetResponse())
        {
            using (Stream dataStream = Response.GetResponseStream())
            {
                if (dataStream != null)
                    using (BufferedStream buffer = new BufferedStream(dataStream))
                    using (StreamReader reader = new StreamReader(buffer))
                    {
                        Data = reader.ReadToEnd();
                    }
            }
            return Data;
        }
    }
    catch (AccessViolationException ave)
    {
        // ... (error handling elided in the original answer)
    }
    catch (Exception exc)
    {
        // ... (error handling elided in the original answer)
    }
    return Data; // return whatever was read (String.Empty if an exception occurred)
}
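
A hypothetical call (the address, file name, and timeout values below are made up for illustration) would look something like:

// Hypothetical usage; substitute your own sensor address, file, and timeouts.
string xml = FileDownload("192.0.2.10", 80, "status.xml", 5000, 5000);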

Using this I am able to download about 60KB each from 1200+ remote machines (72MB) in less than 5 minutes. The machine is a Core 2 Quad with 2GB RAM and utilizes four bonded T1 connections (~6Mbps).

JYelton
  • Well, this is actually a very good description of what I do :p I too use another thread to assign work and such, but I am really disappointed with the performance of the download. Currently I am getting about 30 pages/second. This number should, and could, be a lot higher, possibly 10-fold, compared to the internet speed I have. I barely use any CPU at all (27% peak on a 6-core machine). – Automatico May 19 '11 at 17:32
  • @Cort3z See my comment on your question - I am wondering if the remote servers are limiting the number of simultaneous connections you can establish. – JYelton May 19 '11 at 17:32