21

I'm trying to implement a limited web crawler in C# (for a few hundred sites only) using HttpWebRequest.GetResponse() and StreamReader.ReadToEnd(). I also tried using StreamReader.Read() in a loop to build my HTML string.

I'm only downloading pages which are about 5-10K.

It's all very slow! For example, the average GetResponse() time is about half a second, while the average StreamReader.ReadToEnd() time is about 5 seconds!

All the sites should be very fast, as they are very close to my location and have fast servers (in Internet Explorer the pages take practically no time to download), and I am not using any proxy.

My crawler has about 20 threads reading simultaneously from the same site. Could this be causing a problem?

How do I reduce StreamReader.ReadToEnd times DRASTICALLY?

John Saunders
Roey

9 Answers

16

HttpWebRequest may be taking a while to detect your proxy settings. Try adding this to your application config:

<system.net>
  <defaultProxy enabled="false">
    <proxy/>
    <bypasslist/>
    <module/>
  </defaultProxy>
</system.net>

You might also see a slight performance gain from buffering your reads to reduce the number of calls made to the underlying operating system socket:

using (BufferedStream buffer = new BufferedStream(stream))
{
  using (StreamReader reader = new StreamReader(buffer))
  {
    pageContent = reader.ReadToEnd();
  }
}
kgriffs
8

WebClient's DownloadString is a simple wrapper for HttpWebRequest. Could you try using that temporarily and see if the speed improves? If things get much faster, could you share your code so we can have a look at what may be wrong with it?

EDIT:

It seems HttpWebRequest observes IE's 'max concurrent connections' setting. Are these URLs on the same domain? You could try increasing the connection limit to see if that helps. I found this article about the problem:

By default, you can't perform more than 2-3 async HttpWebRequest (depends on the OS). In order to override it (the easiest way, IMHO) don't forget to add this under the <configuration> section in the application's config file:

<system.net>
  <connectionManagement>
     <add address="*" maxconnection="65000" />
  </connectionManagement>
</system.net>
Matt Brindley
  • Tried using WebClient, same results (average times have not changed). I should also mention that I have a 1.5MBPS connection with an average d/l speed of 180KBPS I was thinking that maybe 20 threads all calling StreamReader.Read at the same time could have something to do with it? Or is this irrelevant? – Roey May 23 '09 at 11:59
  • In my experience, on a connection like that you will saturate the bandwidth with 3-4 threads. No need to run more unless the websites you are pinging are really slow and you have threads sleeping a lot, waiting on I/O. – kgriffs Dec 24 '09 at 17:02
  • 1
    wow!!! I was using async HttpWebRequest to load test server with about 300 threads per client and each thread was downloading "serially". changing maxconnection setting made each thread download data 10x faster. – vivek.m Jun 25 '12 at 09:25
4

I had the same problem, but setting the HttpWebRequest's Proxy property to null solved it.

UriBuilder ub = new UriBuilder(url);
HttpWebRequest request = (HttpWebRequest)WebRequest.Create( ub.Uri );
request.Proxy = null;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
bisand
1

Have you tried ServicePointManager.DefaultConnectionLimit? I usually set it to 200 for things similar to this.
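
As a minimal sketch of that suggestion, assuming the setting meant here is the static ServicePointManager.DefaultConnectionLimit property (on .NET Framework client apps it typically defaults to 2 connections per host). The class and method names below are made up for illustration; the value 200 is the one mentioned above.

```csharp
using System;
using System.Net;

// Hypothetical helper, not part of any library.
static class ConnectionLimitSetup
{
    // Call once at startup, before the first request is created,
    // so that new connections pick up the raised limit.
    public static void Apply(int limit)
    {
        // Raises the per-host cap on concurrent HTTP connections.
        ServicePointManager.DefaultConnectionLimit = limit;
    }
}
```

Usage would be a single call such as ConnectionLimitSetup.Apply(200); early in Main. This is the code equivalent of the maxconnection entry in the config file, which may be handy when editing the config is not an option.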

No Refunds No Returns
1

I had the same problem, but worse. In my code, response = (HttpWebResponse)webRequest.GetResponse(); stalled for about 10 seconds before execution continued, and only after that did the download saturate my connection.

kurt's answer (defaultProxy enabled="false") solved the problem. Now the response is almost instant, and I can download any HTTP file at my connection's maximum speed :)

vt2
1

I found the Application Config method did not work, but the problem was still due to the proxy settings. My simple request used to take up to 30 seconds, now it takes 1.

public string GetWebData()
{
    string destAddr = "http://mydestination.com";
    using (System.Net.WebClient myWebClient = new System.Net.WebClient())
    {
        // An empty WebProxy has no address, so requests go straight to the
        // server and automatic proxy detection is skipped. Calling
        // IsBypassed() is not needed: it only returns a bool.
        myWebClient.Proxy = new System.Net.WebProxy();
        return myWebClient.DownloadString(destAddr);
    }
}
thunder
0

Why wouldn't multithreading solve this issue? Multithreading would minimize the network wait times, and since you'd be storing the contents of the buffer in system memory (RAM), there would be no IO bottleneck from dealing with a filesystem. Thus, your 82 pages that take 82 seconds to download and parse, should take like 15 seconds (assuming a 4x processor). Correct me if I'm missing something.

____ DOWNLOAD THREAD_____*

Download Contents

Form Stream

Read Contents

_________________________*
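
The download thread outlined above could be sketched roughly as follows. This is only an illustration under stated assumptions: BoundedCrawler, Download and FetchAll are hypothetical names, the fetch delegate is injected so the download step can be swapped out, and a SemaphoreSlim caps how many downloads run at once (the comments in this thread suggest a handful of threads is enough to saturate a slow connection).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Threading;

// Hypothetical sketch, not the asker's crawler.
static class BoundedCrawler
{
    // One download: request the page, form a stream, read the contents.
    public static string Download(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Proxy = null; // skip proxy detection, as other answers suggest
        using (WebResponse response = request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream))
        {
            return reader.ReadToEnd();
        }
    }

    // Runs fetch for every URL with at most maxParallel downloads in flight.
    public static Dictionary<string, string> FetchAll(
        IEnumerable<string> urls, Func<string, string> fetch, int maxParallel)
    {
        var results = new Dictionary<string, string>();
        var gate = new SemaphoreSlim(maxParallel);
        var workers = new List<Thread>();
        foreach (string url in urls)
        {
            string current = url;
            gate.Wait(); // blocks while maxParallel downloads are running
            var worker = new Thread(() =>
            {
                try
                {
                    string html = fetch(current);
                    lock (results) { results[current] = html; }
                }
                catch (WebException) { /* skip pages that fail to download */ }
                finally { gate.Release(); }
            });
            worker.Start();
            workers.Add(worker);
        }
        foreach (Thread started in workers) started.Join();
        return results;
    }
}
```

A call might look like BoundedCrawler.FetchAll(urls, BoundedCrawler.Download, 4). Whether this actually speeds things up depends on the connection limit and proxy settings discussed in the other answers, since extra threads just queue behind a low per-host connection cap.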

Pangamma
0

Try adding the cookie AspxAutoDetectCookieSupport=1 to your request, like this:

request.CookieContainer = new CookieContainer();         
request.CookieContainer.Add(new Cookie("AspxAutoDetectCookieSupport", "1") { Domain = target.Host });
ashkufaraz
0

Thank you all for the answers; they helped me dig in the right direction. I faced the same performance issue, but the proposed solution of changing the application config file (as I understand it, that solution is for web applications) doesn't fit my needs. My solution is shown below:

HttpWebRequest webRequest;

webRequest = (HttpWebRequest)System.Net.WebRequest.Create(fullUrl);
webRequest.Method = WebRequestMethods.Http.Post;

if (useDefaultProxy)
{
    webRequest.Proxy = System.Net.WebRequest.DefaultWebProxy;
    webRequest.Credentials = CredentialCache.DefaultCredentials;
}
else
{
    System.Net.WebRequest.DefaultWebProxy = null;
    webRequest.Proxy = System.Net.WebRequest.DefaultWebProxy;
}
Yuriy