
I'm making a Web Crawler and I just found out that one of my methods, GetHTML, is very slow because it uses a StreamReader to get a string of the HTML out of the HttpWebResponse object.

Here is the method:

static string GetHTML(string URL)
{
    HttpWebRequest Request = (HttpWebRequest)WebRequest.Create(URL);
    Request.Proxy = null;
    HttpWebResponse Response = (HttpWebResponse)Request.GetResponse();
    Stream RespStream = Response.GetResponseStream();
    return new StreamReader(RespStream).ReadToEnd(); // Very slow
}

I made a test with Stopwatch and used this method on YouTube.

Time it takes to get an HTTP response: 500 ms

Time it takes to convert the HttpWebResponse object to a string: 550 ms

So the HTTP request is fine, it's just the ReadToEnd() that is so slow.
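
Roughly, the test looked like this (a simplified sketch; the split between the two timings is what matters, not the exact numbers):

// Simplified Stopwatch test; youtube.com is just the site I happened to test on.
Stopwatch sw = Stopwatch.StartNew();
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.youtube.com");
request.Proxy = null;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
Console.WriteLine("GetResponse: {0} ms", sw.ElapsedMilliseconds); // ~500 ms

sw.Restart();
string html = new StreamReader(response.GetResponseStream()).ReadToEnd();
Console.WriteLine("ReadToEnd: {0} ms", sw.ElapsedMilliseconds); // ~550 ms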

Is there any alternative to the ReadToEnd() method to get an HTML string from the response object? I tried using the WebClient.DownloadString() method, but it's just a wrapper around HttpWebRequest that uses streams too.

EDIT: Tried it with Sockets and it's much faster:

static string SocketHTML(string URL)
{
    // Note: this expects a bare host name (e.g. "www.youtube.com"),
    // not a full URL; Dns.GetHostAddresses does not parse URLs.
    string IP = Dns.GetHostAddresses(URL)[0].ToString();
    Socket s = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
    s.Connect(new IPEndPoint(IPAddress.Parse(IP), 80));
    s.Send(Encoding.ASCII.GetBytes("GET / HTTP/1.1\r\n\r\n"));
    List<byte> HTML = new List<byte>();
    int Bytes = 1;
    while (Bytes > 0)
    {
        byte[] Data = new byte[1024];
        Bytes = s.Receive(Data);
        // Copy only the bytes actually received, not the whole 1024-byte buffer.
        for (int i = 0; i < Bytes; i++) HTML.Add(Data[i]);
    }
    s.Close();
    return Encoding.ASCII.GetString(HTML.ToArray());
}

The problem with using Sockets, though, is that most of the time it returns errors such as "Moved Permanently" or "Your browser sent a request that the server could not understand".
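
Presumably the bare request line is the cause of those errors: HTTP/1.1 requires a Host header, so a request closer to what a browser sends would look something like this (a sketch; the host value is just an example):

// A better-formed HTTP/1.1 request. Host is mandatory in HTTP/1.1, and
// Connection: close asks the server to close the socket when it's done,
// so the Receive loop ends instead of waiting for a server-side timeout.
string host = "www.youtube.com"; // example host name
string request = "GET / HTTP/1.1\r\n" +
                 "Host: " + host + "\r\n" +
                 "Connection: close\r\n" +
                 "\r\n";
s.Send(Encoding.ASCII.GetBytes(request));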

BlueRay101
  • What are you comparing here? Returning an empty string against a real call to a remote site? – Steve Feb 03 '15 at 10:14
  • I made this comparison to see if the StreamReader.ReadToEnd() is the bottleneck, and I've seen it is. When I receive the response and I don't use the ReadToEnd() method, it takes about 500 ms for GetHTML(string URL) to return, but if I use the ReadToEnd() method it takes 1000 ms. In this case (when I tested on youtube.com), the ReadToEnd() method takes 500 ms to complete - this is very slow. The request itself is fine and is sent OK, but the conversion to a string is very slow. – BlueRay101 Feb 03 '15 at 10:16

3 Answers

5

When I call this method but return String.Empty instead of the ReadToEnd, the method takes about 500 ms.

All that says is that starting to get the response takes 500 ms. Calling GetResponseStream doesn't consume all the data.

ReadToEnd will also be doing conversion from the binary data to text, but I doubt that's significant - I strongly suspect it's just waiting for the data to arrive over the network. To verify that, you should add logging to every aspect of your code and run Wireshark - you should then be able to see packet-by-packet when the data arrives, and correlate it with the logging.
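
For example, something along these lines would show, chunk by chunk, when the data actually arrives (a rough sketch; url is whatever page you're fetching):

var request = (HttpWebRequest)WebRequest.Create(url);
Stopwatch sw = Stopwatch.StartNew();
using (var response = (HttpWebResponse)request.GetResponse())
using (Stream stream = response.GetResponseStream())
{
    Console.WriteLine("{0} ms: headers received", sw.ElapsedMilliseconds);
    byte[] buffer = new byte[8192];
    int read;
    while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
    {
        // Long gaps between these lines are network waits, not parsing cost.
        Console.WriteLine("{0} ms: received {1} bytes", sw.ElapsedMilliseconds, read);
    }
}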

As a side issue, you should definitely have a using statement for the response:

using (var response = (HttpWebResponse) Request.GetResponse())
{
    // The stream will be disposed when the response is.
    return new StreamReader(response.GetResponseStream())
        .ReadToEnd();
}

If you don't dispose of the response, you'll tie up connections until the garbage collector finalizes them. That can lead to timeouts.

Jon Skeet
  • Thank you for the answer. Would it be faster to use TCP sockets and send directly the "GET / HTTP/1.1\r\n\r\n" data instead? – BlueRay101 Feb 03 '15 at 10:22
  • @BlueRay010: I very much doubt it - not if the time is just being taken by getting the data from server to client. How would using TCP yourself help that? (It would just be the same as what the HttpRequest is doing.) – Jon Skeet Feb 03 '15 at 10:27
  • @BlueRay010: In fact, it would be *worse* to use sockets yourself, as you then wouldn't get connection pooling. However, you've currently got a bug as you're not disposing of the response - see my edit. – Jon Skeet Feb 03 '15 at 10:29
  • You're right, I've seen the ``using`` statement before with HttpWebRequest but I didn't think it was necessary. I added it now and it didn't make it faster, but it'll surely prevent the code from being slower after thousands of requests. – BlueRay101 Feb 03 '15 at 10:35
  • So is HttpWebRequest the fastest way to get the HTML code of a certain URL? – BlueRay101 Feb 03 '15 at 10:35
  • @BlueRay010: You've misread my comment - you should probably do it for the request as well, but you *definitely* need it for the response. As for whether it's the fastest way - you should work out exactly where the bottleneck is. If it's genuinely in the network, then I wouldn't expect to get it any faster any other way. If you find that there's a fair amount of CPU being used converting it to text, then it depends on what you're doing afterwards - dumping it straight to a file in original binary form may be slightly quicker. Consider `WebClient` too for even simpler code. – Jon Skeet Feb 03 '15 at 10:38
  • I avoid using ``WebClient`` since it doesn't have a Timeout property, which I need. In any case, the time the request takes is the same so I think I should stay with my current code for now. Thanks! – BlueRay101 Feb 03 '15 at 10:39
  • @BlueRay010: Timeout can be added to WebClient fairly easily. See http://stackoverflow.com/a/6994391/18192. – Brian Feb 03 '15 at 14:29
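
(For reference, the trick in that linked answer is, roughly, to subclass WebClient and set a timeout on the requests it creates; a sketch, with TimeoutWebClient being an illustrative name:)

// A WebClient whose requests use a configurable timeout.
public class TimeoutWebClient : WebClient
{
    public int TimeoutMilliseconds { get; set; }

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        request.Timeout = TimeoutMilliseconds; // applies to the whole request
        return request;
    }
}
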
2

I made this comparison to see if the StreamReader.ReadToEnd() is the bottleneck, and I've seen it is.

You jumped to a wrong conclusion here: the bottleneck is the whole method, not just its StreamReader.ReadToEnd() portion.

When I receive the response and I don't use the ReadToEnd() method, it takes about 500 ms, but if I use the ReadToEnd() method it takes 1000 ms.

That's the thing: being able to call Response.GetResponseStream() does not mean that you "got a response". All you get is a confirmation that the response is there.

In the real world this would be similar to receiving a parcel for which you must sign at the post office. The post office will put a postcard into your mailbox saying that there is a delivery waiting for you at the post office. That's your Response.GetResponseStream() call. But at this point you do not have your parcel, only a postcard that says the parcel is there. Now you need to go to the post office, show them the card, and retrieve the parcel. That's the StreamReader.ReadToEnd() call.

The time nearly doubles because most of the 1000 ms is spent communicating with the remote server. If you need the entire response, there is little you can do to speed this up. The good news is that since the time is spent in I/O, there is a good chance that you can parallelize this code to retrieve data from multiple web sites (assuming you do not load your network to capacity).
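
A minimal sketch of that kind of parallel retrieval (assuming .NET 4.5's async support; GetAllHtmlAsync and the URL list are illustrative, not part of the original code):

// Download several pages concurrently; the network waits overlap instead of adding up.
// Requires System.Collections.Generic, System.Linq, System.Net, System.Threading.Tasks.
static async Task<string[]> GetAllHtmlAsync(IEnumerable<string> urls)
{
    IEnumerable<Task<string>> downloads = urls.Select(async url =>
    {
        using (WebClient client = new WebClient())
        {
            return await client.DownloadStringTaskAsync(url);
        }
    });
    return await Task.WhenAll(downloads);
}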

Sergey Kalinichenko
  • Oh, I see. I don't know Input/Output and Streams so well, so I thought this was used only for conversion. I already have multi-threaded code, I just thought it would be nice if I could make it slightly faster. Thank you! – BlueRay101 Feb 03 '15 at 10:41
1

It's not the ReadToEnd method that is slow, it's waiting for the data that takes time.

The ReadToEnd method is fast enough. I just tested reading a megabyte of data from a memory stream using a stream reader, and it takes only 3 ms.
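
That micro-test is easy to reproduce with something like this (a sketch; exact numbers will vary by machine):

// Time reading 1 MB of in-memory data through a StreamReader.
byte[] data = Encoding.ASCII.GetBytes(new string('a', 1024 * 1024));
Stopwatch sw = Stopwatch.StartNew();
string text = new StreamReader(new MemoryStream(data)).ReadToEnd();
Console.WriteLine("ReadToEnd from memory: {0} ms", sw.ElapsedMilliseconds); // single-digit ms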

When you get the response stream from the request, it has only started to get the data that was requested. Once you have read the data already received, it has to wait for the rest of the data to arrive. That's what takes time in the ReadToEnd call. Using any other means of reading the stream won't make it faster.

Guffa
  • OK, Thank you. Isn't there any way to get the HTML code from a website faster? This is generally very slow, and even in small websites it takes hundreds of milliseconds for a single request. – BlueRay101 Feb 03 '15 at 10:21
  • @BlueRay010 That hundred milliseconds is probably the latency between you and the server. – Sriram Sakthivel Feb 03 '15 at 10:23
  • Actually, I tried to send a standard HTTP GET request with a TCP socket, and found out that it takes significantly less time (about 25% of the time of the HttpWebRequest class). I know about latency, but why is a socket faster? I don't want to use Sockets, though, because I sometimes get a response such as "Moved permanently" and don't want to deal with it. – BlueRay101 Feb 03 '15 at 10:27
  • @BlueRay010: Generally you can't speed up a request much. You are at the mercy of the server that is responding, and the network between you and the server. When you send the request using a socket, do you get the same response back? If the server sends you an error message instead, it could be a lot faster than producing a web page. – Guffa Feb 03 '15 at 10:30
  • Yes, I hadn't thought about it being faster because of the server-side error output instead of the page requested. Thanks a lot! – BlueRay101 Feb 03 '15 at 10:37