I'm making a web crawler and I just found out that one of my methods, GetHTML, is very slow because it uses a StreamReader to read the HTML out of the HttpWebResponse object as a string.
Here is the method:
static string GetHTML(string URL)
{
    HttpWebRequest Request = (HttpWebRequest)WebRequest.Create(URL);
    Request.Proxy = null;
    using (HttpWebResponse Response = (HttpWebResponse)Request.GetResponse())
    using (Stream RespStream = Response.GetResponseStream())
    using (StreamReader Reader = new StreamReader(RespStream))
    {
        return Reader.ReadToEnd(); // Very slow
    }
}
I timed the method with a Stopwatch against YouTube:
Time to get the HTTP response: 500 ms
Time to convert the HttpWebResponse object to a string: 550 ms
So the HTTP request itself is fine; it's just the ReadToEnd() call that is slow.
Is there any alternative to ReadToEnd() for getting an HTML string out of the response object? I tried the WebClient.DownloadString() method, but it's just a wrapper around HttpWebRequest that uses streams internally as well.
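One thing worth checking before swapping APIs: GetResponse() typically returns as soon as the status line and headers arrive, so the response body is actually transferred from the network *during* ReadToEnd(). A small sketch like the one below (the YouTube URL is just an example target) splits the timing to show where the time really goes:

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Net;

class TimingSketch
{
    static void Main()
    {
        string url = "https://www.youtube.com/"; // example target
        Stopwatch sw = Stopwatch.StartNew();

        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Proxy = null;

        // GetResponse() completes once the headers have arrived;
        // the body has not necessarily been downloaded yet.
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            Console.WriteLine("Headers received after {0} ms", sw.ElapsedMilliseconds);

            using (Stream stream = response.GetResponseStream())
            using (StreamReader reader = new StreamReader(stream))
            {
                // ReadToEnd() is where the body bytes are pulled off the
                // network, so its time includes the transfer itself.
                string html = reader.ReadToEnd();
                Console.WriteLine("Body ({0} chars) read after {1} ms",
                                  html.Length, sw.ElapsedMilliseconds);
            }
        }
    }
}
```

If the second timestamp dominates, the cost is network transfer rather than StreamReader overhead, and no string-reading alternative will change it.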
EDIT: I tried it with raw Sockets and it's much faster:
static string SocketHTML(string URL)
{
    // Note: URL must be a bare host name (e.g. "www.youtube.com") for
    // Dns.GetHostAddresses to resolve it, not a full URL.
    string IP = Dns.GetHostAddresses(URL)[0].ToString();
    Socket s = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
    s.Connect(new IPEndPoint(IPAddress.Parse(IP), 80));
    s.Send(Encoding.ASCII.GetBytes("GET / HTTP/1.1\r\n\r\n"));

    List<byte> HTML = new List<byte>();
    byte[] Data = new byte[1024];
    int Bytes = 1;
    while (Bytes > 0)
    {
        Bytes = s.Receive(Data);
        // Copy only the bytes actually received, not the whole buffer
        for (int i = 0; i < Bytes; i++) HTML.Add(Data[i]);
    }
    s.Close();
    return Encoding.ASCII.GetString(HTML.ToArray());
}
The problem with the Socket version, though, is that it usually returns error responses such as "Moved Permanently" or "Your browser sent a request that the server could not understand".
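Those errors are most likely because HTTP/1.1 requires a Host header, and a bare "GET / HTTP/1.1" request omits it, so servers that host several sites on one IP can't tell which site is wanted. A sketch of a better-formed raw request, assuming "www.youtube.com" and path "/" purely as examples (a site that forces HTTPS will still answer port 80 with a redirect):

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Sockets;
using System.Text;

class RawRequestSketch
{
    static string SocketHTML(string host, string path)
    {
        string ip = Dns.GetHostAddresses(host)[0].ToString();
        Socket s = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
        s.Connect(new IPEndPoint(IPAddress.Parse(ip), 80));

        // HTTP/1.1 mandates the Host header; "Connection: close" makes the
        // server close the socket when done, which ends the Receive loop.
        string request = "GET " + path + " HTTP/1.1\r\n" +
                         "Host: " + host + "\r\n" +
                         "Connection: close\r\n" +
                         "\r\n";
        s.Send(Encoding.ASCII.GetBytes(request));

        List<byte> html = new List<byte>();
        byte[] buffer = new byte[1024];
        int bytes;
        while ((bytes = s.Receive(buffer)) > 0)
            for (int i = 0; i < bytes; i++) html.Add(buffer[i]);

        s.Close();
        return Encoding.ASCII.GetString(html.ToArray());
    }

    static void Main()
    {
        Console.WriteLine(SocketHTML("www.youtube.com", "/")); // example host and path
    }
}
```

Even with a valid request, the raw-socket version returns the unparsed response (status line and headers included) and handles neither redirects, chunked transfer encoding, compression, nor HTTPS, which the HttpWebRequest path does for you.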