0

I am working on a scraping app and ran into a problem while trying to get it to work. For testing, I have replaced the original scraping target in the code below with Google's search page. The download does not seem to get everything: the body and html tags are missing their closing tags. How do I get it to download everything? What's wrong with my sample code?

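// Requires: using System.IO; using System.Net; using System.Text; using System.Web;
string filename = "test.html";
string data;

// WebClient is IDisposable, so a using block releases it when done.
using (WebClient client = new WebClient())
{
    // URL-encode the search term before adding it to the query string.
    string searchTerm = HttpUtility.UrlEncode(textBox2.Text);
    client.QueryString.Add("q", searchTerm);
    client.QueryString.Add("hl", "en");
    data = client.DownloadString("http://www.google.com/search");
}

// The using block flushes and closes the writer, even if Write throws.
using (StreamWriter writer = new StreamWriter(filename, false, Encoding.Unicode))
{
    writer.Write(data);
}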
balexandre
Brian Hvarregaard

3 Answers

4

Google's web pages are now written in HTML5, in which the closing tags for BODY and HTML are optional. That is why Google omits them (believe it or not, it saves them bandwidth).

See this article.

You can write HTML5 in either "HTML/SGML" syntax (which allows certain closing tags to be omitted, as HTML did prior to XHTML) or in "XHTML" syntax, which follows the rules of XML and requires all tags to be closed.

Which syntax the browser uses to parse the page depends on the Content-Type header you send: text/html for HTML/SGML syntax, or application/xhtml+xml for XHTML syntax. (Source: HTML5 syntax - HTML vs XHTML)
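To see which of the two syntaxes a downloaded page was served as, you could inspect the Content-Type response header. The following is only a minimal sketch: the class and method names (Html5Syntax, FromContentType) are hypothetical and not part of any library.

```csharp
using System;

public static class Html5Syntax
{
    // Hypothetical helper: maps a Content-Type header value to the syntax
    // a browser would use. Returns "XHTML", "HTML", or "unknown".
    public static string FromContentType(string contentType)
    {
        if (string.IsNullOrEmpty(contentType))
        {
            return "unknown";
        }

        // Drop parameters such as "; charset=UTF-8" and surrounding spaces.
        string mediaType = contentType.Split(';')[0].Trim();

        if (mediaType.Equals("application/xhtml+xml", StringComparison.OrdinalIgnoreCase))
        {
            return "XHTML";
        }
        if (mediaType.Equals("text/html", StringComparison.OrdinalIgnoreCase))
        {
            return "HTML";
        }
        return "unknown";
    }
}
```

With the question's code, the header value should be available after DownloadString returns, for example as Html5Syntax.FromContentType(client.ResponseHeaders["Content-Type"]).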

Community
Andy Shellam
0

...Google's page doesn't have the closing tags for <body> and <html>. Talk about crazy optimization...

Matti Virkkunen
0

http://www.google.com/search doesn't have closing tags.
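If you want to confirm this against the string you downloaded, rather than assume your code dropped something, a simple check along these lines may help. This is a rough sketch: EndTagCheck and HasEndTag are hypothetical names, and the search is a plain substring match that does not handle variants such as whitespace inside the closing tag.

```csharp
using System;

public static class EndTagCheck
{
    // Hypothetical helper: true if the markup contains a closing tag such as
    // </html>, ignoring case. This is a substring search, not an HTML parser.
    public static bool HasEndTag(string html, string tagName)
    {
        if (html == null)
        {
            throw new ArgumentNullException("html");
        }
        if (string.IsNullOrEmpty(tagName))
        {
            throw new ArgumentException("Tag name must not be empty.", "tagName");
        }

        string endTag = "</" + tagName + ">";
        return html.IndexOf(endTag, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

For example, EndTagCheck.HasEndTag(data, "html") returning false means that substring is absent from the text you received. On its own, that does not tell you whether the server omitted the tag or the response was cut short.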

Marcelo Cantos
  • @walther: The OP was complaining that close tags weren't being downloaded. I explained that they're not there to be downloaded. How is that not an answer? – Marcelo Cantos Aug 23 '12 at 23:52
  • Well, it's the same kind of answer as when the question is "how can I select an item in GridView?" and you reply with "yes, you can!". You're stating the obvious here without any further explanation of what's going on. That's why I don't find your post very useful. That's all there is to it ;-) Nothing personal. – walther Aug 24 '12 at 00:13
  • @walther: No worries, I don't take criticisms personally; I just don't agree with your assessment. Your analogy doesn't fit because the absence of closing tags wasn't at all obvious to the OP, who thought that their code was somehow dropping them. If the OP had asked why Google leaves out the tags, your criticism would have been well-founded (but then I wouldn't have answered in this fashion to begin with). – Marcelo Cantos Aug 24 '12 at 00:23