I use the following code to load an HTTP response stream into an XmlDocument.

HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
HttpWebResponse response = request.GetResponse() as HttpWebResponse;
Stream responseStream = response.GetResponseStream();
StreamReader responseReader = new StreamReader(responseStream);
String responseString = responseReader.ReadToEnd();
Console.WriteLine(responseString);
Int32 htmlTagIndex = responseString.IndexOf("<html",
   StringComparison.OrdinalIgnoreCase);
XmlDocument responseXhtml = new XmlDocument();
responseString = responseString.Substring(htmlTagIndex); // MARK 1
responseString = responseString.Replace("&nbsp;", " "); // MARK 2 (match the whole entity, including ';')
responseXhtml.LoadXml(responseString);
return responseXhtml;

The MARK 1 line skips everything before the <html> tag, including the DOCTYPE declaration.

The MARK 2 line avoids the error "Reference to undeclared entity 'nbsp'".
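For context, the error appears because XML predefines only five entities (&lt;, &gt;, &amp;, &quot;, &apos;), and nbsp is declared only in the HTML/XHTML DTD that MARK 1 strips. A minimal sketch of one possible workaround, assuming the only problematic entity is &nbsp;, is to swap it for its numeric character reference, which needs no DTD. The class NbspDemo and the helper LoadXhtml are hypothetical names used only for illustration.

```csharp
using System;
using System.Xml;

public static class NbspDemo
{
    // Hypothetical helper: replaces the HTML-only entity with its numeric
    // character reference (U+00A0), which an XML parser accepts without a DTD.
    public static XmlDocument LoadXhtml(string xhtml)
    {
        string cleaned = xhtml.Replace("&nbsp;", "&#160;");
        var doc = new XmlDocument();
        doc.LoadXml(cleaned);
        return doc;
    }

    public static void Main()
    {
        const string sample = "<html><body><p>a&nbsp;b</p></body></html>";

        try
        {
            // Without the replacement, the parser rejects the unknown entity.
            new XmlDocument().LoadXml(sample);
        }
        catch (XmlException ex)
        {
            Console.WriteLine("Failed as expected: " + ex.Message);
        }

        // With the replacement, the text keeps a real non-breaking space.
        XmlDocument doc = LoadXhtml(sample);
        Console.WriteLine(doc.SelectSingleNode("//p").InnerText);
    }
}
```

Unlike replacing the entity with a plain space, this keeps the non-breaking space in the parsed text. Other HTML entities (for example &copy;) would still fail, so this is not a general fix.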

Is there a better way to do this? There are too many string operations in the code above.

Thanks!

smwikipedia

1 Answer

I would use HtmlAgilityPack to parse the HTML directly. It also works if you need to convert the HTML to XML.

using (WebClient wc = new WebClient())
{
    // Parse the downloaded HTML with HtmlAgilityPack.
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(wc.DownloadString("http://www.google.com"));
    // Emit XML rather than HTML when the document is saved.
    doc.OptionOutputAsXml = true;

    StringWriter writer = new StringWriter();
    doc.Save(writer);

    // Load the XML text into an XDocument.
    var xDoc = XDocument.Load(new StringReader(writer.ToString()));
}
carla
L.B