XmlDocument failed to load XHTML string because of error "Reference to undeclared entity 'nbsp'"

Question

I use the following code to turn an HTTP response stream into an XmlDocument.

HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
HttpWebResponse response = request.GetResponse() as HttpWebResponse;
Stream responseStream = response.GetResponseStream();
StreamReader responseReader = new StreamReader(responseStream);
String responseString = responseReader.ReadToEnd();
Console.WriteLine(responseString);
Int32 htmlTagIndex = responseString.IndexOf("<html",
   StringComparison.OrdinalIgnoreCase);
XmlDocument responseXhtml = new XmlDocument();
responseString = responseString.Substring(htmlTagIndex); // MARK 1
responseString = responseString.Replace("&nbsp;", " "); // MARK 2
responseXhtml.LoadXml(responseString);
return responseXhtml;

The MARK 1 line skips the DOCTYPE declaration.

The MARK 2 line avoids the error "Reference to undeclared entity 'nbsp'".

Is there a better way to do this? There are too many string operations in the above code.

Thanks!

HTML Agility Pack: http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c — RB., Oct 10 '12 at 15:11
Thanks, but the HTML Agility Pack seems like overkill. Is there any simpler code? — smwikipedia, Oct 10 '12 at 15:13
(X)HTML is usually not XML. `&nbsp;` is an entity defined in HTML. Do you really need to load this as XML? — CodeCaster, Oct 10 '12 at 15:13

score 6 · Accepted Answer · edited Nov 28 '17 at 18:12

I would use HtmlAgilityPack directly to parse the HTML. Even if you need to convert the HTML to XML, it can do that.

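using System.IO;
using System.Net;
using System.Xml.Linq;

using (WebClient wc = new WebClient())
{
    // Parse the downloaded page with HtmlAgilityPack's lenient HTML parser.
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(wc.DownloadString("http://www.google.com"));
    doc.OptionOutputAsXml = true; // make Save() emit XML

    // Serialize the parsed document to a string as XML.
    StringWriter writer = new StringWriter();
    doc.Save(writer);

    // Load the XML output into an XDocument.
    var xDoc = XDocument.Load(new StringReader(writer.ToString()));
}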
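
If pulling in HtmlAgilityPack is not an option and the response is known to be well-formed XHTML, one alternative to the string replacement in the question is to declare the entity so the XML parser can resolve it. The following is a minimal sketch under that assumption and is not part of the original answer. `XhtmlLoader` and `Load` are hypothetical names, and the input is assumed to already start at the `<html` tag (as it does after the question's MARK 1 line), because a second DOCTYPE in the input would be an error. It only covers `&nbsp;`; any other HTML entity would need its own declaration.

```csharp
using System;
using System.IO;
using System.Xml;

public static class XhtmlLoader
{
    // Hypothetical helper: loads markup that starts at the root <html> element
    // and may contain &nbsp; references.
    public static XmlDocument Load(string markupFromHtmlTag)
    {
        // An internal DTD subset that declares nbsp as U+00A0 (no-break space).
        const string doctype = "<!DOCTYPE html [ <!ENTITY nbsp \"&#160;\"> ]>";

        var settings = new XmlReaderSettings
        {
            // Allow the reader to process the internal subset declared above.
            DtdProcessing = DtdProcessing.Parse,
            // No resolver: nothing external is fetched while parsing.
            XmlResolver = null
        };

        using (var text = new StringReader(doctype + markupFromHtmlTag))
        using (var reader = XmlReader.Create(text, settings))
        {
            var doc = new XmlDocument();
            doc.Load(reader);
            return doc;
        }
    }
}
```

With this in place, the tail of the question's code could become `return XhtmlLoader.Load(responseString.Substring(htmlTagIndex));`, which drops the MARK 2 replacement. This still fails on markup that is not well-formed XML, which is the case HtmlAgilityPack handles.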
