Convert all HTML entities not predefined for XML to Unicode

Question

I am trying to manipulate a string containing HTML code and then save the content to an .htm file. Afterwards the .htm file is imported into a Word file. The goal is to append a document formatted in HTML to a Word document. This process is part of a much larger program and I cannot modify the given parameters.

To modify the HTML code easily, I thought using XDocument would be a great idea.
So I tried this:

void AppendContent(string content, Document doc)
{
    string filePath = ...; //somewhere in /AppData/Local

    var xDoc = XDocument.Parse(content);

    // code left out because irrelevant    
    // Finding all "img" elements, in order to 
    // extract the embedded picture and save it as external file

    FileHelper.SaveToFile(filePath, xDoc.ToString());
    //... After this, the file is appended to the word file (the one in doc)
}

The first attempt actually worked with a small test HTML file. However, any of the large documents I actually need to append to the Word document causes an exception to be thrown.

XDocument.Parse cannot parse entities like "nbsp" or "uuml" (German ü). I already found out that XML only supports a handful of predefined entities, so I would have to add the entity definitions to the HTML file manually. This is not an option, because this operation is supposed to work with ANY HTML file.

I found the following fix:

var decodedContent = WebUtility.HtmlDecode(content);
var xDoc = XDocument.Parse(decodedContent);

This converts every entity to the character it represents, so "uuml" is converted to "ü", etc. This worked until I hit a document that contained the "amp" entity, which is then converted to a bare "&", and so XDocument.Parse complains again.
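To make the two failure modes concrete, here is a minimal reproduction sketch. The class name `EntityRepro`, the helper `TryParse`, and the sample strings are made up for illustration and are not taken from the real documents.

```csharp
using System.Net;
using System.Xml;
using System.Xml.Linq;

static class EntityRepro
{
    // Returns true if XDocument.Parse accepts the markup,
    // false if it throws an XmlException.
    public static bool TryParse(string markup)
    {
        try
        {
            XDocument.Parse(markup);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    // Decodes ALL entities first (the attempted fix), then parses.
    public static bool TryParseDecoded(string markup)
    {
        return TryParse(WebUtility.HtmlDecode(markup));
    }
}
```

With the hypothetical input `<p>f&uuml;r&nbsp;alle</p>`, `TryParse` fails because the entities are undeclared in XML, while `TryParseDecoded` succeeds. With `<p>Tom &amp; Jerry</p>` it is the other way round: `TryParse` succeeds, but `TryParseDecoded` fails because decoding leaves a bare `&` in the markup.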

I'm looking for a way to convert HTML entities to a Unicode representation ("\0x1234"), or for an HTML decode that does not decode the XML-predefined entities.
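One possible direction, as a rough sketch rather than a verified solution: replace each named entity with a numeric character reference (for example `&#xFC;`), which XML accepts without any entity declarations, and skip the five entities XML predefines. The class `HtmlEntityHelper` and the method `NamedEntitiesToNumeric` are hypothetical names. The sketch assumes that `WebUtility.HtmlDecode` knows every entity name that occurs in the input, and that the HTML is otherwise well-formed XML.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

static class HtmlEntityHelper
{
    // The five entities XML predefines; these must stay escaped.
    private static readonly HashSet<string> XmlPredefined =
        new HashSet<string>(StringComparer.Ordinal) { "amp", "lt", "gt", "quot", "apos" };

    // Matches named references such as &nbsp; or &uuml;.
    // Numeric references like &#252; do not match and are left alone.
    private static readonly Regex NamedEntity =
        new Regex(@"&([A-Za-z][A-Za-z0-9]*);", RegexOptions.Compiled);

    public static string NamedEntitiesToNumeric(string html)
    {
        return NamedEntity.Replace(html, match =>
        {
            string name = match.Groups[1].Value;
            if (XmlPredefined.Contains(name))
                return match.Value; // keep &amp;, &lt;, ... exactly as written

            string decoded = WebUtility.HtmlDecode(match.Value);
            if (decoded == match.Value)
                return match.Value; // name unknown to the decoder: leave untouched

            // Emit one hexadecimal character reference per Unicode code point.
            var sb = new StringBuilder();
            for (int i = 0; i < decoded.Length; i++)
            {
                int codePoint = char.ConvertToUtf32(decoded, i);
                if (char.IsHighSurrogate(decoded[i]))
                    i++; // skip the low half of a surrogate pair
                sb.Append("&#x").Append(codePoint.ToString("X")).Append(';');
            }
            return sb.ToString();
        });
    }
}
```

In `AppendContent` this would be called as `XDocument.Parse(HtmlEntityHelper.NamedEntitiesToNumeric(content))` instead of decoding with `WebUtility.HtmlDecode` first. Note that the regex also rewrites entity-like text inside script blocks or CDATA sections, which may or may not be wanted.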

possible duplicate of [Converting Html utf-8 charset to ISO-8859-1 via C#](http://stackoverflow.com/questions/11363589/converting-html-utf-8-charset-to-iso-8859-1-via-c-sharp) — Paul Sweatte, Sep 07 '15 at 06:10
