HTML Agility Pack error parsing and returning XElement

Question

I can parse the document and generate output; however, the output cannot be parsed into an XElement because of a p tag. Everything else within the string is parsed correctly.

My input:

var input = "<p> Not sure why is is null for some wierd reason!<br><br>I have implemented the auto save feature, but does it really work after 100s?<br></p> <p> <i>Autosave?? </i> </p> <p>we are talking...</p><p></p><hr><p><br class=\"GENTICS_ephemera\"></p>";

My code:

public static XElement CleanupHtml(string input)
{
    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

    htmlDoc.OptionOutputAsXml = true;
    //htmlDoc.OptionWriteEmptyNodes = true;             
    //htmlDoc.OptionAutoCloseOnEnd = true;
    htmlDoc.OptionFixNestedTags = true;

    htmlDoc.LoadHtml(input);

    // ParseErrors is an ArrayList containing any errors from the Load statement
    if (htmlDoc.ParseErrors != null && htmlDoc.ParseErrors.Count() > 0)
    {

    }
    else
    {

        if (htmlDoc.DocumentNode != null)
        {
            var ndoc = new HtmlDocument(); // HTML doc instance
            HtmlNode p = ndoc.CreateElement("body");  

            p.InnerHtml = htmlDoc.DocumentNode.InnerHtml;
            var result = p.OuterHtml.Replace("<br>", "<br/>");
            result = result.Replace("<br class=\"special_class\">", "<br/>");
            result = result.Replace("<hr>", "<hr/>");
            return XElement.Parse(result, LoadOptions.PreserveWhitespace);
        }
    }
    return new XElement("body");

}

My output:

<body>
   <p> Not sure why is is null for some wierd reason chappy!
   <br/>
   <br/>I have implemented the auto save feature, but does it really work after 100s?
   <br/>
   </p> 
   <p> 
   <i>Autosave?? </i> 
   </p> 
   <p>we are talking...</p>
   **<p>**
   <hr/>
   <p>
   <br/>
   </p>
</body>

The bold p tag is the one that did not output correctly... Is there a way around this? Am I doing something wrong with the code?

score 9 · Accepted Answer · answered Mar 18 '11 at 08:04

What you are trying to do is basically transform an HTML input into an XML output.

Html Agility Pack can do that when you use the OptionOutputAsXml option, but in that case you should not use the InnerHtml property. Instead, let the Html Agility Pack do the groundwork for you with one of HtmlDocument's Save methods.

Here is a generic function to convert HTML text to an XElement instance:

public static XElement HtmlToXElement(string html)
{
    if (html == null)
        throw new ArgumentNullException("html");

    HtmlDocument doc = new HtmlDocument();
    doc.OptionOutputAsXml = true;
    doc.LoadHtml(html);
    using (StringWriter writer = new StringWriter())
    {
        doc.Save(writer);
        using (StringReader reader = new StringReader(writer.ToString()))
        {
            return XElement.Load(reader);
        }
    }
}

As you can see, you don't have to do much work yourself! Note that since your original input text has no root element, the Html Agility Pack will automatically add one enclosing SPAN to ensure the output is valid XML.
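If a `<body>` root is wanted, as in the question's expected output, instead of the SPAN wrapper, the wrapper's children can be copied into a new element using LINQ to XML alone. This is only a sketch: `RootRewrapper` and `WrapInBody` are hypothetical names that do not exist in the Html Agility Pack, and it assumes any attributes on the wrapper itself can be discarded.

```csharp
using System.Xml.Linq;

public static class RootRewrapper
{
    // Hypothetical helper (not part of Html Agility Pack): builds a new <body>
    // element holding copies of the child nodes of the given wrapper element.
    // Attributes on the wrapper itself are intentionally dropped.
    public static XElement WrapInBody(XElement wrapper)
    {
        // Nodes that already have a parent are cloned by the XElement
        // constructor, so the original wrapper is left unchanged.
        return new XElement("body", wrapper.Nodes());
    }
}
```

It could be called as `XElement body = RootRewrapper.WrapInBody(HtmlToXElement(input));`. Since the wrapper is added because the input has no root element, an input that already has a single root may come back without one, so it is worth checking the returned element's name before rewrapping.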

In your case, you also want to process some tags, so here is how to do it with your example:

    public static XElement CleanupHtml(string input)
    {
        if (input == null)
            throw new ArgumentNullException("input");

        HtmlDocument doc = new HtmlDocument();
        doc.OptionOutputAsXml = true;
        doc.LoadHtml(input);

        // extra processing, remove some attributes using DOM
        HtmlNodeCollection coll = doc.DocumentNode.SelectNodes("//br[@class='special_class']");
        if (coll != null)
        {
            foreach (HtmlNode node in coll)
            {
                node.Attributes.Remove("class");
            }
        }

        using (StringWriter writer = new StringWriter())
        {
            doc.Save(writer);
            using (StringReader reader = new StringReader(writer.ToString()))
            {
                return XElement.Load(reader);
            }
        }
    }

As you can see, you should not use raw string functions, but instead use the Html Agility Pack DOM functions (SelectNodes, Add, Remove, etc.).

This works, very strange why I have to save to get the correct output, anyway - how would I handle nbsp; if it was contained in the input? Would you recommend I use the anti.xss library alongside this? — Haroon, Mar 18 '11 at 11:38
+1 I didn't even know about `OptionOutputAsXml` (and its use case) — BrokenGlass, Mar 18 '11 at 13:13
Doesn't seem like HtmlAgilityPack is particularly reliable in its conversion, e.g. I get this error: 6XmlException '', hexadecimal value 0x03, is an invalid character. Line 2081, position 822. LineNumber 2081 LinePosition 822 — Bent Rasmussen, Dec 25 '11 at 11:16
Too much overhead using both `StringWriter` and `StringReader`. Just use a `MemoryStream` and reset the position. It's better than allocating that temporary string using `ToString()` — Baccata, Oct 10 '22 at 10:44

score 2 · Answer 2 · edited May 23 '17 at 12:06

If you check the documentation comments for OptionFixNestedTags you will see the following:

//     Defines if LI, TR, TH, TD tags must be partially fixed when nesting errors
//     are detected. Default is false.

So I don't think this will help you with unclosed HTML p tags. According to an old SO question, "C# library to clean up html", HTML Tidy might work for this purpose, though.

answered Mar 17 '11 at 17:43

BrokenGlass


thanks for the info... the problem with my input is I have a valid tag but it is not processed correctly, they are just empty elements! changes into – Haroon Mar 18 '11 at 09:06
