I can parse the document and generate output; however, the output cannot be parsed into an XElement because of a p tag. Everything else within the string is parsed correctly.

My input:

var input = "<p> Not sure why is is null for some wierd reason!<br><br>I have implemented the auto save feature, but does it really work after 100s?<br></p> <p> <i>Autosave?? </i> </p> <p>we are talking...</p><p></p><hr><p><br class=\"GENTICS_ephemera\"></p>";

My code:

public static XElement CleanupHtml(string input)
{
    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

    htmlDoc.OptionOutputAsXml = true;
    //htmlDoc.OptionWriteEmptyNodes = true;
    //htmlDoc.OptionAutoCloseOnEnd = true;
    htmlDoc.OptionFixNestedTags = true;

    htmlDoc.LoadHtml(input);

    // ParseErrors is an ArrayList containing any errors from the Load statement
    if (htmlDoc.ParseErrors != null && htmlDoc.ParseErrors.Count() > 0)
    {
    }
    else
    {
        if (htmlDoc.DocumentNode != null)
        {
            var ndoc = new HtmlDocument(); // HTML doc instance
            HtmlNode p = ndoc.CreateElement("body");

            p.InnerHtml = htmlDoc.DocumentNode.InnerHtml;
            var result = p.OuterHtml.Replace("<br>", "<br/>");
            result = result.Replace("<br class=\"special_class\">", "<br/>");
            result = result.Replace("<hr>", "<hr/>");
            return XElement.Parse(result, LoadOptions.PreserveWhitespace);
        }
    }
    return new XElement("body");
}

My output:

<body>
   <p> Not sure why is is null for some wierd reason chappy!
   <br/>
   <br/>I have implemented the auto save feature, but does it really work after 100s?
   <br/>
   </p> 
   <p> 
   <i>Autosave?? </i> 
   </p> 
   <p>we are talking...</p>
   **<p>**
   <hr/>
   <p>
   <br/>
   </p>
</body>

The bold p tag is the one that did not output correctly... Is there a way around this? Am I doing something wrong with the code?

Haroon

2 Answers

What you are trying to do is basically transform an HTML input into an XML output.

Html Agility Pack can do that when you use the OptionOutputAsXml option, but in this case you should not use the InnerHtml property. Instead, let the Html Agility Pack do the groundwork for you with one of HtmlDocument's Save methods.

Here is a generic function to convert HTML text to an XElement instance:

public static XElement HtmlToXElement(string html)
{
    if (html == null)
        throw new ArgumentNullException("html");

    HtmlDocument doc = new HtmlDocument();
    doc.OptionOutputAsXml = true;
    doc.LoadHtml(html);
    using (StringWriter writer = new StringWriter())
    {
        doc.Save(writer);
        using (StringReader reader = new StringReader(writer.ToString()))
        {
            return XElement.Load(reader);
        }
    }
}

As you can see, you don't have to do much work yourself! Please note that since your original input text has no root element, the Html Agility Pack will automatically add an enclosing SPAN to ensure the output is valid XML.

In your case, you additionally want to process some tags, so here is how to do it with your example:

    public static XElement CleanupHtml(string input)
    {
        if (input == null)
            throw new ArgumentNullException("input");

        HtmlDocument doc = new HtmlDocument();
        doc.OptionOutputAsXml = true;
        doc.LoadHtml(input);

        // extra processing, remove some attributes using DOM
        HtmlNodeCollection coll = doc.DocumentNode.SelectNodes("//br[@class='special_class']");
        if (coll != null)
        {
            foreach (HtmlNode node in coll)
            {
                node.Attributes.Remove("class");
            }
        }

        using (StringWriter writer = new StringWriter())
        {
            doc.Save(writer);
            using (StringReader reader = new StringReader(writer.ToString()))
            {
                return XElement.Load(reader);
            }
        }
    }

As you can see, you should not use raw string functions, but instead use the Html Agility Pack DOM functions (SelectNodes, Add, Remove, etc.).

Simon Mourier
  • This works, very strange why I have to save to get the correct output, anyway - how would I handle nbsp; if it was contained in the input? Would you recommend I use the anti.xss library alongside this? – Haroon Mar 18 '11 at 11:38
  • +1 I didn't even know about `OptionOutputAsXml` (and its use case) – BrokenGlass Mar 18 '11 at 13:13
  • Doesn't seem like HtmlAgilityPack is particularly reliable in its conversion, e.g. I get this error: 6XmlException '', hexadecimal value 0x03, is an invalid character. Line 2081, position 822. LineNumber 2081 LinePosition 822 – Bent Rasmussen Dec 25 '11 at 11:16
  • Please post a new question, if it's a new problem. – Simon Mourier Dec 25 '11 at 15:16
  • Too much overhead using both `StringWriter` and `StringReader`. Just use a `MemoryStream` and reset the position. It's better than allocating that temporary string using `ToString()` – Baccata Oct 10 '22 at 10:44

If you check the documentation comments for OptionFixNestedTags you will see the following:

//     Defines if LI, TR, TH, TD tags must be partially fixed when nesting errors
//     are detected. Default is false.

So I don't think this will help you with unclosed HTML p tags. According to an old SO question, "C# library to clean up html", HTML Tidy might work for this purpose, though.
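
To see why a single unclosed p tag is enough to break the final step, here is a minimal sketch that uses only System.Xml.Linq, with no Html Agility Pack involved. The class and method names (XmlCheck, IsWellFormed) are hypothetical helpers made up for this illustration; the only real API call is XElement.Parse, which throws an XmlException when the markup is not well-formed XML.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper for illustration only, not part of any library.
public static class XmlCheck
{
    // Returns true when the markup parses as XML,
    // false when XElement.Parse rejects it with an XmlException.
    public static bool IsWellFormed(string markup)
    {
        try
        {
            XElement.Parse(markup, LoadOptions.PreserveWhitespace);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

Calling XmlCheck.IsWellFormed with a fragment such as "<body><p></p><hr/></body>" succeeds, while the same fragment with the closing p tag dropped, "<body><p><hr/></body>", is rejected, which matches the failure described in the question.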

BrokenGlass
  • thanks for the info... the problem with my input is I have a valid < p > < / p> tag but it is not processed correctly, they are just empty elements! < p > < / p> changes into < p > – Haroon Mar 18 '11 at 09:06