Using C# to convert incorrect html string to real html

Question

My original issue is that I am trying to serialize a string containing html tags to an XML element.

hello <a href="world.php">World</a>, this

is
a nice
test

<ul>
  <li>to demonstrate my issue</li>
  <li>and find a solution</li>
</ul>

However, I have 2 issues

1. Serializing HTML to XML: I did not succeed in defining the Serializable class so that it serializes correctly with XmlSerializer, so I decided that using CDATA sections might be the better way. However, this is not correctly deserialized by the target tool (which I have no influence on). What I need is plain, correct HTML (XHTML?) within the XML output file.

2. The string looks like the example above, but it is not fully correct HTML (no <p> tags, no <br> tags). I would like to replace the newlines with <p> or <br> tags. I had a look here and used the suggested solution:

    string result = "<p>" + text
     .Replace(Environment.NewLine + Environment.NewLine, "</p><p>")
     .Replace(Environment.NewLine, "<br />")
     .Replace("</p><p>", "</p>" + Environment.NewLine + "<p>") + "</p>";

However, this does not generate valid HTML in all cases. In the example above, it would create <br />s between the <li> tags and put the <ul> inside a <p>, neither of which is allowed.

Target would be to have a result like the following (line breaks are only for better readability and don't matter here)

<p>hello <a href="world.php">World</a>, this</p>
<p>is<br/>
a nice<br/>
test<br/></p>
<ul>
  <li>to demonstrate my issue</li>
  <li>and find a solution</li>
</ul>

Do you have any suggestion how to solve this either with a string.Replace, Regex, or better solution (HtmlDocument)?

Please note: I have no influence on deserialization. The XML output is evaluated by a tool I cannot change, and it has to be UTF-8 encoded.

Thank you!

EDIT: Clearly separated the 2 issues

EDIT2: No influence on deserialization

EDIT3: Added target output

Is your question _"How can I transport HTML inside XML"_, or _"How to make valid HTML from invalid HTML"_? Or "whichever is easier"? — CodeCaster, Sep 07 '15 at 12:48
Valid question, actually both but with more priority on the 2nd. I have separated the 2 topics above — nogenius, Sep 07 '15 at 13:09

score 3 · Answer 1 · edited May 23 '17 at 12:22

What you're trying to do is implement a "tag soup parser", which takes text that may or may not be HTML as input and transforms it into a valid DOM that an HTML parser can handle.

You don't want to reinvent this wheel, most definitely not with simple string replaces. See How to parse bad html? for some hints.

Or you can just encode the input HTML in such a way that it doesn't interfere with the XML that you're trying to put it in: a CDATA section or base64-encoding the input would suffice. Don't use "entity encoding", as your XML parser is going to complain about HTML entities that aren't XML entities.
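
A minimal sketch of the two encodings mentioned above, using only System.Xml.Linq. The class name `HtmlInXml` and the element name `content` are hypothetical and not taken from the question; the real element name depends on the schema the target tool expects. The output is written as UTF-8 without a byte order mark, since the question asks for UTF-8.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

public static class HtmlInXml
{
    // Wraps the HTML fragment in a CDATA section inside a hypothetical <content> element.
    public static string WrapInCData(string html)
    {
        var doc = new XDocument(new XElement("content", new XCData(html)));
        return Serialize(doc);
    }

    // Stores the HTML fragment as base64 text, so no markup characters reach the XML at all.
    public static string WrapAsBase64(string html)
    {
        var encoded = Convert.ToBase64String(Encoding.UTF8.GetBytes(html));
        var doc = new XDocument(new XElement("content", encoded));
        return Serialize(doc);
    }

    // Writes the document as UTF-8 (no BOM) and returns it as a string.
    private static string Serialize(XDocument doc)
    {
        var settings = new XmlWriterSettings { Encoding = new UTF8Encoding(false) };
        using (var stream = new MemoryStream())
        {
            using (var writer = XmlWriter.Create(stream, settings))
            {
                doc.Save(writer);
            }
            return Encoding.UTF8.GetString(stream.ToArray());
        }
    }
}
```

Either variant only helps if the consumer decodes it again. If the target tool needs real child elements rather than text, the HTML has to be repaired into well-formed XHTML first.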


answered Sep 07 '15 at 12:46

CodeCaster

Here the HtmlAgilityPack is mentioned, however I can't find any proper documentation. Does it provide a way to correct unallowed sequences or nestings (e.g. ul within p or br within ul)? – nogenius Sep 08 '15 at 13:15

score 0 · Answer 2 · answered Sep 07 '15 at 12:58

I've had to do similar (ensuring 3rd party content has valid HTML). If I was doing this, I'd do the following:

1) Replace line breaks with HTML line breaks

    string result = text.Replace(Environment.NewLine, "<br />");

2) Use the HtmlAgilityPack to fix any invalid HTML

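    // Requires the HtmlAgilityPack NuGet package, plus:
    // using System.Linq; using HtmlAgilityPack;
    var doc = new HtmlDocument();
    // Mark <p> as an element that is closed automatically during parsing.
    HtmlNode.ElementsFlags["p"] = HtmlElementFlag.Closed;
    doc.OptionFixNestedTags = false;
    doc.LoadHtml(result);

    if (doc.ParseErrors.Any())
    {
        // throw error
    }
    else
    {
        // get fixed html
        result = doc.DocumentNode.OuterHtml;
    }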
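
Step 1 above inserts a `<br />` for every newline, including those inside lists. A possible paragraph-aware variant, given here only as a sketch and not as part of the original answer: split the text on blank lines, leave blocks that already start with a block-level tag untouched, and wrap everything else in `<p>` with `<br/>` for the inner newlines. The class name `ParagraphWrapper` and the list of block-level tags are assumptions. It does not handle block-level markup that appears in the middle of a text block, and it does not emit a trailing `<br/>` before `</p>`.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ParagraphWrapper
{
    // Assumption: a block starting with one of these tags is already block-level markup.
    private static readonly Regex BlockStart = new Regex(
        @"^\s*<(ul|ol|table|div|p|h[1-6]|pre|blockquote)\b",
        RegexOptions.IgnoreCase);

    public static string Wrap(string text)
    {
        // Normalize Windows line endings, then split on one or more blank lines.
        var normalized = text.Replace("\r\n", "\n").Trim();
        var blocks = Regex.Split(normalized, @"\n\s*\n");

        var parts = blocks
            .Select(b => b.Trim())
            .Where(b => b.Length > 0)
            .Select(b => BlockStart.IsMatch(b)
                ? b                                              // keep lists etc. as they are
                : "<p>" + b.Replace("\n", "<br/>\n") + "</p>");  // plain text becomes a paragraph

        return string.Join("\n", parts);
    }
}
```

The result can still be passed through HtmlAgilityPack afterwards to catch anything this simple splitting misses.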

answered Sep 07 '15 at 12:58

Jack

This does not fully solve my issue: 1. This does not consider paragraphs and adds `<br />` tags also within `<ul>` – nogenius Sep 08 '15 at 13:26
can you provide a sample of the input and output? It should remove the invalid `<br />`'s added by the replace... – Jack Sep 08 '15 at 15:31
For the input, see above, and also note the newlines that are not marked with `<br>`s or `<p>`s. Therefore the newlines are not considered in the target tool. I will update my post to show the target output – nogenius Sep 08 '15 at 15:53
