My original issue is that I am trying to serialize a string containing HTML tags to an XML element.

hello <a href="world.php">World</a>, this

is
a nice
test

<ul>
  <li>to demonstrate my issue</li>
  <li>and find a solution</li>
</ul>

However, I have two issues:

  1. Serializing HTML to XML: I did not succeed in defining the serializable class so that XmlSerializer serializes it correctly, so I decided that CDATA sections might be the better way. However, the target tool (which I have no influence on) does not deserialize these correctly. What I need is plain, correct HTML (XHTML?) within the XML output file.


  2. The string looks like the example above, but it is not fully correct HTML (no <p> tags, no <br> tags). I would like to replace the newlines with <p> or <br> tags. I have had a look here and used the suggested solution:
    string result = "<p>" + text
     .Replace(Environment.NewLine + Environment.NewLine, "</p><p>")
     .Replace(Environment.NewLine, "<br />")
     .Replace("</p><p>", "</p>" + Environment.NewLine + "<p>") + "</p>";

However, this does not generate valid HTML in all cases. In the example above, it would create <br />s between the <li> tags and put <ul> tags inside <p> tags, neither of which is allowed.

The target would be a result like the following (line breaks are only for better readability and don't matter here):

<p>hello <a href="world.php">World</a>, this</p>
<p>is<br/>
a nice<br/>
test<br/></p>
<ul>
  <li>to demonstrate my issue</li>
  <li>and find a solution</li>
</ul>

Do you have any suggestions on how to solve this, either with string.Replace, Regex, or a better solution (HtmlDocument)?

Please note: I have no influence on deserialization. The XML output is evaluated by a tool I have no influence on, and it has to be UTF-8 encoded.

Thank you!

EDIT: Clearly separated the 2 issues

EDIT2: No influence on deserialization

EDIT3: Added target output

nogenius

2 Answers

What you're trying to do is implement a "tag soup parser", which takes text that may or may not be HTML as input and transforms it into a valid DOM that an HTML parser can handle.

You don't want to reinvent this wheel, most definitely not with simple string replaces. See How to parse bad html? for some hints.


Or you can just encode the input HTML in such a way that it doesn't interfere with the XML you're trying to put it in: a CDATA section would do, and base64-encoding the input would also suffice. Don't use "entity encoding", as your XML parser is going to complain about HTML entities that aren't XML entities.
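
As a rough illustration of the two encodings mentioned above, here is a minimal sketch using LINQ to XML. The class name `HtmlPayload` and the element name `content` are placeholders made up for this example, and it assumes the consumer either understands CDATA sections or decodes the base64 text itself.

```csharp
using System;
using System.Text;
using System.Xml.Linq;

public static class HtmlPayload
{
    // Wraps the HTML in a CDATA section so the markup is not parsed as XML.
    // "content" is a placeholder element name, not something the target tool prescribes.
    public static XElement AsCData(string html) =>
        new XElement("content", new XCData(html));

    // Base64-encodes the UTF-8 bytes of the HTML; the consumer has to decode it again.
    public static XElement AsBase64(string html) =>
        new XElement("content", Convert.ToBase64String(Encoding.UTF8.GetBytes(html)));

    // Reverses AsBase64, mainly to show that the round trip is lossless.
    public static string FromBase64(XElement element) =>
        Encoding.UTF8.GetString(Convert.FromBase64String(element.Value));
}
```

Either way the HTML survives a round trip unchanged, for example `XElement.Parse(HtmlPayload.AsCData(html).ToString()).Value` gives back the original string. Whether the receiving tool accepts either form is a separate matter.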

CodeCaster
  • HtmlAgilityPack is mentioned there, but I can't find any proper documentation. Does it provide a way to correct disallowed sequences or nestings (e.g. ul within p or br within ul)? – nogenius Sep 08 '15 at 13:15

I've had to do similar (ensuring 3rd party content has valid HTML). If I was doing this, I'd do the following:

1) Replace line breaks with HTML line breaks

    string result = text.Replace(Environment.NewLine, "<br />");

2) Use HtmlAgilityPack to fix any invalid HTML

    // Needs: using System.Linq; using HtmlAgilityPack;
    var doc = new HtmlDocument();
    HtmlNode.ElementsFlags["p"] = HtmlElementFlag.Closed;
    doc.OptionFixNestedTags = false;
    doc.LoadHtml(result);

    if (doc.ParseErrors.Any())
    {
        // throw error
    }
    else
    {
        // get fixed html
        result = doc.DocumentNode.OuterHtml;
    }
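
For input as simple as the sample in the question, the newline handling can also be made block-aware before any parser gets involved. The following is only a sketch under that assumption: `ParagraphWrapper`, `ToParagraphs` and the list of block-level tags are made up for illustration, and a block that starts with plain text but contains a `<ul>` further down would still come out wrong.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ParagraphWrapper
{
    // Hypothetical, incomplete list of tags that must not end up inside <p>.
    private static readonly Regex BlockStart = new Regex(
        @"^\s*<(ul|ol|table|div|p|h[1-6]|pre|blockquote)\b",
        RegexOptions.IgnoreCase);

    public static string ToParagraphs(string text)
    {
        // Normalize line endings so the split works for both \r\n and \n.
        string normalized = text.Replace("\r\n", "\n").Trim();

        // Blocks are separated by one or more blank lines.
        var blocks = Regex.Split(normalized, @"\n\s*\n")
            .Where(b => b.Trim().Length > 0)
            .Select(b => BlockStart.IsMatch(b)
                ? b.Trim() // leave block-level markup untouched
                : "<p>" + b.Trim().Replace("\n", "<br/>\n") + "</p>");

        return string.Join("\n", blocks);
    }
}
```

With the sample input this wraps the two text blocks in `<p>` tags, turns the single newlines inside them into `<br/>`, and passes the `<ul>` block through unchanged. It is not a replacement for a real HTML parser on messier input.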
Jack
  • This does not fully solve my issue: 1. It does not consider paragraphs and also adds `<br />` tags within `<ul>`s (which is not valid). 2. It neither corrects `<ul>`s within `<p>` tags nor `<br />` tags within `<ul>`s. The behavior is the same even when I use `doc.OptionFixNestedTags = true;`
    – nogenius Sep 08 '15 at 13:26
  • Can you provide a sample of the input and output? It should remove the invalid `<br />`s added by the replace...
    – Jack Sep 08 '15 at 15:31
  • For the input, see above, and also note the newlines that are not marked with `<br>`s or `<p>`s. Therefore the newlines are not considered in the target tool. I will update my post to show the target output.
    – nogenius Sep 08 '15 at 15:53