0

I have a project where I am outputting content from a CMS into XML. I don't fully control the content in the CMS, and we now have a problem: certain content isn't well-formed XML.

    <Block PageGuid="xxx" PageId="1234" PageType="block" PageName="blockpage" PageUrl="/en/New-Folder7/New-Folder8/" CreateBlock="false">
      <Properties>
        <Property PropertyName="EmbedCode" Ignore="false" DefaultLanguageChanged="true" TranslatedChanged="true">
          <DefaultLanguage><DIV id=TA_sss class=TA_sss><UL id=sdfsdfsdfsdf class="TA_links xx"><LI id=sdfsdfsf class=sdfsfsf><A href="http://www.tripadvisor.co.uk/">xxxxxxxxx</A></LI></UL></DIV><SCRIPT src="http://www.jscache.com/"></SCRIPT></DefaultLanguage>
          <Translation><DIV id=TA_sss class=TA_sss><UL id=xxxx class='TA_links xxx'><LI id=xxxx class=xxxx><A href='http://www.tripadvisor.co.uk/'>xxxxxxxxx</A></LI></UL></DIV><SCRIPT src='http://www.jscache.com/'></SCRIPT></Translation>
          <PreviousValues>
            <PreviousDefaultText></PreviousDefaultText>
            <PreviousTranslationText></PreviousTranslationText>
          </PreviousValues>
        </Property>
      </Properties>
    </Block>

See the XML above. I need to find any attribute whose value is missing its quotes and add them in, e.g. `id=TA_sss` in the markup above.

I also need to find attributes whose values are in single quotes and replace those with double quotes, e.g.

    <A href='http://www.tripadvisor.co.uk/'>

I have the entire XML in a string, so I am hoping there is a Regex I can use to do this?
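For simple markup like the sample above, a regex pass can be sketched as follows. This is only a sketch under assumptions, not a robust HTML parser: it assumes no attribute value contains `<` or `>`, it does not escape a bare `&`, and an unquoted value directly followed by `/>` would swallow the slash. The class name `AttributeQuoteFixer` and its method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not from any library): quotes bare attribute values
// and converts single-quoted values to double-quoted ones.
public static class AttributeQuoteFixer
{
    // An opening tag such as <DIV id=x class='y'>. Closing tags (</DIV>) do not match.
    // Assumption: attribute values never contain '<' or '>'.
    private static readonly Regex OpeningTag = new Regex(@"<[A-Za-z][^<>]*>");

    // name = "value" | 'value' | bareValue
    private static readonly Regex AttributePattern = new Regex(
        @"(?<name>[\w:.-]+)\s*=\s*(?:""(?<dq>[^""]*)""|'(?<sq>[^']*)'|(?<bare>[^\s""'>]+))");

    public static string Fix(string markup)
    {
        // Only rewrite text inside opening tags, never element content.
        return OpeningTag.Replace(
            markup,
            tag => AttributePattern.Replace(tag.Value, QuoteAttribute));
    }

    private static string QuoteAttribute(Match m)
    {
        // Already double-quoted: leave untouched.
        if (m.Groups["dq"].Success) return m.Value;

        string value = m.Groups["sq"].Success
            ? m.Groups["sq"].Value
            : m.Groups["bare"].Value;

        // A single-quoted value may legally contain a double quote; escape it.
        return m.Groups["name"].Value + "=\"" + value.Replace("\"", "&quot;") + "\"";
    }
}
```

Calling `AttributeQuoteFixer.Fix(xml)` on the whole string leaves the already well-formed `Block` and `Property` tags unchanged, since their attributes are double-quoted. Unclosed tags and other structural problems are not addressed by this approach.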

My solution:

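    using System.IO;
    using System.Xml;

    // SgmlReader parses the tolerant HTML input and exposes it as an XmlReader.
    var reader = new StringReader(xml);
    var sgmlReader = new Sgml.SgmlReader
    {
        DocType = "HTML",
        WhitespaceHandling = WhitespaceHandling.All,
        CaseFolding = Sgml.CaseFolding.ToLower,
        InputStream = reader
    };

    // Loading through the SgmlReader yields a well-formed XmlDocument.
    var doc = new XmlDocument { PreserveWhitespace = true, XmlResolver = null };
    doc.Load(sgmlReader);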
mp3duck
  • See: http://stackoverflow.com/a/1732454/2424 – NotMe Nov 21 '13 at 16:26
  • Side note: [XML can't be valid HTML](http://mitchfincher.blogspot.com/2011/12/html5-is-not-xml-time-to-get-over-it.html) - be sure to test that all cases you care about are handled properly. I.e. some tags should not be closed like `BR`, `IMG` also most browsers will ignore self-closing for these tags. – Alexei Levenkov Nov 21 '13 at 17:12

2 Answers

3

I've used https://github.com/MindTouch/SGMLReader in the past to solve a similar issue. Worked like a charm (YMMV).

Mark
  • That totally worked for me! Thanks so much. As it happened, SGMLReader was even built into my CMS (EPiServer). My fixing code is posted above. – mp3duck Nov 21 '13 at 16:34
2

You may try Html Agility Pack. Quoting the parts that may interest you:

'The parser is very tolerant with "real world" malformed HTML'

and

'Sample applications: Page fixing or generation'

So there you go. Load the XML, generate a 'proper' render, pass it along.
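The "load, then render" step might look roughly like the sketch below. It assumes the HtmlAgilityPack NuGet package is referenced and that only the embedded HTML fragment (not the whole `Block` document) is passed in. The class and method names are made up for illustration, and the exact serialized output (tag case, how `SCRIPT` content is written) should be checked against real data.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

// Hypothetical helper (not part of Html Agility Pack itself).
public static class EmbedCodeFixer
{
    public static string ToXmlFragment(string html)
    {
        var doc = new HtmlDocument
        {
            // Ask the serializer to emit XML-style output (quoted attributes, closed tags).
            OptionOutputAsXml = true
        };

        // The parser tolerates unquoted and single-quoted attribute values.
        doc.LoadHtml(html);

        // Serialize the child nodes only, so no XML declaration is prepended.
        return doc.DocumentNode.WriteContentTo();
    }
}
```

The returned string could then be placed back inside the `DefaultLanguage` or `Translation` element before the outer XML is parsed.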

OnoSendai