
XML snippet:

<field>&amp; is escaped</field>
<field>&quot;also escaped&quot;</field>
<field>is & "not" escaped</field>
<field>is &quot; and is not & escaped</field>

I'm looking for suggestions on how to pre-process XML so that anything not already escaped gets escaped before the XML is run through a parser.

I do not have control over the XML being passed to me, and the sender likely won't fix it anytime soon, so I have to find a way to parse it.

The primary issue is that feeding the XML as-is into a parser, such as the one below, throws an exception because some of the content is not escaped properly:

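string xml = "<field>& is not escaped</field>";
using (var reader = XmlReader.Create(new StringReader(xml)))  // needs System.IO and System.Xml
    while (reader.Read()) { }  // throws XmlException at the bare '&'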
SED
  • Possible duplicate of https://stackoverflow.com/questions/8331119/escape-invalid-xml-characters-in-c-sharp – Steve Jun 19 '17 at 15:21
  • The answer to that question involves "removing" bad chars instead of fixing/escaping them, which would not be ideal. I'm looking for a way to escape the & instead of removing it when it is passed over unescaped. – SED Jun 19 '17 at 15:26
  • Use System.Net.WebUtility.HtmlDecode or System.Net.WebUtility.HtmlEncode. See Wikipedia: https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references. The file looks like it was modified to work with HTML and simply needs to be converted back to XML. – jdweng Jun 19 '17 at 15:29
  • @jdweng don't think that'd help - you'd escape `"` to `&quot;`, which isn't what's wanted. – Charles Mager Jun 19 '17 at 15:33
  • You could try a DGML parser instead, which I've found to be a bit more tolerant of structure/encoding issues than plain XML parsers. I've run into issues in the past where I would get incomplete XML passed back from F5 devices, and the DGML parsers were the only ones apparently capable of figuring out the mistakes and getting around them. Sorry I can't remember the DGML parser I used though, it's been about 5 years. – William Holroyd Jun 19 '17 at 15:38

3 Answers

I'd suggest you use a Regex to replace un-escaped ampersands with their entity equivalent.

This question is helpful as it gives you a Regex to find these rogue ampersands:

&(?!(?:apos|quot|[gl]t|amp);|#)

And you can see that it matches the correct text in this demo. You can use this in a simple replace operation:

var escXml = Regex.Replace(xml, "&(?!(?:apos|quot|[gl]t|amp);|#)", "&amp;");

And then you'll be able to parse your XML.
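For reference, here is a minimal end-to-end sketch of that approach, run against the four fields from the question. The class name `AmpersandFixer`, the method name `EscapeBareAmpersands`, and the wrapping `<root>` element are assumptions made for illustration, and `XDocument.Parse` stands in for whichever parser you actually use. Note that the pattern treats any `&#` as already escaped, so a malformed numeric reference would still make the parser throw.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class AmpersandFixer
{
    // Same pattern as above: an '&' that neither starts one of the five
    // predefined XML entities nor is followed by '#'.
    private static readonly Regex BareAmpersand =
        new Regex("&(?!(?:apos|quot|[gl]t|amp);|#)", RegexOptions.Compiled);

    // Hypothetical helper: returns the input with every bare '&' turned into "&amp;".
    public static string EscapeBareAmpersands(string xml)
    {
        return BareAmpersand.Replace(xml, "&amp;");
    }

    public static void Main()
    {
        // The four fields from the question, wrapped in an assumed <root> element
        // so that the text forms a single document.
        string xml =
            "<root>" +
            "<field>&amp; is escaped</field>" +
            "<field>&quot;also escaped&quot;</field>" +
            "<field>is & \"not\" escaped</field>" +
            "<field>is &quot; and is not & escaped</field>" +
            "</root>";

        // Parsing the raw string would throw XmlException; the fixed string parses.
        XDocument doc = XDocument.Parse(EscapeBareAmpersands(xml));
        foreach (XElement field in doc.Root.Elements("field"))
        {
            Console.WriteLine(field.Value);
        }
    }
}
```

Each printed value is the decoded text of one field, so the already-escaped and the repaired ampersands both come out as a plain `&`.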

Charles Mager

Preprocess the textual data (not really XML) with HTML Tidy with quote-ampersand set to true.

kjhughes

If you want to parse something that isn't XML, you first need to decide exactly what this language is and what you intend to do with it: when you've written a grammar for the non-XML language that you intend to process, you can then decide whether it's possible to handle it by preprocessing or whether you need a full-blown parser.

For example, if you only need to handle an unescaped "&" that's followed by a space, and if you don't care about what happens inside comments and CDATA sections, then it's a fairly easy problem. If you don't want to corrupt the contents of comments or CDATA, or if you need to handle things like &nbsp; when there's no definition of &nbsp;, then life starts to become rather more difficult.
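To give a feel for the "rather more difficult" end, here is a rough sketch of a preprocessor that leaves comments and CDATA sections alone and escapes bare ampersands everywhere else. It is only one possible approach under stated assumptions: the class name `MarkupAwareAmpersandFixer` and the method name `Escape` are invented for illustration, comments and CDATA sections are assumed to be properly terminated, and undefined named entities such as &nbsp; are simply escaped (giving `&amp;nbsp;`) rather than resolved. Processing instructions and DOCTYPE declarations are not considered at all.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MarkupAwareAmpersandFixer
{
    // Alternation order matters: a comment or CDATA section is consumed whole
    // and copied through unchanged, so an '&' inside it is never examined.
    // Only outside those regions can the third alternative match a bare '&'.
    private static readonly Regex Token = new Regex(
        @"<!--.*?-->|<!\[CDATA\[.*?\]\]>|&(?!(?:apos|quot|[gl]t|amp);|#)",
        RegexOptions.Singleline | RegexOptions.Compiled);

    // Hypothetical helper: escapes bare ampersands outside comments and CDATA.
    public static string Escape(string text)
    {
        return Token.Replace(text, m => m.Value == "&" ? "&amp;" : m.Value);
    }

    public static void Main()
    {
        string input = "<a><!-- R&D --><![CDATA[x & y]]><b>p & q</b></a>";

        // Only the ampersand in "p & q" is rewritten.
        Console.WriteLine(Escape(input));
    }
}
```

Even this small sketch has to make decisions about what the input language is, which is the point of the paragraph above: every extra case you want to tolerate is another rule in your informal grammar.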

Of course, you and your supplier could save yourselves a great deal of time and expense if you wrote software that conformed to standards. That's what standards are for.

Michael Kay