XML: how to pre-parse when only SOME data is escaped?

Question

XML snippet:

<field>&amp; is escaped</field>
<field>&quot;also escaped&quot;</field>
<field>is & "not" escaped</field>
<field>is &quot; and is not & escaped</field>

I'm looking for suggestions on how to pre-process the XML so that anything not already escaped gets escaped before I run it through a parser.

I do not have control over the XML being passed to me, the supplier likely won't fix it anytime soon, and I have to find a way to parse it.

The primary issue is that passing the XML as-is to a parser, such as the one below, throws an exception because some of it is not escaped properly:

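string xml = "<field>& is not escaped</field>";
// XmlReader is lazy: the XmlException surfaces on Read(), not on Create().
using (var reader = XmlReader.Create(new StringReader(xml)))
    while (reader.Read()) { }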

Possible duplicate of https://stackoverflow.com/questions/8331119/escape-invalid-xml-characters-in-c-sharp — Steve, Jun 19 '17 at 15:21
the answer to that question involves "removing" bad chars instead of fixing/escaping them, which would not be ideal. Looking for the ability to escape the & instead of removing the & when it is passed over as unescaped — SED, Jun 19 '17 at 15:26
Use System.Net.WebUtility.HtmlDecode or System.Net.WebUtility.HtmlEncode. See wiki : https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references. The file looks like it was modified to work with html and simply needs to be converted back to xml. — jdweng, Jun 19 '17 at 15:29
@jdweng don't think that'd help - you'd escape `&quot;` to `&amp;quot;`, which isn't what's wanted. — Charles Mager, Jun 19 '17 at 15:33
You could try a DGML parser instead, which I've found to be a bit more tolerant of structure/encoding issues than plain XML parsers. I've run into issues in the past where I would get incomplete XML passed back from F5 devices and the DGML parsers were the only ones apparently capable of figuring out the mistakes and getting around them. Sorry I can't remember the DGML parser I used though, it's been about 5 years. — William Holroyd, Jun 19 '17 at 15:38

score 3 · Accepted Answer · answered Jun 19 '17 at 15:44

I'd suggest you use a Regex to replace un-escaped ampersands with their entity equivalent.

A related question is helpful, as it gives a Regex to find these rogue ampersands:

&(?!(?:apos|quot|[gl]t|amp);|#)

It matches only the bare ampersands in the sample above and leaves `&amp;` and `&quot;` alone. You can use it in a simple replace operation:

var escXml = Regex.Replace(xml, "&(?!(?:apos|quot|[gl]t|amp);|#)", "&amp;");

And then you'll be able to parse your XML.
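As a rough end-to-end sketch of that approach, the replace and the parse can be wrapped together. The class and method names here (`AmpersandFixer`, `EscapeBareAmpersands`, `ParseLenient`) are made up for illustration. The pattern only knows the five predefined XML entities, so any other named entity such as `&nbsp;` ends up as literal text, and anything starting with `&#` is assumed to be a valid character reference.

```csharp
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
static class AmpersandFixer
{
    // Matches '&' unless it starts one of the five predefined XML entities
    // (&apos; &quot; &gt; &lt; &amp;) or looks like a character reference ("&#...").
    private static readonly Regex BareAmpersand =
        new Regex("&(?!(?:apos|quot|[gl]t|amp);|#)", RegexOptions.Compiled);

    // Returns the input with every bare '&' rewritten as "&amp;".
    public static string EscapeBareAmpersands(string xml) =>
        BareAmpersand.Replace(xml, "&amp;");

    // Escapes first, then hands the result to a normal XML parser.
    // Still throws XmlException if the input is malformed in some other way.
    public static XDocument ParseLenient(string xml) =>
        XDocument.Parse(EscapeBareAmpersands(xml));
}
```

For example, `AmpersandFixer.EscapeBareAmpersands("<field>is & \"not\" escaped</field>")` returns `<field>is &amp; "not" escaped</field>`, while `<field>&amp; is escaped</field>` passes through unchanged.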

score 0 · Answer 2 · answered Jun 19 '17 at 15:47

Preprocess the textual data (not really XML) with HTML Tidy with quote-ampersand set to true.

kjhughes

score 0 · Answer 3 · answered Jun 19 '17 at 15:47

If you want to parse something that isn't XML, you first need to decide exactly what this language is and what you intend to do with it: when you've written a grammar for the non-XML language that you intend to process, you can then decide whether it's possible to handle it by preprocessing or whether you need a full-blown parser.

For example, if you only need to handle an unescaped "&" that's followed by a space, and if you don't care about what happens inside comments and CDATA sections, then it's a fairly easy problem. If you don't want to corrupt the contents of comments or CDATA, or if you need to handle things like &nbsp; when there's no definition of &nbsp;, then life starts to become rather more difficult.
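To illustrate the middle ground, here is a minimal sketch of a preprocessor that leaves comments and CDATA sections untouched while escaping bare ampersands elsewhere. It assumes the same "five predefined entities plus `&#`" rule as a simple regex replace, the name `SafeAmpersandFixer` is hypothetical, and it does nothing about processing instructions or undefined entities such as `&nbsp;` (which would be escaped to literal text).

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
static class SafeAmpersandFixer
{
    // Alternation order matters: comments and CDATA sections are consumed whole
    // first, so an '&' inside them is never reached by the last branch.
    private static readonly Regex Token = new Regex(
        @"<!--.*?-->|<!\[CDATA\[.*?\]\]>|&(?!(?:apos|quot|[gl]t|amp);|#)",
        RegexOptions.Singleline | RegexOptions.Compiled);

    // Bare ampersands become "&amp;"; comment and CDATA matches are copied back verbatim.
    public static string Escape(string text) =>
        Token.Replace(text, m => m.Value == "&" ? "&amp;" : m.Value);
}
```

This is still pattern matching rather than a parser for a defined grammar, so it only covers the cases listed; an unterminated comment, for instance, is treated as ordinary text.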

Of course, you and your supplier could save yourselves a great deal of time and expense if you wrote software that conformed to standards. That's what standards are for.
