Parse XML with ampersand (&) and <, > with c# on .net framework (can't control source)

Question

We have a series of XML data files coming from mainframe programs. These are parsed by .NET processes downstream. Some of the inner text fields contain characters like & that need to be escaped.

Unfortunately, we can't actually fix all the programs. When a bad &, >, or < comes in, the current fix is to ask the users to edit the mainframe data and spell out the characters! So solutions like this answer won't work.

Some of the programs escape their XML properly, e.g., they replace & in the data with &amp; and escape < and > (as &lt; and &gt;) as well. So solutions like this answer won't work either!

One thing I could do is to write a preprocessor that follows rules like this:

amp Strategy:

Consider an & followed by characters and ; with no spaces an escape sequence

Test it in a separate XML DOM

If it works leave it

If it doesn't, escape the & with &amp;
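
A minimal C# sketch of the amp strategy above, not a tested production implementation. The class and method names (`AmpersandFixer`, `EscapeStrayAmpersands`, `ParsesAsReference`) are made up for illustration, and the regex is one assumed reading of "an & followed by characters and ; with no spaces". Each candidate is loaded into a throwaway `XmlDocument`; if that throws, the & is escaped.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

// Hypothetical helper, not part of any library.
public static class AmpersandFixer
{
    // '&' optionally followed by a run of characters (no whitespace, '&', ';', '<', '>') and a ';'.
    private static readonly Regex Candidate = new Regex(@"&([^\s&;<>]+;)?");

    public static string EscapeStrayAmpersands(string raw)
    {
        // Keep the match only if it looks like a reference AND a real parser accepts it.
        return Candidate.Replace(raw, m =>
            m.Groups[1].Success && ParsesAsReference(m.Value)
                ? m.Value
                : "&amp;" + m.Groups[1].Value);
    }

    // "Test it in a separate XML DOM": wrap the candidate in a dummy element and try to load it.
    private static bool ParsesAsReference(string candidate)
    {
        try
        {
            new XmlDocument().LoadXml("<t>" + candidate + "</t>");
            return true;
        }
        catch (XmlException)
        {
            return false; // undeclared entity or invalid character reference
        }
    }
}
```

With this sketch, `you & i` and `you &zz; i` come out as `you &amp; i` and `you &amp;zz; i`, while `you &amp; i` is left alone. Letting the parser decide avoids hard-coding the list of valid references, at the cost of one small parse per candidate.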

gt/lt strategy:

Keep track of the last tag you saw.

If you see > outside of a tag, escape it with &gt;

if you see < outside of a tag, this is a little harder
- read ahead and look for closing >
- if there was no / right before it (no non-space since), add one
- try to parse it in a new DOM (easier than parsing spacing, attributes, being in/out of them, etc.)
- if that errors, escape the < with &lt;
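
The gt/lt strategy above could be sketched in C# roughly as follows. This is an illustration under stated assumptions, not a complete solution: `AngleBracketFixer`, `EscapeStrayAngleBrackets` and `LooksLikeTag` are hypothetical names, anything starting with `<?` or `<!` is passed through untested, and the sketch does not handle a `>` inside attribute values, comments or CDATA. Prefixed names such as `<ns:a>` would also be rejected unless the prefix is declared in the same tag.

```csharp
using System;
using System.Text;
using System.Xml;

// Hypothetical helper, not part of any library.
public static class AngleBracketFixer
{
    public static string EscapeStrayAngleBrackets(string raw)
    {
        var output = new StringBuilder(raw.Length + 16);
        int i = 0;
        while (i < raw.Length)
        {
            char c = raw[i];
            if (c == '>')
            {
                // Accepted tags are copied whole below, so a '>' seen here is outside a tag.
                output.Append("&gt;");
                i++;
            }
            else if (c == '<')
            {
                // Read ahead for the closing '>' and test the span as a tag.
                int close = raw.IndexOf('>', i + 1);
                string candidate = close < 0 ? null : raw.Substring(i, close - i + 1);
                if (candidate != null && LooksLikeTag(candidate))
                {
                    output.Append(candidate);
                    i = close + 1;
                }
                else
                {
                    output.Append("&lt;");
                    i++;
                }
            }
            else
            {
                output.Append(c);
                i++;
            }
        }
        return output.ToString();
    }

    private static bool LooksLikeTag(string candidate)
    {
        // Assumption: declarations, processing instructions and comments are trusted as-is.
        if (candidate.StartsWith("<?", StringComparison.Ordinal) ||
            candidate.StartsWith("<!", StringComparison.Ordinal))
            return true;

        string probe;
        if (candidate.StartsWith("</", StringComparison.Ordinal))
            probe = "<" + candidate.Substring(2, candidate.Length - 3) + "/>"; // closing tag: test the name
        else if (candidate.EndsWith("/>", StringComparison.Ordinal))
            probe = candidate;                                                 // already self-closing
        else
            probe = candidate.Substring(0, candidate.Length - 1) + "/>";       // "add one" slash

        try
        {
            new XmlDocument().LoadXml(probe); // parse it in a new DOM
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

For example, `<a>1 < 2 and 3 > 2</a>` becomes `<a>1 &lt; 2 and 3 &gt; 2</a>`, because `< 2 and 3 >` does not parse as an element. Text that happens to look like a tag, such as `x <y> z`, would still be kept as markup.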

Don't get me wrong, implementing my preprocessor would be a fun coding experience, but I am very busy, and regression testing, fixing what I forgot, and maturing it would blow our budget for this project.

Fortunately, we see that in modern HTML, this is already implemented. As Mark comments in one of my linked answers, "HTML5 allows you to leave it unescaped, but only when the data that follows does not look like a valid character reference". So:

<html> you & i </html> -> you & i

but

<html> you &amp; i </html> -> you & i

and even

<html> you &zz; i </html> -> you &zz; i

So my question is: do any of the XML parsers in the .NET Framework (or .NET Core or .NET Standard, for that matter) allow turning on this behaviour, that is, obey existing valid escape sequences, but if a sequence isn't valid, allow it as a literal?

Sincere thanks for any help :-)

Well, I'm not sure it would work since it's meant for parsing HTML, but have you tried HtmlAgilityPack, since XML is similar to well-behaved HTML? [This may also help](https://stackoverflow.com/questions/24591206/xml-parsing-with-htmlagilitypack) if you go down that path — Magnetron, Jun 07 '21 at 17:01
Use System.Net.WebUtility.HtmlDecode(string) and System.Net.WebUtility.HtmlEncode(string) — jdweng, Jun 07 '21 at 17:21
I second the HtmlAgilityPack approach, it does a great job on malformed mark-up. — William Walseth, Jun 07 '21 at 17:21
First thing to be clear on: you don't have "a series of XML files" as stated in your first sentence, and you aren't trying "to parse XML" as stated in your question. You are trying to process a non-XML format, which means you are making your life very difficult. Frankly, it's easier to design your own custom format and write your own parser than to use bastardised non-conformant XML. Standards are great when everyone conforms to them, they are no use at all when they don't. — Michael Kay, Jun 07 '21 at 17:47
CData and html(en/de)code don't work as we don't have access to the source producing the errant xml. — FastAl, Jun 07 '21 at 18:48
@MichaelKay I've always preached and agreed with exactly what you say, unfortunately those realities don't change the way the problem came about and don't move towards the solution. admittedly from the cobol programs, some do `inspect ws-name replacing all '&' with '&amp;'` ... and then everything else ... it adds a lot of cost to the cobol programs so a custom standard would have been great (and then convert that to conformant XML). Unfortunately, touching 50 nontrivial legacy programs isn't in our budget let alone modernizing and validating them! Oh well. It's worth griping about anyway! — FastAl, Jun 07 '21 at 18:58
Yes, that's the joy of answering questions on SO, you can tell people how to solve the problem in an ideal world. Having said that, I do believe that a lot of people dig themselves deeper and deeper into holes in an attempt to avoid costs, and thereby just pile up further costs downstream. — Michael Kay, Jun 07 '21 at 20:45
