We have a series of XML data files coming from mainframe programs. These are parsed by .NET processes downstream. Some of the inner text fields contain characters like & that need to be escaped.
Unfortunately, we can't actually fix all the programs. When a bad &, >, or < comes in, the fix is to ask the users to edit the mainframe data and spell out the characters! So solutions like this answer won't work.
Some of the programs do escape their XML properly, e.g., they will replace & in the data with &amp; and escape the < and > as well. So solutions like this answer won't work either!
One thing I could do is to write a preprocessor that follows rules like this:
amp strategy:
Treat an & followed by characters and a ; with no spaces in between as a candidate escape sequence.
Test it in a separate XML DOM.
If it parses, leave it.
If it doesn't, escape the & with &amp;.
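To make the amp rules concrete, here is a minimal sketch. It is in Python only for brevity (the real thing would live in .NET), the names `escape_stray_ampersands` and `is_valid_reference` are made up for illustration, and I am assuming a bare & with no `;` sequence after it should be escaped too. It does not know about CDATA sections or comments, so it is nowhere near the mature version.

```python
import re
import xml.etree.ElementTree as ET

# An '&', optionally followed by a run of characters (no spaces) ending in ';'.
CANDIDATE = re.compile(r"&([^\s&;<]*;)?")


def is_valid_reference(ref: str) -> bool:
    """Test the candidate in a throwaway document, as the strategy describes."""
    try:
        ET.fromstring(f"<t>{ref}</t>")
        return True
    except ET.ParseError:
        # Undefined entities and malformed references both end up here.
        return False


def escape_stray_ampersands(text: str) -> str:
    """Leave references the XML parser accepts; escape every other '&'."""

    def fix(match: re.Match) -> str:
        ref = match.group(0)
        if match.group(1) and is_valid_reference(ref):
            return ref  # e.g. &amp; &lt; &#65;
        return "&amp;" + ref[1:]  # assumption: a bare '&' is escaped as well

    return CANDIDATE.sub(fix, text)


if __name__ == "__main__":
    for sample in ["you & i", "you &amp; i", "you &zz; i"]:
        print(sample, "->", escape_stray_ampersands(sample))
```

Note that this escapes `&zz;` because XML has no such entity, which is the behaviour I want for XML even though HTML would show it literally without rewriting it.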
gt/lt strategy:
Keep track of the last tag you saw.
If you see > outside of a tag, escape it with &gt;.
If you see < outside of a tag, this is a little harder:
Read ahead and look for the closing >.
If there is no / right before it (ignoring whitespace), add one.
Try to parse the result in a new DOM (easier than handling spacing, attributes, being in or out of them, etc.).
If that errors, escape the < with &lt;.
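A rough sketch of the gt/lt rules, again in Python purely for brevity and with hypothetical names (`escape_stray_angles`, `looks_like_tag`). It makes several assumptions I have not thought through: a < with no closing > is escaped, declarations and comments are passed through untouched, and anything the throwaway parse rejects is treated as text. Known gaps include a > inside an attribute value, namespace prefixes (the isolated test parse sees them as unbound), and CDATA.

```python
import re
import xml.etree.ElementTree as ET

# Anything of the form '<...>' with no nested angle brackets is a tag candidate.
TAG_LIKE = re.compile(r"<[^<>]*>")


def escape_text(s: str) -> str:
    """Escape angle brackets that turned out to be plain text."""
    return s.replace("<", "&lt;").replace(">", "&gt;")


def looks_like_tag(candidate: str) -> bool:
    """Force the candidate to be self-closing and try it in a new parse."""
    inner = candidate[1:-1].strip()
    if inner.startswith(("?", "!")):
        return True  # assumption: declarations and comments are left alone
    if inner.startswith("/"):
        inner = inner[1:]  # closing tag: test the name as if it were opening
    inner = inner.rstrip("/").rstrip()
    try:
        ET.fromstring(f"<{inner}/>")
        return True
    except ET.ParseError:
        return False


def escape_stray_angles(text: str) -> str:
    out = []
    pos = 0
    for match in TAG_LIKE.finditer(text):
        # Everything between candidates is outside a tag, so escape it.
        out.append(escape_text(text[pos:match.start()]))
        tag = match.group(0)
        out.append(tag if looks_like_tag(tag) else escape_text(tag))
        pos = match.end()
    out.append(escape_text(text[pos:]))
    return "".join(out)


if __name__ == "__main__":
    print(escape_stray_angles("<a>1 < 2 and 3 > 2</a>"))
```

Even this toy version shows why I am worried about regression testing: every gap listed above is a case the real preprocessor would have to get right.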
Don't get me wrong, implementing my preprocessor would be a fun coding experience, but I am very busy, and regression testing, fixing what I forgot, and maturing it would blow our budget for this project.
Fortunately, we see that in modern HTML, this is already implemented. As Mark comments in one of my linked answers, "HTML5 allows you to leave it unescaped, but only when the data that follows does not look like a valid character reference". So:
<html> you & i </html>
-> you & i
but
<html> you &amp; i </html>
-> you & i
and even
<html> you &zz; i </html>
-> you &zz; i
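The same lenient rule can be seen outside a browser. As an illustration only (it is Python rather than .NET, and it decodes text rather than parsing XML), the standard library's `html.unescape` is documented as following the HTML5 rules for both valid and invalid character references, and it reproduces the three cases above:

```python
import html

# The three inputs from the examples above.
samples = ["you & i", "you &amp; i", "you &zz; i"]

# html.unescape decodes valid references and leaves everything else alone.
decoded = [html.unescape(s) for s in samples]

for raw, out in zip(samples, decoded):
    print(f"{raw!r} -> {out!r}")
```

This is the behaviour I would like from an XML parser: a recognised reference is decoded, and anything else is kept as literal text.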
So my question is: do any of the XML parsers in the .NET Framework (or .NET Core or .NET Standard, for that matter) allow turning on this behaviour, that is, honour existing valid escape sequences, but treat anything that isn't a valid one as a literal?
Sincere thanks for any help :-)