
(TLDR at the bottom)

We have a legacy system that has implemented its own XML reader/writer. The problem is that it allows a literal "&" inside a property value.

<SB nae="Name" net="HV & DD"/>

When I read the data using the XDocument.Parse() method, parsing fails, of course, because the input is not well-formed XML. I am looking at ways of sanitizing the data before parsing it.

I am attempting to use regex to identify cases where this is happening. To illustrate, consider this:

&(?!amp\;)

This matches an ampersand, using a negative lookahead to ensure it is not already a correctly escaped ampersand (&amp;). Once I have identified these cases, I can replace each one with a proper &amp;

Of course, the problem is that this will also match other escaped characters such as &gt; &lt; &quot; etc., so I need to exclude those as well. Maybe by using a more general form, like a regex that skips an ampersand followed by 2-4 characters and then a semicolon.
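
To make that problem concrete, here is a minimal sketch of what the naive pattern does to input that already contains a valid entity. It is written in Python purely for illustration (the lookahead behaves the same way in .NET regular expressions), and the sample string is an assumption extended from the element above, not real data from the legacy system.

```python
import re

# Naive pattern from above: any "&" that is not the start of "&amp;".
naive = re.compile(r"&(?!amp;)")

# Hypothetical sample: one bare "&", one valid "&lt;", one valid "&amp;".
sample = '<SB nae="Name" net="HV & DD &lt; X &amp; Y"/>'

result = naive.sub("&amp;", sample)
print(result)
# The bare "&" is fixed, but the valid "&lt;" is damaged into "&amp;lt;".
```

The existing &amp; is left alone, but &lt; is double-escaped, which is exactly the kind of damage to otherwise valid content I want to avoid.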

But my worry is that there are other legitimate uses of ampersands that I am not thinking of and that are not represented in the few samples I have. I am looking for a safe approach that will not mess up well-formed XML.

TLDR: How do I identify ampersands that are not part of proper xml, but are cases of unescaped ampersands in property values?

Tormod
  • Could you please provide some more examples of actual xml and the expected/desired matches for each? – jjspace Oct 23 '18 at 14:08
  • How much legacy is "legacy"? If ever possible I'd repair the source rather than fix the product. – Fildor Oct 23 '18 at 14:09
  • See also [How to parse invalid (bad / not well-formed) XML?](https://stackoverflow.com/q/44765194/290085). – kjhughes Oct 23 '18 at 18:03

1 Answer


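You can replace every match of the following regex pattern with &amp;. It matches an ampersand that is not followed by a decimal character reference, a hexadecimal character reference, or a named entity ending in a semicolon: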

&(?!(?:#\d+|#x[0-9a-f]+|\w+);)

Demo: https://regex101.com/r/3MTLY9/2
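
As a rough sketch of how this could be applied before parsing, the snippet below runs the pattern over a sample string and then parses the result. It is in Python for illustration only; the helper name `escape_bare_ampersands` and the sample input are hypothetical. The case-insensitive flag is an assumption added here so that uppercase hex digits such as `&#x2A;` are recognized, since the pattern as written lists only `a-f`.

```python
import re
import xml.etree.ElementTree as ET

# Pattern from the answer; IGNORECASE is an added assumption so that
# hex references with uppercase digits (e.g. "&#x2A;") are left alone.
BARE_AMP = re.compile(r"&(?!(?:#\d+|#x[0-9a-f]+|\w+);)", re.IGNORECASE)

def escape_bare_ampersands(text):
    """Replace each '&' that does not start an entity-like token with '&amp;'."""
    return BARE_AMP.sub("&amp;", text)

# Hypothetical input mixing bare ampersands with valid references.
raw = '<SB nae="Name" net="HV & DD &lt; &#38; &#x2A; &amp; R&D"/>'

fixed = escape_bare_ampersands(raw)
print(fixed)

# The sanitized string should now be well-formed and parse cleanly.
elem = ET.fromstring(fixed)
net_value = elem.get("net")
print(net_value)
```

Two caveats worth keeping in mind: text that merely looks like an entity (a word followed by a semicolon right after the ampersand) is left untouched and can still fail to parse if the entity is undefined, and a raw ampersand inside a CDATA section is legal XML that this replacement would nonetheless rewrite.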

blhsing
  • Very nice. Added, with credit, to my [canonical dealing with bad XML QA](https://stackoverflow.com/q/44765194/290085). Thank you. – kjhughes Oct 23 '18 at 18:04