What is the best way of removing rogue ampersands in XML?

Question

(TLDR at the bottom)

We have a legacy system that has implemented its own XML reader/writer. The problem is that it allows a literal "&" inside a property value.

<SB nae="Name" net="HV & DD"/>

When I read the data using the XDocument.Parse() method, this fails, of course. I am looking at ways of sanitizing the data.

I am attempting to use regex to identify cases where this is happening. To illustrate, consider this:

&(?!amp\;)

This matches an ampersand, with a negative lookahead to ensure it isn't already a correctly escaped ampersand. Once I have identified these cases, I can substitute a proper &amp;.

Of course, the problem is that this will also match other escaped characters such as &gt; &lt; &quot; etc., so I need to exclude those as well. Maybe a more general form would work, such as a regex that skips an ampersand followed by 2-4 characters and then a semicolon.
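To make that problem concrete, here is a minimal Python sketch. The pattern is the one shown above (without the redundant backslash before the semicolon); the sample string is made up for illustration and is not taken from the legacy system.

```python
import re

# Naive pattern from the question: any "&" that is not followed by "amp;".
NAIVE = re.compile(r"&(?!amp;)")

# Hypothetical sample: one rogue ampersand plus one entity that is already valid.
sample = '<SB net="HV & DD &lt; 5"/>'

naive_result = NAIVE.sub("&amp;", sample)
print(naive_result)
# -> <SB net="HV &amp; DD &amp;lt; 5"/>
```

The rogue ampersand is fixed, but the valid &lt; is turned into &amp;lt;, which would be read back as the literal text "&lt;" rather than "<".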

But my worry is that there are other cases for ampersands that I am not thinking of and that are not represented in the few samples I have. I am looking for a safe way that will not mess up proper XML.

TLDR: How do I identify ampersands that are not part of proper XML, but are cases of unescaped ampersands in property values?

Could you please provide some more examples of actual xml and the expected/desired matches for each? — jjspace, Oct 23 '18 at 14:08
How much legacy is "legacy"? If ever possible I'd repair the source rather than fix the product. — Fildor, Oct 23 '18 at 14:09
See also [How to parse invalid (bad / not well-formed) XML?](https://stackoverflow.com/q/44765194/290085). — kjhughes, Oct 23 '18 at 18:03

score 2 · Accepted Answer · answered Oct 23 '18 at 14:08

You can replace every match of the following regex pattern with &amp;:

&(?!(?:#\d+|#x[0-9a-f]+|\w+);)
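As a rough illustration, the substitution could be applied like this. This is a Python sketch rather than the .NET code from the question; the helper name `escape_rogue_ampersands` and the extra `note` attribute are hypothetical. The `re.IGNORECASE` flag is an assumption added here so that uppercase hex digits (for example &#x2A;) are also left alone; the pattern as written only lists lowercase `a-f`.

```python
import re
import xml.etree.ElementTree as ET

# Pattern from the answer: "&" not followed by a numeric, hex, or named
# reference that ends in ";". IGNORECASE is an assumption (see lead-in).
ROGUE_AMP = re.compile(r"&(?!(?:#\d+|#x[0-9a-f]+|\w+);)", re.IGNORECASE)


def escape_rogue_ampersands(text: str) -> str:
    """Escape bare ampersands; leave existing references untouched."""
    return ROGUE_AMP.sub("&amp;", text)


# Sample from the question, plus a hypothetical attribute holding valid references.
raw = '<SB nae="Name" net="HV & DD" note="a &lt; b &#38; c &#x26; d"/>'

fixed = escape_rogue_ampersands(raw)
element = ET.fromstring(fixed)  # raises ParseError if given the raw string

print(fixed)
print(element.get("net"))
```

Note that `\w+;` accepts any name, so a reference to an undefined entity (say &nbsp; in a document without a DTD) is left as it is and would still be rejected by the parser.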

Demo: https://regex101.com/r/3MTLY9/2
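On the question's worry about cases the samples do not show: one case a blanket substitution would get wrong is an ampersand inside a CDATA section or a comment, where it is literal text and must not be escaped. Whether the legacy files contain either is unknown, so the following Python sketch is only a possible precaution. The names `PROTECTED` and `escape_outside_protected` are hypothetical, and `re.IGNORECASE` is an added assumption to cover uppercase hex digits.

```python
import re

# Same pattern as in the answer; IGNORECASE is an assumption.
ROGUE_AMP = re.compile(r"&(?!(?:#\d+|#x[0-9a-f]+|\w+);)", re.IGNORECASE)

# CDATA sections and comments, where "&" is literal and must stay as it is.
PROTECTED = re.compile(r"(<!\[CDATA\[.*?\]\]>|<!--.*?-->)", re.DOTALL)


def escape_outside_protected(text: str) -> str:
    """Escape rogue ampersands everywhere except in CDATA and comments."""
    # With one capturing group, re.split returns the protected spans
    # at the odd indexes and the text between them at the even indexes.
    parts = PROTECTED.split(text)
    return "".join(
        part if index % 2 else ROGUE_AMP.sub("&amp;", part)
        for index, part in enumerate(parts)
    )


doc = '<a b="x & y"><![CDATA[p & q]]><!-- r & s --></a>'
print(escape_outside_protected(doc))
# -> <a b="x &amp; y"><![CDATA[p & q]]><!-- r & s --></a>
```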

answered Oct 23 '18 at 14:08

blhsing


Very nice. Added, with credit, to my [canonical dealing with bad XML QA](https://stackoverflow.com/q/44765194/290085). Thank you. – kjhughes Oct 23 '18 at 18:04
