I need to replace the XML special characters >, < and &, but only when they appear inside single quotes. This matters because the regex pattern must not match the < and > that open and close tags.

For example, given the string <Element><Element value="'hello&stack<overflow>'"/></Element>

I should only get the >, < and & that are within the single quotes ('), so that I can replace them with the proper &amp;, &lt; and &gt;. (Long story: it's the result of some muddled XML parsing.)

I know I can use '(.*)' to get all the characters between the single quotes, but how can I extract only the special characters within that match?

MH175

2 Answers

You may match a tag name together with all subsequent attribute names and values, and replace &, < and > only inside the values (or inside the names as well, depending on how messy your data is).

This can be done with a match evaluator passed to Regex.Replace:

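using System.Text.RegularExpressions;

var s = "<Element><Element value=\"'hello&stack<overflow>'\" value=\"'hi&stack<over flow2 >'\"/></Element>";
// Group 1: tag name (or the end of the previous match) plus attribute name and '='; group 2: the double-quoted value.
var rx = @"((?:<[a-zA-Z][\w:-]*|\G(?!\A))\s+[^\s=<]*=)(""[^""]*"")";
// Escape & first so the entities added for < and > are not escaped again.
var clean = Regex.Replace(s, rx, m =>
    string.Format("{0}{1}", m.Groups[1].Value, m.Groups[2].Value.Replace("&", "&amp;").Replace("<", "&lt;").Replace(">", "&gt;"))
);
// => <Element><Element value="'hello&amp;stack&lt;overflow&gt;'" value="'hi&amp;stack&lt;over flow2 &gt;'"/></Element>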

Pattern details:

  • ((?:<[a-zA-Z][\w:-]*|\G(?!\A))\s+[^\s=<]*=) - Group 1:
    • (?:<[a-zA-Z][\w:-]*|\G(?!\A)) - either <, an ASCII letter, then 0+ word, : or - chars (see <[a-zA-Z][\w:-]*), OR (|) the end of the previous successful match (see \G(?!\A))
    • \s+ - 1+ whitespaces
    • [^\s=<]*= - 0+ chars other than whitespace, = and < (the attribute name), followed by a literal =
  • ("[^"]*") - Group 2:
    • "[^"]*" - a ", 0+ chars other than " and then a "
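One caveat with the plain Replace("&", "&amp;") call is that a value which already contains a proper entity such as &amp; would be escaped a second time. The following is a minimal sketch, not part of the answer's code, that keeps the same pattern but only escapes ampersands that do not already start an entity. The names AttributeValueEscaper, Clean and EscapeValue are hypothetical, and the list of recognised entities is an assumption.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AttributeValueEscaper
{
    // Same pattern as above: group 1 is the tag name (or the end of the
    // previous match) plus the attribute name and '=', group 2 is the
    // double-quoted attribute value.
    private static readonly Regex AttributeRx = new Regex(
        @"((?:<[a-zA-Z][\w:-]*|\G(?!\A))\s+[^\s=<]*=)(""[^""]*"")");

    // Assumption: only the five predefined XML entities and numeric character
    // references count as "already escaped"; any other '&' is treated as bare.
    private static readonly Regex BareAmpersandRx = new Regex(
        @"&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9a-fA-F]+);)");

    public static string Clean(string xml)
    {
        return AttributeRx.Replace(xml, m =>
            m.Groups[1].Value + EscapeValue(m.Groups[2].Value));
    }

    private static string EscapeValue(string quotedValue)
    {
        // Ampersands go first so the entities added below are not re-escaped.
        string escaped = BareAmpersandRx.Replace(quotedValue, "&amp;");
        return escaped.Replace("<", "&lt;").Replace(">", "&gt;");
    }
}
```

With this variant, running Clean on its own output should leave the string unchanged, which helps if the clean-up step may be applied more than once to the same data.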
Wiktor Stribiżew

This works for the given case. If you can share more inputs, we can extend it to cover them as well.

Check this:

(?<!^)(>|<|&)(?=.*')

Demo:

https://regex101.com/r/EgXlcD/2
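
The lookahead (?=.*') only checks that some single quote occurs later on the line, not that the matched character sits between a pair of quotes. The sketch below is a small check of that behaviour in C#; the class and method names are hypothetical and exist only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadCheck
{
    // Returns the index of the first character the lookahead pattern would
    // replace, or -1 when nothing matches.
    public static int FirstMatchIndex(string input)
    {
        Match m = Regex.Match(input, @"(?<!^)(>|<|&)(?=.*')");
        return m.Success ? m.Index : -1;
    }
}
```

For the string from the question this returns 8, the closing > of the first <Element> tag, so tag delimiters that precede a quoted value get replaced as well.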

Mohammad Yusuf
  • Oops, I spoke too soon, I edited the example with a case where it doesn't work. – MH175 Feb 08 '17 at 06:21
  • @MH175 Doing it with regex can be a bit difficult. It can be done very easily with an XML parser. I have no exposure to C#, or I would have done it for you. Check this: http://stackoverflow.com/questions/642293/how-do-i-read-and-parse-an-xml-file-in-c – Mohammad Yusuf Feb 08 '17 at 06:37
  • Unfortunately that's the problem. The parser (XDocument) won't even run until I correct these errors, and throws an exception because it's encountering all these illegal chars. [link](https://msdn.microsoft.com/en-us/library/system.xml.linq.xdocument(v=vs.110).aspx) – MH175 Feb 08 '17 at 06:52
  • Edit: Can be difficult with regex unless you are [Wiktor Stribiżew](http://stackoverflow.com/users/3832970/wiktor-stribi%C5%BCew) – Mohammad Yusuf Feb 08 '17 at 07:48