I was trying to filter invalid characters out of XML. Although I succeeded, along the way I wrote a regex whose behavior I find counter-intuitive.

Please consider the following .NET regex evaluation:

System.Text.RegularExpressions.Regex.Match("Test", @"[\x01-\x08\x0B-\x0C\x0E-\x1F\xD800-\xDFFF\xFFFE-\xFFFF]+").ToString()

My understanding is that this pattern matches all invalid XML characters. According to http://www.w3.org/TR/REC-xml/#NT-Char, these are the valid characters:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

As I understand it, the pattern above covers the complement of that set, i.e. the invalid XML characters. Yet running the statement above produces this result:

"Test"

(i.e. the entire input string). I cannot see why. In particular, it is this portion of the regex that causes the match: \xD800-\xDFFF

Yet that range appears to be excluded from the valid characters by these two groups: [#x20-#xD7FF] | [#xE000-#xFFFD]

So I am at a total loss as to why the statement above produces a match. Can somebody please help me decipher it?

r_honey
  • [Don't use regexes to parse XML](http://stackoverflow.com/questions/8577060/why-is-it-such-a-bad-idea-to-parse-xml-with-regex). There are several parsers for that. – m0skit0 Jan 22 '13 at 19:47
  • Hi there, I completely understand that. I am not trying to parse XML, but rather to clean up what I already have. So let's not go into the background, and concentrate on where I am misinterpreting the regex pattern. – r_honey Jan 22 '13 at 19:50
  • @m0skit0 that's good advice, but the question r_honey is asking is unrelated to parsing XML. – Kevin Brydon Jan 22 '13 at 19:51
  • r_honey it would have been better if you'd asked the question without the whole back-story. – Kevin Brydon Jan 22 '13 at 19:52
  • Hi @KevinBrydon, I thought I had cut the story down enough, but I will certainly take your advice for future questions :) – r_honey Jan 22 '13 at 19:55

1 Answer

Try using \u instead of \x for the four-digit escapes:

System.Text.RegularExpressions.Regex.Match("Test", @"[\x01-\x08\x0B-\x0C\x0E-\x1F\uD800-\uDFFF\uFFFE-\uFFFF]+").ToString();

The way I understand it, your current regex matches the string "Test" because \x consumes only two hex digits, so the character class is essentially read as the following items:

\x01-\x08
\x0B-\x0C
\x0E-\x1F
\xD8
0
0-\xDF
F
F
\xFF
F
E-\xFF
F
F

The range 0-\xDF spans U+0030 through U+00DF, which includes the ASCII digits and letters, so it alone is enough to match all of "Test".
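The difference between the two escapes can be checked directly. The following is a minimal sketch, written with C# top-level statements (which assumes a C# 9 or later compiler; the variable names are illustrative). It runs the question's pattern, the `\u` version, and the `0-\xDF` range on its own against "Test":

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern from the question: \x consumes exactly two hex digits,
// so \xD800-\xDFFF is read as \xD8, '0', the range '0'-\xDF, 'F', 'F'.
string withX = @"[\x01-\x08\x0B-\x0C\x0E-\x1F\xD800-\xDFFF\xFFFE-\xFFFF]+";

// Same pattern with \u, which consumes exactly four hex digits.
string withU = @"[\x01-\x08\x0B-\x0C\x0E-\x1F\uD800-\uDFFF\uFFFE-\uFFFF]+";

// The accidental range on its own: U+0030 ('0') through U+00DF.
string accidental = @"[0-\xDF]+";

string matchX = Regex.Match("Test", withX).Value;
string matchU = Regex.Match("Test", withU).Value;
string matchAccidental = Regex.Match("Test", accidental).Value;

Console.WriteLine($"\\x pattern: \"{matchX}\"");          // "Test"
Console.WriteLine($"\\u pattern: \"{matchU}\"");          // "" (no match)
Console.WriteLine($"[0-\\xDF]:   \"{matchAccidental}\""); // "Test"
```

A failed match returns `Match.Empty`, whose `Value` is the empty string, which is why the `\u` pattern prints empty quotes.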

Kevin Brydon
  • According to the documentation, you must use `\u` when using 4 digits and `\x` when using two: http://msdn.microsoft.com/en-us/library/az24scfc.aspx#character_escapes – JDB Jan 22 '13 at 20:08
  • Hi Kevin, that seems to work. But so does this pattern: @"[^\x09\x0A\x0D\x20-\xD7FF\xE000-\xFFFD\x10000-x10FFFF]". How can this possibly be explained? Also, MSDN says \x is for "exactly" 2 digits and \u for 4, so how would one represent 10FFFF? – r_honey Jan 22 '13 at 20:12
  • @r_honey have a read of this http://msdn.microsoft.com/en-us/library/aa664669%28v=vs.71%29.aspx. hint: use `\U` – Kevin Brydon Jan 22 '13 at 20:26
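
On the 10FFFF question: .NET strings are UTF-16, so the regex engine sees a character above U+FFFF as a surrogate pair rather than as a single six-digit code point. One possible approach, sketched below under that assumption, is to leave properly paired surrogates alone and strip only unpaired ones along with the other invalid code units. The pattern and the helper name `StripInvalidXmlChars` are hypothetical and not taken from the thread. Unlike the question's pattern, this one also treats U+0000 as invalid.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical pattern (not from the thread) for UTF-16 code units that
// are not valid XML 1.0 characters:
//   - C0 controls other than tab, LF, CR (U+0000 is included here)
//   - U+FFFE and U+FFFF
//   - a high surrogate with no low surrogate after it
//   - a low surrogate with no high surrogate before it
// Properly paired surrogates encode U+10000-U+10FFFF and are left alone.
var invalidXmlChars = new Regex(
    @"[\u0000-\u0008\u000B\u000C\u000E-\u001F\uFFFE\uFFFF]" +
    @"|[\uD800-\uDBFF](?![\uDC00-\uDFFF])" +
    @"|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]");

// Hypothetical helper: removes every code unit the pattern flags.
string StripInvalidXmlChars(string input) => invalidXmlChars.Replace(input, "");

string emoji = "\U0001F600";                     // U+1F600 as a valid surrogate pair
string dirty = "Te\u0001st" + emoji + "\uD800!"; // control char plus a lone high surrogate
string clean = StripInvalidXmlChars(dirty);

Console.WriteLine(clean == "Test" + emoji + "!"); // True
```

Note that `\U0001F600` here is a C# string-literal escape that the compiler expands before the regex engine sees anything. It is used only to build the test input, not inside the verbatim pattern strings.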