I was trying to filter invalid characters out of XML. Although I eventually succeeded, along the way I wrote a regex that behaves counter-intuitively to me.
Please consider the following .NET regex evaluation:
System.Text.RegularExpressions.Regex.Match("Test", @"[\x01-\x08\x0B-\x0C\x0E-\x1F\xD800-\xDFFF\xFFFE-\xFFFF]+").ToString()
My understanding is that this pattern matches all invalid XML characters. According to http://www.w3.org/TR/REC-xml/#NT-Char, these are the valid characters:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
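To make my reading of that production concrete, here is a minimal sketch of it as a plain range check. The class and method names (XmlCharCheck, IsXmlChar) are my own, purely for illustration, and the method assumes it is given a full Unicode code point rather than a single UTF-16 code unit:

```csharp
using System;

static class XmlCharCheck
{
    // Direct transcription of the Char production quoted above.
    // Assumption: the argument is a Unicode code point, not a UTF-16 code unit.
    public static bool IsXmlChar(int codePoint)
    {
        return codePoint == 0x9
            || codePoint == 0xA
            || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    static void Main()
    {
        Console.WriteLine(IsXmlChar('T'));    // True: 0x54 lies in [#x20-#xD7FF]
        Console.WriteLine(IsXmlChar(0xD800)); // False: falls between #xD7FF and #xE000
    }
}
```

By this check every character of "Test" is valid, so I would expect a pattern for invalid characters to match none of them.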
As I understand it, the character class above covers the remaining Unicode characters, i.e. the complement of the valid set (the invalid XML characters). However, running the statement above still produces this result:
"Test"
(i.e. the entire input string). I cannot understand why. In particular, this portion of the regex causes the match: \xD800-\xDFFF
To me it appears that exactly this range is left out of the valid characters, since it falls in the gap between these two groups: [#x20-#xD7FF] | [#xE000-#xFFFD]
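For anyone who wants to reproduce this, the portions of the character class can be tested one at a time with something like the sketch below. RegexProbe and FirstMatch are hypothetical helper names of mine; the only library call used is Regex.Match, and each portion is simply wrapped in [ ]+ the same way as in the full pattern:

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexProbe
{
    // Wraps one portion of the character class in [ ]+ and returns the
    // first match against the input (an empty string if nothing matches).
    public static string FirstMatch(string input, string classBody)
    {
        return Regex.Match(input, "[" + classBody + "]+").Value;
    }

    static void Main()
    {
        // The five portions of the original pattern, tested separately.
        string[] portions =
        {
            @"\x01-\x08",
            @"\x0B-\x0C",
            @"\x0E-\x1F",
            @"\xD800-\xDFFF",
            @"\xFFFE-\xFFFF",
        };

        foreach (string portion in portions)
        {
            Console.WriteLine("{0,-16} -> \"{1}\"", portion, FirstMatch("Test", portion));
        }
    }
}
```

Isolating the portions this way is how I narrowed the match down to the part of the pattern mentioned above.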
So I am at a total loss as to why the statement above produces a match. Can somebody please help me decipher it?