Don't understand the regular expression for valid XML charset

Question

As the W3C specification describes, the set of valid characters in XML is limited.

We can recognize invalid characters with the following regular expression:

    /*
     * From xml spec valid chars:
     * #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
     * any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.
     */
    Pattern pattern = Pattern.compile("[^\\x09\\x0A\\x0D\\x20-\\xD7EF\\xE000-\\xFFFD\\x10000-x10FFFF]");

But I don't understand why the expression isn't:

    Pattern pattern = Pattern.compile("[^\\x09\\x0A\\x0D\\x20-\\xD7EF\\xE000-\\xFFFD\\x10000-\\x10FFFF]");

The error message is:

    java.util.regex.PatternSyntaxException: Illegal character range near index 49
    [^\x09\x0A\x0D\x20-\xD7EF\xE000-\xFFFD\x10000-\x10FFFF]

possible duplicate of [removing invalid XML characters from a string in java](http://stackoverflow.com/questions/4237625/removing-invalid-xml-characters-from-a-string-in-java) — Brian Roach, Apr 18 '11 at 06:31
The first expression is definitely wrong. Without the backslash on the last hexadecimal it makes no sense. — Robin Green, Apr 18 '11 at 07:17
By the way: You are lucky that you got an exception at all. It is only because `'\x10'` is smaller than `'0'`. Your pattern compiles into the following: *any character, except `\x09`, `\x0A`, `\x0D`, some character in the range `\x20 - \xD7`, `E`, `F`, `\xE0`, `0`, some character in the range `0 - \xFF`, `F`, `F`, and so on*. This character class contains many duplicates, and the only error is the character range `0-\x10` at the end. — Roland Illig, Apr 18 '11 at 20:11
https://stackoverflow.com/questions/4237625/removing-invalid-xml-characters-from-a-string-in-java/4237934#4237934 — Nupur Garg, Apr 03 '20 at 10:29

score 3 · Accepted Answer · answered Apr 18 '11 at 07:28

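Simple answer: Not every Unicode Code Point can be expressed as a single `char` in Java. A Code Point is identified by a 21-bit number, but a `char` is only 16 bits wide. The Code Points starting with U+10000 are therefore encoded using two `char`s: a High Surrogate followed by a Low Surrogate. Strings and regular expressions work on `char`s, not on Code Points, so you have to translate them yourself. On top of that, `\x` in a Java pattern takes exactly two hexadecimal digits, so `\x10000` is parsed as `\x10` followed by the literal characters `000`, which is why the last range in your pattern is reported as illegal.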
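As a rough illustration, here is a minimal sketch that avoids the two-digit `\x` escape. It assumes Java 7 or later, where `Pattern` also accepts the `\x{h...h}` form for an arbitrary code point; that is a different technique from writing the surrogate pairs by hand. The class name `XmlCharFilter` and its two helper methods are made up for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this illustration.
class XmlCharFilter {

    // Negation of the XML 1.0 Char production. Each bound is written with
    // the brace form \x{...}, which is not limited to two hex digits.
    // Assumption: Java 7 or later.
    private static final Pattern INVALID_XML_CHARS = Pattern.compile(
        "[^\\x{9}\\x{A}\\x{D}\\x{20}-\\x{D7FF}\\x{E000}-\\x{FFFD}\\x{10000}-\\x{10FFFF}]");

    // True if the input contains at least one character XML 1.0 forbids.
    public static boolean hasInvalid(String input) {
        return INVALID_XML_CHARS.matcher(input).find();
    }

    // Returns the input with every forbidden character removed.
    public static String stripInvalid(String input) {
        return INVALID_XML_CHARS.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // U+1F600 lies above U+FFFF, so it occupies two chars in a String.
        String smiley = new String(Character.toChars(0x1F600));
        System.out.println(smiley.length());                    // prints 2
        System.out.println(hasInvalid("ok" + smiley));          // prints false
        System.out.println(stripInvalid("a\u0000b\uFFFEc"));    // prints abc
    }
}
```

An unpaired surrogate such as a lone U+D800 falls outside every listed range, so this sketch treats it as invalid, while a well-formed surrogate pair is matched as one code point in the `\x{10000}-\x{10FFFF}` range.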
