
As the W3C XML specification describes, the set of valid characters for XML is limited.

We can recognize an invalid character with the following regular expression:

    /*
     * From xml spec valid chars:
     * #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
     * any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.
     */
    Pattern pattern = Pattern.compile("[^\\x09\\x0A\\x0D\\x20-\\xD7EF\\xE000-\\xFFFD\\x10000-x10FFFF]");

But I don't understand why the expression isn't written like this, with a backslash before the final `x10FFFF`:

    Pattern pattern = Pattern.compile("[^\\x09\\x0A\\x0D\\x20-\\xD7EF\\xE000-\\xFFFD\\x10000-\\x10FFFF]");

With the backslash, compiling the pattern fails with this error message:

java.util.regex.PatternSyntaxException: Illegal character range near index 49
[^\x09\x0A\x0D\x20-\xD7EF\xE000-\xFFFD\x10000-\x10FFFF]
Raedwald
Villim
  • possible duplicate of [removing invalid XML characters from a string in java](http://stackoverflow.com/questions/4237625/removing-invalid-xml-characters-from-a-string-in-java) – Brian Roach Apr 18 '11 at 06:31
  • The first expression is definitely wrong. Without the backslash on the last hexadecimal it makes no sense. – Robin Green Apr 18 '11 at 07:17
  • By the way: You are lucky that you got an exception at all. It is only because `'\x10'` is smaller than `'0'`. Your pattern compiles into the following: *any character, except `\x09`, `\x0A`, `\x0D`, some character in the range `\x20 - \xD7`, `E`, `F`, `\xE0`, `0`, some character in the range `0 - \xFF`, `F`, `F`, and so on*. This character class contains many duplicates, and the only error is the character range `0-\x10` at the end. – Roland Illig Apr 18 '11 at 20:11
  • https://stackoverflow.com/questions/4237625/removing-invalid-xml-characters-from-a-string-in-java/4237934#4237934 – Nupur Garg Apr 03 '20 at 10:29
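
The parsing behaviour described in the comment above can be checked directly. The following is only a minimal sketch: the class name `HexEscapeDemo` and its two helper methods are made up for illustration. It shows that `\x` in `java.util.regex` consumes exactly two hex digits, and that the second pattern from the question is rejected at compile time.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class HexEscapeDemo {
    // Hypothetical helper: reports whether a regex is rejected by Pattern.compile.
    static boolean failsToCompile(String regex) {
        try {
            Pattern.compile(regex);
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    // "\x41B" is parsed as \x41 ('A') followed by the literal 'B',
    // not as one character with the value 0x41B.
    static boolean twoDigitEscapeMatchesAB() {
        return Pattern.matches("\\x41B", "AB");
    }
}
```

By the same rule, `\xD7EF` is read as `\xD7` followed by the literals `E` and `F`, which is why the character class in the question does not mean what it appears to mean.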

1 Answer


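Simple answer: Not every Unicode code point can be expressed as a single `char` in Java. A code point is identified by a 21-bit number, but a `char` is only 16 bits wide, so the code points from U+10000 upward are encoded using two chars: a high surrogate followed by a low surrogate. On top of that, the `\x` escape in `java.util.regex` reads exactly two hexadecimal digits, so `\x10FFFF` means `\x10` followed by the literals `F`, `F`, `F`, `F`. Strings work on chars, not on code points, so you have to spell the larger values yourself: `\uXXXX` for 16-bit values, and a surrogate pair (or, since Java 7, an escape of the form `\x{10000}`) for the supplementary range.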
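
One way to write the character class so that it matches the ranges in the XML spec is sketched below. It assumes Java 7 or later, where `java.util.regex` accepts `\x{h...h}` for code points above U+FFFF; the class name `XmlInvalidChars` and the method `stripInvalid` are illustrative names, not part of the original question.

```java
import java.util.regex.Pattern;

final class XmlInvalidChars {
    // Negated class of the valid XML 1.0 characters:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    // \\uXXXX is used for 16-bit values because \\x only reads two hex digits;
    // \\x{...} (assumed available, Java 7+) covers the supplementary range.
    static final Pattern INVALID_XML = Pattern.compile(
        "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    // Hypothetical helper: removes every character that XML 1.0 does not allow.
    static String stripInvalid(String s) {
        return INVALID_XML.matcher(s).replaceAll("");
    }
}
```

For example, `XmlInvalidChars.stripInvalid("a\u0001b")` drops the control character, while a supplementary character such as U+1F600 is kept because the matcher treats a valid surrogate pair as one code point.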

Roland Illig