Regarding the question "removing invalid XML characters from a string in java": in their answer, @McDowell says that one way to remove invalid XML characters is:

String xml10pattern = "[^"
                + "\u0009\r\n" // #x9 | #xA | #xD 
                + "\u0020-\uD7FF" // [#x20-#xD7FF]
                + "\uE000-\uFFFD" // [#xE000-#xFFFD] 
                + "\ud800\udc00-\udbff\udfff" // [#x10000-#x10FFFF]
                + "]";

and then:

replaceAll(xml10pattern, "");

Well, I have two questions:

  • Shouldn't all Unicode characters be escaped? I mean \\u0009\\u000A\\u000D... instead of \u0009\r\n, as I've seen in @ogrisel's answer to "Stripping Invalid XML characters in Java".
  • I don't understand how the last range (U+10000–U+10FFFF) converts into "\ud800\udc00-\udbff\udfff". Couldn't it be "\u10000-\u10FFFF"?

I really have to detect or filter these kinds of characters, and I'm not completely sure how to do it.

By the way, this has to work on JDK 1.5 (so expressions like \x{h...h} are not allowed).

Thanks a lot.

======UPDATE======

The way I was thinking of detecting whether a String str contains such invalid characters is:

if (!str.replaceAll(pattern, "").equals(str)) { 
    // str contains characters that are not valid in XML.
}
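For detection alone, building a second string with replaceAll is not strictly needed. A minimal sketch, assuming the same character class as above and a hypothetical helper name (containsInvalidXmlChars): compile the pattern once and ask the matcher whether it finds anything. Pattern and Matcher.find() are available on JDK 1.5.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not taken from either linked answer.
final class XmlCharDetector {
    // Same negated character class as in the question, compiled once and reused.
    private static final Pattern INVALID_XML10 = Pattern.compile(
            "[^"
            + "\u0009\r\n"                // #x9 | #xD | #xA
            + "\u0020-\uD7FF"             // [#x20-#xD7FF]
            + "\uE000-\uFFFD"             // [#xE000-#xFFFD]
            + "\ud800\udc00-\udbff\udfff" // [#x10000-#x10FFFF]
            + "]");

    // True as soon as one character outside the allowed ranges is found.
    static boolean containsInvalidXmlChars(String str) {
        return INVALID_XML10.matcher(str).find();
    }

    public static void main(String[] args) {
        System.out.println(containsInvalidXmlChars("plain text"));
        System.out.println(containsInvalidXmlChars("bell" + (char) 7 + "char"));
    }
}
```

find() stops at the first offending character, so it avoids copying the whole string just to compare it with the original.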

Any other advice would be very welcome ;)

Albert
  • 2
    As to your second question, the answer is no; a Java char is a UTF-16 code unit, so you need to match surrogate pairs here. Note, however, that since Java 1.7 you can also write `\x{10000}-\x{10FFFF}` instead. – fge Mar 12 '15 at 12:35
  • @fge, how is this done? I don't understand how `U+10000` converts into `\ud800\udc00` – Albert Mar 12 '15 at 12:39
  • 1
    The best I can give you here is [a Wikipedia link](http://en.wikipedia.org/wiki/UTF-16#U.2B10000_to_U.2B10FFFF) :) It explains how the leading and trailing surrogates are generated. – fge Mar 12 '15 at 12:41
  • Nice! I'd been looking for that, but I hadn't made the connection to `UTF-16`. – Albert Mar 12 '15 at 12:47
  • I'm in a Java 6 environment (`IBM J9 VM (build 2.6, JRE 1.6.0 Linux x86-32`) and get the error `Illegal character range near index 56` for the expression `[^\u0009\r\n\u0020-\uD7FF\uE000-\uFFFD\ud800\udc00-\udbff\udfff]`. Index 56 points to the backslash in front of `udfff`. – Bernhard Döbler Nov 27 '18 at 12:14
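If the regex engine itself is the obstacle (as with the `Illegal character range` error reported above), one option is to skip the regex and walk the string by code point. The following is only a sketch: the class and method names are made up for illustration, and it has not been tried on the IBM J9 VM mentioned above. It relies only on String.codePointAt, Character.charCount and StringBuilder.appendCodePoint, all of which exist since JDK 1.5.

```java
// Hypothetical regex-free filter; the ranges follow the XML 1.0 Char production quoted in the question.
final class XmlCharFilter {
    // True if the code point is allowed in an XML 1.0 document.
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Copies only the valid code points into a new string.
    static String stripInvalidXml10(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            // A valid surrogate pair is read as one code point;
            // an unpaired surrogate comes back as its own value and is dropped.
            int cp = in.codePointAt(i);
            if (isValidXml10(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Because the comparison is done on int code points, no surrogate arithmetic appears in the ranges at all.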

1 Answer

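1) It works both ways. \u0009 is a Java escape sequence: the compiler replaces it with the actual tab character before the regex engine ever sees the string. \\u0009 is a regex escape sequence: the string contains the literal text \u0009, which the regex engine decodes itself.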

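2) A Java String is UTF-16 encoded, so a code point above U+FFFF is stored as two 16-bit chars (a surrogate pair): U+10000 becomes \ud800\udc00 and U+10FFFF becomes \udbff\udfff. See "Unicode Character Representations" in the Character API documentation.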
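To see where the two ends of the range come from, here is a small sketch (the class and method names are illustrative). It asks Character.toChars for the UTF-16 code units of a code point and also shows the arithmetic by hand: subtract 0x10000, put the top 10 bits on 0xD800 and the low 10 bits on 0xDC00.

```java
// Illustrative only: shows how a supplementary code point maps to a surrogate pair.
final class SurrogateDemo {
    // Formats the UTF-16 code units of a code point as backslash-u escapes.
    static String toUtf16Escapes(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04x", (int) unit));
        }
        return sb.toString();
    }

    // Manual calculation, valid for code points from 0x10000 to 0x10FFFF.
    static char highSurrogate(int cp) {
        return (char) (0xD800 + ((cp - 0x10000) >> 10));
    }

    static char lowSurrogate(int cp) {
        return (char) (0xDC00 + ((cp - 0x10000) & 0x3FF));
    }

    public static void main(String[] args) {
        System.out.println(toUtf16Escapes(0x10000));
        System.out.println(toUtf16Escapes(0x10FFFF));
    }
}
```

Running main prints `\ud800\udc00` and then `\udbff\udfff`, which are exactly the two endpoints used in the pattern. Something like "\u10000" cannot work because a \u escape takes exactly four hex digits, so it is read as \u1000 followed by the character 0.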

Evgeniy Dorofeev
  • Regarding 1), which one should I use with `replaceAll`? Would both work? The way I was thinking of detecting such characters is: `str.replaceAll(pattern, "").equals(str)` – Albert Mar 12 '15 at 12:55
  • 1
    String str2 = str.replaceAll(pattern, ""); gives you str2 with the non-XML characters removed. Copy and paste the pattern from your question into your code. It works without the double backslash. – Evgeniy Dorofeev Mar 12 '15 at 13:00
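A quick way to convince yourself that both spellings behave the same with replaceAll is to compare them on a tab character. This is a minimal sketch with made-up names; the only difference it shows is what the regex engine receives.

```java
// Illustrative comparison of the two escape styles discussed in the answer.
final class EscapeDemo {
    // The compiler turns the escape into a real tab, so the regex is 3 chars long.
    static final String COMPILER_ESCAPED = "[\u0009]";
    // The regex engine receives backslash, u, 0, 0, 0, 9 and decodes it itself (8 chars).
    static final String REGEX_ESCAPED = "[\\u0009]";

    public static void main(String[] args) {
        String input = "a\tb";
        System.out.println(COMPILER_ESCAPED.length());
        System.out.println(REGEX_ESCAPED.length());
        // Both patterns remove the tab, so the two results are equal.
        System.out.println(input.replaceAll(COMPILER_ESCAPED, "")
                .equals(input.replaceAll(REGEX_ESCAPED, "")));
    }
}
```

One practical difference: the single-backslash form cannot be used for line terminators. Writing \u000A inside a Java string literal is turned into a real line break before the compiler parses the literal, which is a compile error, so the pattern in the question uses \r\n there, while the double-backslash form \\u000A is fine.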