Regarding the question "removing invalid XML characters from a string in java": in his/her answer, @McDowell says that one way to remove invalid XML characters is:
String xml10pattern = "[^"
+ "\u0009\r\n" // #x9 | #xA | #xD
+ "\u0020-\uD7FF" // [#x20-#xD7FF]
+ "\uE000-\uFFFD" // [#xE000-#xFFFD]
+ "\ud800\udc00-\udbff\udfff" // [#x10000-#x10FFFF]
+ "]";
and then:
replaceAll(xml10pattern, "");
Well, I have two questions:
- Shouldn't all Unicode characters be escaped? I mean \\u0009\\u000A\\u000D... instead of \u0009\r\n, like I've seen in @ogrisel's response: Stripping Invalid XML characters in Java
- I don't understand how the last range (U+10000–U+10FFFF) converts into "\ud800\udc00-\udbff\udfff". Couldn't it be "\u10000-\u10FFFF"?
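To make both questions concrete, here is a small sketch I would expect to run on JDK 1.5. The class name `EscapeDemo` and the helper `codeUnits` are made up for illustration. It assumes that a `\u` escape in Java source always takes exactly four hex digits, and that `java.util.regex` also understands its own `\uhhhh` escape, so the single-backslash and double-backslash spellings should produce equivalent character classes.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Hypothetical helper: lists the UTF-16 code units of s as 4-digit hex values.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Code points above U+FFFF are stored as two chars (a surrogate pair).
        System.out.println(codeUnits(new String(Character.toChars(0x10000))));  // d800 dc00
        System.out.println(codeUnits(new String(Character.toChars(0x10FFFF)))); // dbff dfff

        // A five-digit escape is read as a four-digit escape plus a literal '0'.
        System.out.println(codeUnits("\u10000")); // 1000 0030

        // Escape resolved by the compiler vs. escape resolved by the regex engine.
        Pattern compilerEscaped = Pattern.compile("[^\u0009\r\n]");
        Pattern regexEscaped = Pattern.compile("[^\\u0009\\u000A\\u000D]");
        System.out.println(compilerEscaped.matcher("\t\r\n").find()); // false
        System.out.println(regexEscaped.matcher("\t\r\n").find());    // false
        System.out.println(compilerEscaped.matcher("x").find());      // true
        System.out.println(regexEscaped.matcher("x").find());         // true
    }
}
```

If this is right, the two spellings differ only in who decodes the escape, and the single-backslash forms of U+000A and U+000D cannot be written in a Java string literal at all (the compiler turns them into a raw line break), which would explain the `\r\n` in the quoted pattern.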
I really need to detect or filter these kinds of characters, and I'm not completely sure how to do it.
By the way, this has to work on JDK 1.5 (so expressions like \x{h...h} are not allowed).
Thanks a lot.
======UPDATE======
The way I was thinking of detecting whether a String str contains such invalid characters is:
if (!str.replaceAll(xml10pattern, "").equals(str)) {
    // Contains characters that are not valid in XML.
}
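A possible alternative, sketched under the assumption that the pattern quoted above is correct: compile it once and ask the matcher whether it finds anything, instead of building a stripped copy and comparing. The class and method names (`XmlCharCheck`, `containsInvalidXmlChars`, `stripInvalidXmlChars`) are hypothetical, and only `java.util.regex` calls available since JDK 1.4/1.5 are used.

```java
import java.util.regex.Pattern;

class XmlCharCheck {
    // Same character class as the pattern quoted above, compiled once and reused.
    private static final Pattern INVALID_XML10 = Pattern.compile(
            "[^"
            + "\u0009\r\n"
            + "\u0020-\uD7FF"
            + "\uE000-\uFFFD"
            + "\ud800\udc00-\udbff\udfff"
            + "]");

    // True if at least one character falls outside the allowed ranges.
    static boolean containsInvalidXmlChars(String str) {
        return INVALID_XML10.matcher(str).find();
    }

    // Returns a copy of str with every disallowed character removed.
    static String stripInvalidXmlChars(String str) {
        return INVALID_XML10.matcher(str).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(containsInvalidXmlChars("plain text"));    // false
        System.out.println(containsInvalidXmlChars("bad\u0000char")); // true
        System.out.println(stripInvalidXmlChars("bad\u0001char"));    // badchar
    }
}
```

`find()` stops at the first offending character, so the check does not need to allocate a second string when the input is clean.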
Any other advice would be very welcome ;)