According to the XML spec, only the following characters are legal:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
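For reference, the Char production above can be written as a code point check. This is only a minimal sketch: the class name XmlChars and the method isLegalXmlChar are hypothetical names chosen for illustration, not part of any library.

```java
class XmlChars {
    // True if the code point matches the XML 1.0 Char production quoted above.
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(isLegalXmlChar(0x0002)); // false: control character
        System.out.println(isLegalXmlChar(0x000B)); // false: vertical tab
        System.out.println(isLegalXmlChar('a'));    // true
    }
}
```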
I have a string named foo containing the JSON representation of an object. Some strings of the JSON object contain escape sequences for characters that are illegal in XML, e.g. \u0002 and \u000b.
I want to strip those escape sequences from foo before passing it to a JSON-to-XML converter, because the converter is a black box that provides no way to handle those invalid characters.
Example of what I would like to do:
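String MAGIC_REGEX = "<here's what needs to be found>"; // TODO
String foo = "\\u0002bar b\\u000baz qu\\u000fx";
String clean_foo = foo.replaceAll(MAGIC_REGEX, "\uFFFD"); // U+FFFD is the Unicode replacement character; replaceAll takes a regex, replace does not
System.out.println(clean_foo); // Desired output: "bar baz qux" once the escapes are stripped (or "�bar b�az qu�x" when replacing with U+FFFD)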
How can I achieve that? Bonus points for solutions that use a regex instead of parsing the string and comparing Unicode codepoints.
I am aware of this SO question. However, my problem here is the escape sequences of the illegal characters, not the literal characters themselves.
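To make the goal concrete, here is a rough sketch of the kind of regex I have in mind. It rests on several assumptions: it only targets the BMP escapes for U+0000 to U+0008, U+000B, U+000C, U+000E to U+001F, U+FFFE and U+FFFF; it deliberately leaves surrogate escapes (D800 to DFFF) alone, since a valid surrogate pair in JSON encodes a legal supplementary character; and it skips escaped backslashes, so a literal backslash followed by the letter u is not touched. The names StripIllegalXmlEscapes, ILLEGAL_ESCAPE and strip are hypothetical.

```java
import java.util.regex.Pattern;

class StripIllegalXmlEscapes {
    // Group 1 captures any escaped backslashes (pairs) in front of the escape so
    // they survive the replacement; the lookbehind anchors the match at the start
    // of a backslash run, so only an odd-length run can begin a real escape.
    private static final Pattern ILLEGAL_ESCAPE = Pattern.compile(
        "(?<!\\\\)((?:\\\\\\\\)*)\\\\u(?:00(?:0[0-8bBcCeEfF]|1[0-9a-fA-F])|[fF]{3}[eEfF])");

    // Removes the matched escape sequences and keeps everything else as is.
    static String strip(String json) {
        return ILLEGAL_ESCAPE.matcher(json).replaceAll("$1");
    }

    public static void main(String[] args) {
        String foo = "\\u0002bar b\\u000baz qu\\u000fx";
        System.out.println(strip(foo)); // bar baz qux
    }
}
```

Passing "$1\uFFFD" instead of "$1" to replaceAll would substitute the Unicode replacement character rather than removing the escape. Lone surrogate escapes are not handled by this sketch.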