
According to the XML spec, only the following characters are legal:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
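
Translated into code, that production is just a codepoint range check; a rough Java sketch (the method name is made up for illustration):

// Rough sketch of the Char production above as a codepoint test
// (the method name is illustrative only).
static boolean isLegalXmlChar(int cp) {
    return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
}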

I have a string named foo containing the JSON representation of an object. Some of the strings in that JSON object contain escape sequences for characters that are illegal in XML, e.g. \u0002 and \u000b.

I want to strip those escape sequences from foo before passing it to a JSON-to-XML converter, because the converter is a black box that provides no way to handle those invalid characters.

An example of what I would like to do:

String MAGIC_REGEX = "<here's what needs to be found>";  // TODO

String foo = "\\u0002bar b\\u000baz qu\\u000fx";
String clean_foo = foo.replaceAll(MAGIC_REGEX, "");  // or substitute U+FFFD, the Unicode replacement character

System.out.println(clean_foo);  // Output is "bar baz qux"

How can I achieve that? Bonus points for solutions that use a regex instead of parsing the string and comparing Unicode codepoints.

I am aware of this SO question. However, my problem here is the escape sequences of the illegal characters, not the actual characters themselves.

  • @Jerry: Of course I could try to write my own hand-crafted regex. The problem is that identifying the illegal characters and illegal multi-byte combinations is already complex: http://stackoverflow.com/a/961504/1388240 So I was hoping that someone had already run into the same problem and solved it. – cyroxx Sep 10 '13 at 11:27
  • Ow, okay. I guess this means I'm not sure (probably since I'm not familiar with the topic of xml and unicode chars) what you're trying to do exactly. Good luck! – Jerry Sep 10 '13 at 12:01
  • What exactly is the problem with the escaping that the XML parser does automatically? – Hauke Ingmar Schmidt Sep 10 '13 at 14:20
  • @his: The parser doesn't really escape. The parser component I am working with seems to parse the JSON, blindly unescape any characters to get a Unicode string, and feed that into an XML library - which obviously crashes because of the illegal characters. That's why I would like to remove the (escaped) illegal characters from the JSON string before feeding it to the parser component. Does this explanation make the problem any clearer? – cyroxx Sep 12 '13 at 12:02

1 Answer


I finally came up with the regex below, which matches the escape sequences of almost all characters that are illegal according to the XML spec, except the ones above #x10000 (#x11000 and onwards):

# variant for use without a case-insensitive flag (both hex-digit cases spelled out)
\\\\u(00(0[0-8BbCcEeFf]|1[0-9A-Fa-f])|[Dd][89A-Fa-f][0-9A-Fa-f]{2}|[Ff]{3}[EeFf])

# variant for use with a case-insensitive flag (e.g. (?i) or Pattern.CASE_INSENSITIVE)
\\\\u(00(0[0-8bcef]|1[0-9a-f])|d[89a-f][0-9a-f]{2}|fff[ef])
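
For completeness, here is a minimal sketch of how the flag-based variant could be wired up in Java; the class and method names (XmlEscapeStripper, stripIllegalXmlEscapes) are made up for illustration:

import java.util.regex.Pattern;

public class XmlEscapeStripper {

    // Escape sequences of XML-illegal characters:
    //   00(0[0-8bcef]|1[0-9a-f])  -> U+0000-U+0008, U+000B, U+000C, U+000E-U+001F
    //   d[89a-f][0-9a-f]{2}       -> surrogates U+D800-U+DFFF
    //   fff[ef]                   -> U+FFFE and U+FFFF
    private static final Pattern ILLEGAL_XML_ESCAPE = Pattern.compile(
            "\\\\u(00(0[0-8bcef]|1[0-9a-f])|d[89a-f][0-9a-f]{2}|fff[ef])",
            Pattern.CASE_INSENSITIVE);

    static String stripIllegalXmlEscapes(String json) {
        return ILLEGAL_XML_ESCAPE.matcher(json).replaceAll("");
    }

    public static void main(String[] args) {
        // Prints "bar baz qux"
        System.out.println(stripIllegalXmlEscapes("\\u0002bar b\\u000baz qu\\u000fx"));
    }
}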