3

What are the UTF-8 characters that would break the XML?

I'm passing a UTF-8 string in the XML and I want to make sure that none of the characters would break the XML.

Ali_IT

3 Answers

3

You are looking at this from the wrong perspective. It is not a matter of which UTF-8 sequences will break XML. UTF-8 is just an encoding scheme, and the XML spec does not define its character rules in terms of encodings; it defines them in terms of Unicode codepoints. It just happens that XML can be encoded in UTF-8, but again, that is an encoding scheme, not a processing scheme.

So the real question is:

Which Unicode codepoints, when decoded from a UTF-8 string, would break XML?

And the answer to that is clearly described in the XML spec itself, which outlines which codepoints are allowed and restricted in the various sections of XML. For example:

Text characters are defined as:

Char    ::=    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */ 
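As a rough sketch of how the Char production above translates into code, the ranges can be checked per code point. The class and method names here (XmlChars, isXmlChar, firstInvalidIndex) are made up for illustration and are not part of any library; the ranges are the XML 1.0 ones quoted above.

```java
final class XmlChars {
    private XmlChars() {}

    /** True if the code point matches the XML 1.0 Char production quoted above. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Returns the char index of the first code point that is not a Char,
     * or -1 if the whole string is acceptable as XML text characters.
     * An unpaired surrogate is reported as invalid.
     */
    static int firstInvalidIndex(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isXmlChar(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }
}
```

Note that this only answers "is this code point allowed at all"; it does not escape markup characters such as < and &, which is a separate step.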

...

Note:

Document authors are encouraged to avoid "compatibility characters", as defined in section 2.3 of [Unicode]. The characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters:

[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDEF],
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
[#x10FFFE-#x10FFFF].

Whitespace characters are defined as:

S    ::=    (#x20 | #x9 | #xD | #xA)+ 

Name and token characters are defined as:

NameStartChar    ::=    ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF] 

     NameChar    ::=    NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040] 
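The two productions above can be turned into a name check fairly mechanically. This is only a sketch: XmlNames and its methods are hypothetical names, and it assumes the spec's rule that a Name is one NameStartChar followed by zero or more NameChar. It does not check namespace rules (for example, how many colons a qualified name may contain).

```java
final class XmlNames {
    private XmlNames() {}

    /** Mirrors the NameStartChar production quoted above. */
    static boolean isNameStartChar(int cp) {
        return cp == ':' || cp == '_'
            || (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z')
            || (cp >= 0xC0 && cp <= 0xD6) || (cp >= 0xD8 && cp <= 0xF6)
            || (cp >= 0xF8 && cp <= 0x2FF) || (cp >= 0x370 && cp <= 0x37D)
            || (cp >= 0x37F && cp <= 0x1FFF) || (cp >= 0x200C && cp <= 0x200D)
            || (cp >= 0x2070 && cp <= 0x218F) || (cp >= 0x2C00 && cp <= 0x2FEF)
            || (cp >= 0x3001 && cp <= 0xD7FF) || (cp >= 0xF900 && cp <= 0xFDCF)
            || (cp >= 0xFDF0 && cp <= 0xFFFD) || (cp >= 0x10000 && cp <= 0xEFFFF);
    }

    /** Mirrors the NameChar production quoted above. */
    static boolean isNameChar(int cp) {
        return isNameStartChar(cp) || cp == '-' || cp == '.'
            || (cp >= '0' && cp <= '9') || cp == 0xB7
            || (cp >= 0x300 && cp <= 0x36F) || (cp >= 0x203F && cp <= 0x2040);
    }

    /** True if s is one NameStartChar followed by zero or more NameChar. */
    static boolean isName(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!isNameStartChar(first)) {
            return false;
        }
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isNameChar(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }
}
```

So a digit or hyphen is fine inside a name but not as its first character, and a space is never allowed.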

Just to name a few. There are many more definitions for characters in literals, comments, character data, processing instructions, CDATA sections, and so on.

So, you need to read the XML spec to know which Unicode codepoints are allowed in any given context within the XML. Different sections and syntax elements have different rules about what is and is not acceptable.

Remy Lebeau
2

UTF-8 was designed never to break ASCII, which in this case includes the XML markup characters <, > and &. A multi-byte sequence also cannot swallow one of those characters: a 7-bit ASCII byte never occurs inside a multi-byte sequence, because every byte of such a sequence has its high bit set.

One problem is the redundant BOM character at the start of the file, U+FEFF, a zero-width no-break space (hence invisible). It is used to distinguish UTF-8, UTF-16LE and UTF-16BE, but XML parsing will sometimes fail on a BOM.
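To see that property for yourself, here is a small sketch that encodes a string to UTF-8 and inspects the raw bytes. Utf8AsciiSafety and both method names are invented for this example; only standard JDK calls are used.

```java
final class Utf8AsciiSafety {
    private Utf8AsciiSafety() {}

    /** Counts the bytes in the UTF-8 encoding of s that equal the given ASCII character. */
    static int countByte(String s, char ascii) {
        int n = 0;
        for (byte b : s.getBytes(java.nio.charset.StandardCharsets.UTF_8)) {
            if (b == (byte) ascii) {
                n++;
            }
        }
        return n;
    }

    /** True if every byte of every multi-byte UTF-8 sequence in s has its high bit set. */
    static boolean multiByteBytesHaveHighBit(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            byte[] bytes = new String(Character.toChars(cp))
                .getBytes(java.nio.charset.StandardCharsets.UTF_8);
            if (bytes.length > 1) {
                for (byte b : bytes) {
                    if ((b & 0x80) == 0) {
                        return false; // would mean an ASCII byte inside a sequence
                    }
                }
            }
            i += Character.charCount(cp);
        }
        return true;
    }
}
```

Counting the '<' bytes in an encoded string therefore gives exactly the number of real '<' characters, no matter how many accented or CJK characters surround them.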

To remove a BOM at the beginning of a String:

String xml = "...";
xml = xml.replaceFirst("^\uFEFF", "");

However, there are also deprecated Unicode characters that are deprecated in XML too: characters not suitable for use with markup. (This relates more to HTML.)

And then there are tag names, which may contain non-ASCII Unicode letters (XML 1.0 already permits this; its fifth edition and XML 1.1 widen the allowed ranges). Here it is advisable to settle on one canonical way of composing letters with accents.

For instance, the Unicode letter ĉ can either be one single char, c-circumflex, "\u0109", or two chars, c followed by the combining circumflex accent, "c\u0302". Since one would not see a difference between the two, a normalization seems in order.

xml = java.text.Normalizer.normalize(xml, java.text.Normalizer.Form.NFKC);
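A short sketch of what normalization does to the ĉ example, and how the forms differ. NormalizeDemo and its two helpers are just illustrative names; java.text.Normalizer is the standard JDK class. Which form is right depends on your data: NFC only composes canonically equivalent sequences, while NFKC additionally folds compatibility characters (ligatures and the like), which changes the text more aggressively.

```java
final class NormalizeDemo {
    private NormalizeDemo() {}

    /** Canonical composition: equivalent sequences collapse to one form. */
    static String toNfc(String s) {
        return java.text.Normalizer.normalize(s, java.text.Normalizer.Form.NFC);
    }

    /** Compatibility composition: also replaces compatibility characters. */
    static String toNfkc(String s) {
        return java.text.Normalizer.normalize(s, java.text.Normalizer.Form.NFKC);
    }
}
```

With these helpers, toNfc("c\u0302") yields the single char "\u0109", so the two spellings of ĉ compare equal after normalization. The ligature "\uFB01" is left alone by NFC but becomes the two letters "fi" under NFKC.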
Joop Eggen
2

The XML recommendation states which characters can be used in an XML document. Basically, it disallows all control characters below U+0020 (except tab, line feed and carriage return), the surrogate blocks, U+FFFE and U+FFFF.

Note that element and attribute names have some extra restrictions, which for example disallow several punctuation characters. There is a more specific answer on XML names.
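If you would rather drop the offending characters than reject the input, something along these lines could work. It is a sketch under the assumption that silently removing characters is acceptable for your data (often it is not, and replacing or rejecting is safer). XmlTextSanitizer, isAllowed and stripDisallowed are made-up names.

```java
final class XmlTextSanitizer {
    private XmlTextSanitizer() {}

    /** Applies the exclusions listed above, one by one. */
    static boolean isAllowed(int cp) {
        if (cp == 0x9 || cp == 0xA || cp == 0xD) {
            return true;                 // tab, line feed, carriage return
        }
        if (cp < 0x20) {
            return false;                // remaining control characters below U+0020
        }
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return false;                // surrogate blocks (only seen here when unpaired)
        }
        if (cp == 0xFFFE || cp == 0xFFFF) {
            return false;                // U+FFFE and U+FFFF
        }
        return cp <= 0x10FFFF;
    }

    /** Returns s with every disallowed code point removed. */
    static String stripDisallowed(String s) {
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints()
            .filter(XmlTextSanitizer::isAllowed)
            .forEach(out::appendCodePoint);
        return out.toString();
    }
}
```

Properly paired surrogates arrive as a single supplementary code point from codePoints(), so characters outside the BMP survive; only unpaired surrogates are dropped.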

jasso