285

I am working with some XML that holds strings like:

<node>This is a string</node>

Some of the strings that I am passing to the nodes will have characters like &, #, $, etc.:

<node>This is a string & so is this</node>

This is not valid due to &.

I cannot wrap these strings in CDATA as they need to be as they are. I tried looking for a list of characters that cannot be put in XML nodes without being in a CDATA.

Can someone point me in the direction of one or provide me with a list of illegal characters?

the Tin Man
RailsSon

15 Answers

293

OK, let's separate the question of the characters that:

  1. aren't valid at all in any XML document.
  2. need to be escaped.

The answer provided by @dolmen in "https://stackoverflow.com/questions/730133/invalid-characters-in-xml/5110103#5110103" is still valid but needs to be updated with the XML 1.1 specification.

1. Invalid characters

The productions below list every character that may appear in an XML document; anything outside them is invalid.

1.1. In XML 1.0

The global list of allowed characters is:

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

Basically, control characters (other than tab, line feed, and carriage return) and code points outside the listed Unicode ranges are not allowed. This also means that a character reference such as &#x3; is forbidden.

1.2. In XML 1.1

The global list of allowed characters is:

[2] Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

[2a] RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]

This revision of the XML recommendation extends the allowed characters to include most control characters and takes a newer revision of the Unicode standard into account. The characters in RestrictedChar may appear only as character references, and some characters are still not allowed at all: NUL (x00), xFFFE, xFFFF...

However, the use of control characters and undefined Unicode characters is discouraged.

Note also that not all parsers support XML 1.1, so XML documents containing control characters may be rejected.
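As a rough illustration of the two Char productions quoted above, the ranges can be checked one code point at a time. This is only a sketch in Java: the class and method names (XmlChars, isXml10Char, isXml11Char, stripInvalidXml10) are made up for this example and come from neither the specification nor any library.

```java
public final class XmlChars {
    private XmlChars() {}

    /** XML 1.0, production [2] Char. */
    public static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** XML 1.1, production [2] Char (this does not check RestrictedChar). */
    public static boolean isXml11Char(int cp) {
        return (cp >= 0x1 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Drops every code point that XML 1.0 does not allow, including unpaired surrogates. */
    public static String stripInvalidXml10(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // codePoints() yields full code points, so supplementary characters stay intact.
        text.codePoints()
            .filter(XmlChars::isXml10Char)
            .forEach(out::appendCodePoint);
        return out.toString();
    }
}
```

Working on code points rather than UTF-16 chars is a deliberate choice here, so that characters in the #x10000-#x10FFFF range are kept instead of being treated as two invalid surrogates.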

2. Characters that need to be escaped (to obtain a well-formed document):

The < must be escaped as &lt; (or the numeric character reference &#60;), since it is assumed to be the beginning of a tag.

The & must be escaped as &amp; (or &#38;), since it is assumed to be the beginning of an entity reference.

The > should be escaped as &gt; (or &#62;). It is not mandatory (it depends on the context), but it is strongly advised to escape it.

The ' should be escaped as &apos; (or &#39;). This is mandatory in attributes delimited by single quotes, but it is strongly advised to always escape it.

The " should be escaped as &quot; (or &#34;). This is mandatory in attributes delimited by double quotes, but it is strongly advised to always escape it.
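The escaping rules above can be sketched as a small routine. The Java below is an illustration only: the class name XmlEscaper and its method are hypothetical, it uses the predefined entities (numeric character references would work equally well), and it escapes all five characters in every context, which is more than the specification strictly requires. It does not remove characters that are invalid in XML altogether.

```java
public final class XmlEscaper {
    private XmlEscaper() {}

    /** Escapes the five special characters so the result is safe in text and attribute values. */
    public static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

For example, XmlEscaper.escape("This is a string & so is this") returns "This is a string &amp; so is this".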

John
potame
  • 3
    *" but it is strongly advised to always escape it"* - Could you clarify that bit? Who advises that, and why? (The way I see it, there's nothing wrong with using literal quotes wherever they are syntactically allowed.) – Tomalak Jan 12 '21 at 15:30
  • Shouldn't `'` be escaped as `&apos;` instead? https://www.w3.org/TR/REC-xml/#syntax – Simon Jan 24 '22 at 10:08
  • 1
    @Simon hey, I didn't notice the answer has been modified because I originally wrote to escape with `&apos;`. However, both will work, since numeric character references are equally recognized https://www.w3.org/TR/REC-xml/#dt-charref – potame Jan 24 '22 at 10:41
  • 2
    For 2.: see https://stackoverflow.com/questions/1091945/what-characters-do-i-need-to-escape-in-xml-documents for details. These 5 characters needn't *always* be escaped, just in some circumstances. – Thomas Weller Mar 08 '22 at 12:48
179

The list of valid characters is in the XML specification:

Char       ::=      #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]  /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
dolmen
  • 8
    You should note that although they are legal characters, `& < > " '` have to be escaped in certain contexts. – D.Shawley May 08 '11 at 19:49
  • 7
    "Legal" in this context means that their final decoded values are legal, not that they are legal in the stream. As above, some legal values have to be escaped in-stream. – SilverbackNet Jul 16 '11 at 01:59
  • I have an issue where 0x1c is an illegal character... Looking for a possibility in java how to avoid these.... – basZero Dec 10 '13 at 09:20
  • A nice overview which characters are valid and which are not can be found here http://validchar.com/d/xml10/xml10_namestart – Dr. Max Völkel Feb 21 '14 at 21:58
  • 8
    @xamde That list is nice, but it only shows the characters that may be used to start an XML element. The issue at hand is which characters are valid in an XML file in general. There are certain characters that are not allowed anywhere. – Jon Senchyna Jun 23 '14 at 19:58
  • My answer is more complete, more visual, and still gives credit to others. - I wonder why you downvoted it ? – bvdb Feb 09 '17 at 20:28
  • I just ran a test where #x1 got written to an XML element perfectly legally, escaped as `&#x1;`. So what's illegal about it? – Kyle Delaney Mar 12 '18 at 18:05
168

The only illegal characters are &, < and > (as well as " or ' in attributes, depending on which character is used to delimit the attribute value: attr="must use &quot; here, ' is allowed" and attr='must use &apos; here, " is allowed').

They're escaped using XML entities; in this case you want &amp; for &.

Really, though, you should use a tool or library that writes XML for you and abstracts this kind of thing away for you so you don't have to worry about it.

Tomalak
Welbog
  • 1
    [And ‘>’ doesn't always *have* to be escaped, either, although it's probably easiest to do so. It's only the string ‘]]>’ that's invalid (in element content). Bit of a strange wart really.] – bobince Apr 08 '09 at 14:58
  • 95
    Some controls characters are also not allowed. See my answer below. – dolmen Feb 24 '11 at 20:36
  • 53
    Actually that's not quite true. A number of lower ascii characters are invalid also. If you try to write 0x03 to an Xml document you get an error typically and if you do manage to properly escape it into an XML document, most viewers will complain about the invalid character. Edge case but it does happen. – Rick Strahl Jan 02 '12 at 09:56
  • 2
    0x1f is also an invalid character in XML 1.0. It's valid though in XML 1.1. – Florian Fankhauser Dec 05 '12 at 14:45
  • 2
    also 0x0B, or "\v", a vertical tab. – Dusda Oct 11 '13 at 00:25
  • 24
    This answer is absolutely wrong. Here is my XML exception with 0x12 illegal character 'System.Xml.XmlException: '', hexadecimal value 0x12, is an invalid character' – George Sep 30 '14 at 15:54
  • 9
    It's also wrong in the other direction; as well as missing every single illegal character, the characters it does claim are illegal are perfectly legal, albeit with special meaning in the context. – Jon Hanna Dec 16 '14 at 14:47
  • 7
    In XML 1.0 there are many illegal characters. In fact even using a character entity for most control characters will cause an error when parsing. – Thayne Nov 17 '15 at 16:21
  • ° this character is not serializable correctly in GWT thus not valid xml character. – Java Main Mar 10 '17 at 16:17
  • Those are not the only illegal characters. For example a vertical tab character will break an xml parser. – Harry May 15 '18 at 07:26
  • Also, > is not illegal as long as it does not follow ]]. – rghome Oct 24 '18 at 13:49
  • "Really, though, you should use a tool or library that writes XML for you and abstracts this kind of thing away for you so you don't have to worry about it." Not really, xml is simple enough that you can write this on your own and be more in control of your data. – TZubiri Mar 01 '20 at 23:39
  • 1
    can this answer no longer be the accepted answer since the highest voted answer is better – Luke Jul 22 '20 at 07:26
64

This is C# code that removes the invalid XML characters from a string and returns a new valid string.

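using System.Text.RegularExpressions;

public static string CleanInvalidXmlChars(string text)
{
    // Valid chars per the XML 1.0 spec:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    // .NET's \u escape takes exactly four hex digits, so [#x10000-#x10FFFF] cannot be
    // written as \u10000-\u10FFFF. Well-formed surrogate pairs are kept instead.
    string re = @"[\x00-\x08\x0B\x0C\x0E-\x1F\uFFFE\uFFFF]"  // control chars, FFFE, FFFF
              + @"|[\uD800-\uDBFF](?![\uDC00-\uDFFF])"       // high surrogate without a low one
              + @"|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]";     // low surrogate without a high one
    return Regex.Replace(text, re, "");
}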
Jon Senchyna
mathifonseca
  • 6
    For Java, the regex pattern would be the same. And then you can use the method called replaceAll in the class String that expects a regex pattern as parameter. Check this: http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#replaceAll%28java.lang.String,%20java.lang.String%29 – mathifonseca Dec 10 '13 at 14:37
  • 3
    I have such invalid characters in my string: SUSITARIMO DL DARBO SUTARTIES This code doesn't remove So the xml document fails to init. – Dainius Kreivys Jul 30 '15 at 10:05
  • 2
    I believe you cannot just put this pattern into a .NET regex constructor. I don't think it recognizes `\u10000` and `\u10FFFF` as single characters as they require two utf-16 `char` instances each, and according to the [docs](https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-escapes-in-regular-expressions) there might not be more that 4 digits. `[\u10000-\u10FFFF]` is most likely parsed as [`\u1000`, `0-\u10FF`, `F`, `F`] which is weird looking but legal. – GSerg May 23 '18 at 16:16
  • A better implementation that takes care of the utf-16 characters can be found here: https://stackoverflow.com/a/17735649/1639057 – TheLogicMan Oct 01 '20 at 08:56
  • be careful to use this method, your valid UTF character will also be replaced with empty string, causing unexpected result on application – bijayk Oct 19 '20 at 15:11
  • Use XmlConvert.VerifyXmlChars check instead. var strInput = "बिजय Bijay"; var strOutput = ""; try { strOutput = XmlConvert.VerifyXmlChars(strInput); } catch { strOutput = new string(strInput.Where(ch => XmlConvert.IsXmlChar(ch)).ToArray()); } – bijayk Oct 19 '20 at 15:29
18

The predeclared characters are:

& < > " '

See "What are the special characters in XML?" for more information.

the Tin Man
cgp
12

In addition to potame's answer: if you do want to use a CDATA block, you don't need to escape the text inside it. In that case you can use all characters in the following range:

[Image: graphical representation of possible characters]

Note: On top of that, you're not allowed to use the ]]> character sequence, because it would match the end of the CDATA block.

If there are still invalid characters (e.g. control characters), then probably it's better to use some kind of encoding (e.g. base64).
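If a CDATA section is an option, the ]]> restriction is commonly worked around by closing the section between ]] and > and immediately opening a new one. A minimal sketch in Java, with a hypothetical class and method name; it assumes the text already contains only characters that are valid in XML:

```java
public final class CdataWriter {
    private CdataWriter() {}

    /** Wraps text in CDATA, splitting any "]]>" across two adjacent sections. */
    public static String wrap(String text) {
        // "]]>" becomes "]]" + end of section + start of new section + ">"
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }
}
```

For example, CdataWriter.wrap("a]]>b") returns "<![CDATA[a]]]]><![CDATA[>b]]>", which a parser reads back as the two adjacent sections "a]]" and ">b".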

bvdb
  • 3
    Wether in a CDATA block or not, some characters are forbidden in XML. – dolmen Feb 09 '17 at 16:45
  • 7
    exactly, isn't that what I wrote ? quote: "all characters *in the following range*". By which I mean, only the characters in this specific range. Other characters are not allowed. - fully agree ; but I don't understand the downvote. - no hard feelings though. – bvdb Feb 09 '17 at 20:11
9

Another way to remove incorrect XML chars in C# is using XmlConvert.IsXmlChar (Available since .NET Framework 4.0)

using System.Linq; // Where, All, ToArray

public static string RemoveInvalidXmlChars(string content)
{
   return new string(content.Where(ch => System.Xml.XmlConvert.IsXmlChar(ch)).ToArray());
}

or you may check that all characters are XML-valid:

public static bool CheckValidXmlChars(string content)
{
   return content.All(ch => System.Xml.XmlConvert.IsXmlChar(ch));
}


For example, the vertical tab character (\v) is valid UTF-8 but not valid XML 1.0, and many libraries (including libxml2) miss it and silently output invalid XML.

the Tin Man
Alex Vazhev
5

Another easy way to escape potentially unwanted XML / XHTML chars in C# is:

WebUtility.HtmlEncode(stringWithStrangeChars)
tiands
3

For Java folks, Apache has a utility class (StringEscapeUtils) that has a helper method escapeXml which can be used for escaping characters in a string using XML entities.

the Tin Man
A Null Pointer
3

"XmlWriter and lower ASCII characters" worked for me

string code = Regex.Replace(item.Code, @"[\u0000-\u0008\u000B\u000C\u000E-\u001F]", ""); // no commas inside the class: they would be matched (and removed) as literal commas
the Tin Man
Kalpesh Popat
2

In summary, valid characters in the text are:

  • tab, line-feed and carriage-return.
  • all non-control characters are valid except & and <.
  • > is not valid if following ]].

Sections 2.2 and 2.4 of the XML specification provide the answer in detail:

Characters

Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646

Character data

The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings " &amp; " and " &lt; " respectively. The right angle bracket (>) may be represented using the string " &gt; ", and must, for compatibility, be escaped using either " &gt; " or a character reference when it appears in the string " ]]> " in content, when that string is not marking the end of a CDATA section.

the Tin Man
rghome
1

In the Woodstox XML processor, invalid characters are classified by this code:

if (c == 0) {
    throw new IOException("Invalid null character in text to output");
}
if (c < ' ' || (c >= 0x7F && c <= 0x9F)) {
    String msg = "Invalid white space character (0x" + Integer.toHexString(c) + ") in text to output";
    if (mXml11) {
        msg += " (can only be output using character entity)";
    }
    throw new IOException(msg);
}
if (c > 0x10FFFF) {
    throw new IOException("Illegal unicode character point (0x" + Integer.toHexString(c) + ") to output; max is 0x10FFFF as per RFC");
}
/*
 * Surrogate pair in non-quotable (not text or attribute value) content, and non-unicode encoding (ISO-8859-x,
 * Ascii)?
 */
if (c >= SURR1_FIRST && c <= SURR2_LAST) {
    throw new IOException("Illegal surrogate pair -- can only be output via character entities, which are not allowed in this content");
}
throw new IOException("Invalid XML character (0x"+Integer.toHexString(c)+") in text to output");

Source from here

the Tin Man
1
ampersand (&) is escaped to &amp;

double quotes (") are escaped to &quot;

single quotes (') are escaped to &apos; 

less than (<) is escaped to &lt; 

greater than (>) is escaped to &gt;

In C#, use System.Security.SecurityElement.Escape or System.Net.WebUtility.HtmlEncode to escape these illegal characters.

string xml = "<node>it's my \"node\" & i like it 0x12 x09 x0A  0x09 0x0A <node>";
string encodedXml1 = System.Security.SecurityElement.Escape(xml);
string encodedXml2= System.Net.WebUtility.HtmlEncode(xml);


encodedXml1
"&lt;node&gt;it&apos;s my &quot;node&quot; &amp; i like it 0x12 x09 x0A  0x09 0x0A &lt;node&gt;"

encodedXml2
"&lt;node&gt;it&#39;s my &quot;node&quot; &amp; i like it 0x12 x09 x0A  0x09 0x0A &lt;node&gt;"
the Tin Man
live-love
-2

Anyone tried this System.Security.SecurityElement.Escape(yourstring)? This will replace invalid XML characters in a string with their valid equivalent.

the Tin Man
klaydze
-5

For XSL (on really lazy days) I use:

capture="&amp;(?!amp;)" capturereplace="&amp;amp;"

to translate all &-signs that aren't followed by amp; to proper ones.

We have cases where the input is in CDATA but the system which uses the XML doesn't take it into account. It's a sloppy fix, beware...