143

I was wondering if there is any way to escape a CDATA end token (]]>) within a CDATA section in an xml document. Or, more generally, if there is some escape sequence for using within a CDATA (but if it exists, I guess it'd probably only make sense to escape begin or end tokens, anyway).

Basically, can you have a begin or end token embedded in a CDATA section and tell the parser not to interpret it, but to treat it as just another character sequence?

Probably, you should just refactor your xml structure or your code if you find yourself trying to do that, but even though I've been working with xml on a daily basis for the last 3 years or so and I have never had this problem, I was wondering if it was possible. Just out of curiosity.

Edit:

Other than using html encoding...

Aaron Digulla
Juan Pablo Califano
  • 4
    First, i accept the answer as correct but note: Nothing precludes someone from encoding `>` as `&gt;` within CData to ensure embedded `]]>` will not be parsed as CDEnd. It simply means it's unexpected and that `&` must FIRST be encoded as `&amp;` too so that the data can be properly decoded. Users of the document must know to decode this CData too. It's not unheard of since part of the purpose of CData is to contain content that a specific consumer understands how to handle. Such a CData just can't be expected to be interpreted properly by any generic consumer. – nix May 16 '11 at 14:54
  • 1
    @nix, CDATA just provides an explicit way to declare text node content such that language tokens within (other than ]]>) do not get parsed. It specifically does not expand entity references like &gt; for this reason, so in a CDATA block, that just means those four characters, not '>'. To put it in perspective: in the xml spec, all text content is called "cdata", not just these sequences ("character data"). Also it's not about specific consuming agents. (Such a thing does exist though -- processing instructions (). – Semicolon Oct 11 '15 at 05:37
  • (I should add, even if this sort of thing runs contrary to the original intent of the node, all is fair in the long & torturous battle with XML. I just feel it could be useful for readers to know that <![CDATA[]]> was not actually designed for that purpose.) – Semicolon Oct 11 '15 at 05:48
  • 1
    @Semicolon `CDATA` was designed to allow _anything_: _they are used to escape blocks of text containing characters which would otherwise be recognized as markup_ That implies `CDATA` too since it is also markup. But, in fact, you don't need the double encoding I implied. `]]&gt;` is an acceptable means of encoding a `CDEnd` within a `CDATA`. – nix Oct 11 '15 at 19:14
  • True, you wouldn't need double encoding -- but you would still need the agent to have special knowledge, since the parser wouldn't parse &gt; as >. That's what you mean though, I think? That you could replace them as you see fit, after parsing? – Semicolon Oct 11 '15 at 19:41

10 Answers

179

You have to break your data into pieces to conceal the ]]>.

Here's the whole thing:

<![CDATA[]]]]><![CDATA[>]]>

The first <![CDATA[]]]]> has the ]]. The second <![CDATA[>]]> has the >.
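If you are generating the XML as a string, the split can be automated. Here is a minimal Python sketch of that idea; the helper name `to_cdata` is made up for illustration and is not part of any library.

```python
import xml.etree.ElementTree as ET


def to_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting every ']]>' across two sections."""
    # ']]>' becomes ']]' + end of section + start of new section + '>'
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


print(to_cdata("]]>"))  # prints <![CDATA[]]]]><![CDATA[>]]>

# A parser hands the original text back, with the two sections joined.
round_trip = ET.fromstring("<doc>" + to_cdata("a ]]> b") + "</doc>").text
print(round_trip)  # prints a ]]> b
```

Note that this only covers the CDATA end token; characters that are not legal in XML at all still cannot be stored this way.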

S.Lott
  • 1
    Thanks for your answer. I was rather looking for something like a backslash equivalent (within strings in C, PHP, Java, etc). According to the rule quoted by ddaa, it seems like there's not such a thing. – Juan Pablo Califano Oct 21 '08 at 23:11
  • 32
    This should be the accepted answer. *Escaping* is a slightly ambiguous term, but this answer definitely addresses the spirit of *escaping*. Too bad it doesn't fit the OP's narrow conception of *escaping*, which arbitrarily requires the backslash character to be involved for some reason. – G-Wiz Jan 14 '11 at 16:36
  • I like that I can "get" this answer. –  Aug 13 '12 at 04:29
  • 6
    So in summary, escape `]]>` as `]]]]><![CDATA[>`. 5 times the length... wow. But then, it's an uncommon sequence. – Brilliand Mar 05 '13 at 16:50
  • 5
    Not only is the 5x length hilarious, it's not even an uncommon sequence in code, which is the main use case of CDATA! Assuming compressed JavaScript which removes spaces, you could be accessing a field by name from an array of names by index, such as "if(fields[fieldnames[0]]>3)" and now you have to change it to "if(fields[fieldnames[0]]]]><![CDATA[>3)", which defeats the purpose of using CDATA to make it more readable, LOL. I'd like to verbally slap whoever came up with the CDATA syntax. – Triynko Apr 17 '13 at 20:13
  • 1
    Escaping, or more correctly, quoting, means inserting some text in a context where the raw text has meaning WITHOUT leaving the context. It has nothing to do with backslashes. And this answer is not escaping or quoting since it produces two CDATA sections instead of one. – ddaa May 03 '13 at 12:56
  • 1
    The Wikipedia article for CDATA is actually really good - the issue of how to escape ]]> is answered by https://en.wikipedia.org/wiki/CDATA#Nesting -- and the unobvious (subtle but evil) issue of encoding is discussed too https://en.wikipedia.org/wiki/CDATA#Issues_with_encoding where problems can occur because CDATA can contain characters which are invalid for the XML encoding, but can't be converted to anything valid because they are within the CDATA section. – robocat Jun 29 '15 at 02:07
  • @Triynko: this is a good example. It would be enough to insert a single space: `if(fields[fieldnames[0]] >3)`, or two spaces around `>`, but this makes automatic JS minification harder. (`>` and `>>` operators?). – Tomasz Gandor Jul 01 '15 at 08:50
  • 2
    Those arguing about the meaning of "escape" are being pedantic. It's like saying you can't call `a=''` or `foo.com/bar%20gaz` escaping, just because though linguistically accurate, it's not the exact technical nomenclature. Yes there are multiple CDATA sections, and yes in rare cases this matters. But according to Oxford the broad definition in computing is to "cause subsequent character(s) to be interpreted differently". Which in this case and the cases mentioned, happens. – Beejor Oct 03 '16 at 03:41
151

You cannot escape a CDATA end sequence. Production rule 20 of the XML specification is quite clear:

[20]    CData      ::=      (Char* - (Char* ']]>' Char*))

EDIT: This production rule literally means "A CData section may contain anything you want BUT the sequence ']]>'. No exception.".

EDIT2: The same section also reads:

Within a CDATA section, only the CDEnd string is recognized as markup, so that left angle brackets and ampersands may occur in their literal form; they need not (and cannot) be escaped using "&lt;" and "&amp;". CDATA sections cannot nest.

In other words, it's not possible to use entity references, markup or any other form of interpreted syntax. The only parsed text inside a CDATA section is ]]>, and it terminates the section.

Hence, it is not possible to escape ]]> within a CDATA section.
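For readers who find the grammar notation hard to read, rule 20 can be restated as a predicate. This is only a sketch: the function name is hypothetical, and the character ranges are my transcription of the `Char` production from XML 1.0, so check them against the spec before relying on them.

```python
import re

# Char production of XML 1.0: tab, LF, CR, and the permitted Unicode ranges.
_XML_CHARS = re.compile(
    "[\t\n\r\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]*"
)


def is_valid_cdata_content(text: str) -> bool:
    """True if text matches Char* - (Char* ']]>' Char*)."""
    # Every character must be a legal XML Char, and ']]>' must not occur.
    return _XML_CHARS.fullmatch(text) is not None and "]]>" not in text


print(is_valid_cdata_content("if (a < b && c) {}"))   # markup characters are fine
print(is_valid_cdata_content("fields[names[0]]>3"))   # contains ']]>'
```

There is no branch for an escape sequence because the grammar does not define one: the content either avoids `]]>` or it is not a single CDATA section.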

EDIT3: The same section also reads:

2.7 CDATA Sections

[Definition: CDATA sections may occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognized as markup. CDATA sections begin with the string "<![CDATA[" and end with the string "]]>":]

So a CDATA section may occur anywhere character data may occur, which includes multiple adjacent CDATA sections in place of a single one. That makes it possible to split the ]]> token and put its two parts in adjacent CDATA sections.

ex:

<![CDATA[Certain tokens like ]]> can be difficult and <invalid>]]> 

should be written as

<![CDATA[Certain tokens like ]]]]><![CDATA[> can be difficult and <valid>]]> 
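
The two forms above can be checked with any conforming parser. A quick sketch using Python's standard `xml.etree.ElementTree`; the wrapping `<doc>` element is added only so that each snippet is a complete document.

```python
import xml.etree.ElementTree as ET

split = "<doc><![CDATA[Certain tokens like ]]]]><![CDATA[> can be difficult and <valid>]]></doc>"
naive = "<doc><![CDATA[Certain tokens like ]]> can be difficult and <invalid>]]></doc>"

# The split form parses, and the text comes back with ']]>' intact.
text = ET.fromstring(split).text
print(text)

# The naive form ends the section early, so the rest is parsed as markup.
try:
    ET.fromstring(naive)
    naive_parsed = True
except ET.ParseError as exc:
    naive_parsed = False
    print("rejected:", exc)
```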
andrewrk
ddaa
  • 2
    Indeed. Well, I'm not an academic type but as I said in the question, I'm just curious about this. To be honest, I'll just take your word on this, because I can barely make sense out of the syntax used for the rule. Thanks for your answer. – Juan Pablo Califano Oct 21 '08 at 23:17
  • 1
    It reads like this: Char* (the set of all character sequences) - (except) Char* ']]>' Char* (the set of all character sequences that include the substring ']]>'). – ddaa Oct 22 '08 at 09:12
  • Thanks for the extra clarification. I'm accepting your answer as the one that better addresses the question I asked. (S. Lott's answer provides a work-around, which is fine, although it doesn't specifically deal with an actual escape char or sequence. – Juan Pablo Califano Oct 22 '08 at 12:01
  • 52
    This is not an academic question. Think about an RSS feed of a blog post that contains a discussion about CDATA. – usr Jul 12 '11 at 15:05
  • 4
    I meant "academic" in the sense: "interesting to discuss, but without practical use". Generally, CDATA is not useful, it's just a way to serialize XML text, and it's semantically equivalent to escaping special chars using character entities &lt; &gt; and &quot;. Characters entities is the simplest, most robust and most general solution, so use that instead of CDATA sections. If you use a proper XML library (instead of building XML out of strings) you don't even have to think about it. – ddaa Jan 12 '12 at 10:26
  • 5
    I just got bitten by this one because I am trying to encode some compressed Javascript into a ` and my javascript includes just that sequence! I like the idea of splitting into multiple CDATA sections ... – NickZoic Mar 23 '12 at 01:11
  • If you were to add a CDATA code snippet in Sublime Text, it would require that you escape the ending sequence (configuration of Sublime is done almost exclusively through JSON and XML files). – Nick T Apr 24 '13 at 21:27
  • @NickT Instead of escaping the ending text in Sublime, you can do this: `]${1:Delete me then move along--required to escape CDATA end-tag}]>`. Tools > New Snippet... annoys me, because it prints the snippet template into a new file. I don't want it a new file, so I just duplicated the blank snippet text itself into another snippet file...hence the need. – aliteralmind Dec 05 '14 at 22:04
  • 6
    I experienced this in the real world. While reading the wikipedia dump and writing another xml file I encountered this on the page for the [National Transportation Safety Board](https://en.wikipedia.org/wiki/National_Transportation_Safety_Board). It contained _US$>100 million (2013)_ for the budget in the infobox. The source xml contained `[[United States dollar|US$]]&gt;100 million (2013)` which was translated to `[[United States dollar|US$]]>100 million (2013)` by the reader and the writer opted to use CDATA to escape the text and failed. – Paul Jackson Oct 15 '15 at 14:03
  • 1
    @ddaa re: `it's just a way to serialize XML text` or binary (unprintable) data. re: `Characters entities is the simplest, most robust and most general solution` for text that might confuse the XML parser, but if there are lots of them, it may be more space efficient to use CDATA. – Jesse Chisholm Nov 11 '15 at 18:01
  • re: `If you use a proper XML library` and a proper library will have methods for adding CDATA (printable or unprintable) which will deal with the escape for you, if it needs to. Using a proper library is definitely the way to go. – Jesse Chisholm Nov 11 '15 at 18:32
  • Re @jesse-chisholm: I am not sure what you are trying to say. CDATA might be more space efficient, but not in a way that should matter, since nobody should be transferring xml data that is not gzipped. After parsing, the memory usage should be the same. – ddaa Nov 16 '15 at 13:05
  • @ddaa: I was referring to the comment `Characters entities is the simplest, most robust and most general solution, so use that instead of CDATA sections. If you use a proper XML library (instead of building XML out of strings) you don't even have to think about it.` I was agreeing that using a proper library was better than building XML by hand, but disagreeing that entities are _always_ the most robust, because if you have _lots_ of them, then a CDATA is more efficient. Either way a `proper library` will handle it for you. And `gzip` makes the data binary which really needs CDATA. – Jesse Chisholm Nov 17 '15 at 14:31
  • 1
    so, the answer is obvious: ]]> must be replaced with: ]]>]]&gt;<![CDATA[, in other words: close the current CDATA, type a "normal" ]]> but escaping the closing > and then open another CDATA. This would do the trick. – Raul Luna May 30 '16 at 11:30
  • The answer is correct. CDATA sections do not escape content. I disagree whether this is academic though. If you are using XML format to store content in CDATA sections, then you can't store any XML content since it cannot tell the difference between content and markup. For this reason, the design of XML is broken. It fails the fundamental rule of parsing and delimiters: that you can embed delimiters in content with escaping. The design of CDATA breaks this rule. There are plenty other things wrong with XML as well, like how it's entitled to mess with whitespace in content. Use JSON. – Adrien Mar 22 '17 at 20:57
  • My point is that CDATA is useless in XML. It adds no expressiveness (everything you can do with CDATA you can do without it) and it provides an idiom that invites incorrect and fragile patterns: producing XML by string interpolation, and consuming XML without a proper parser. Therefore CDATA must be avoided. Therefore limitations in CDATA are "academic". – ddaa Mar 24 '17 at 06:00
  • This is quite nice when you are trying to remove character data from html – Jamisco Sep 19 '17 at 23:53
  • Good answer, though I'd actually call *escaping* replacing `]]>` with `]]]]><![CDATA[>`, which, as you demonstrated, works. – BenMorel Nov 19 '19 at 16:07
  • downvoted for claiming the question is "academic". edit: I edited the answer to not make this bad assertion and then un-down-voted. – andrewrk Aug 14 '23 at 00:15
23

simply replace ]]> with ]]]]><![CDATA[>

Thomas Grainger
17

You do not escape the ]]> as a whole; you escape the > after ]] by inserting ]]><![CDATA[ before the >. Think of this as being like a \ in a C/Java/PHP/Perl string, but one that is only needed before a > and after a ]].

BTW,

S.Lott's answer is the same as this, just worded differently.

Jason Pyeron
  • 3
    This way of saying it gives people the wrong idea. This is **not** escaping. `]]]]><![CDATA[>` isn't some magical sequence for `]]>`. `]]]]>` has `]]` characters as data, and `]]>` ends the current CDATA section. `<![CDATA[>` starts a new CDATA section and puts `>` in it. They are actually two different elements and will be treated differently when working with a DOM parser. You should be aware of that. This way of doing it is similar to `]]]><![CDATA[]>`, except it puts `]` in the first and `]>` in the second CDATA. The difference remains. – Aidiakapi Apr 11 '13 at 12:24
  • 1
    The difference is overstated, since CDATA content is treated as a literal span of escaped text. Only when messing with the DOM does it really matter, and at that level you're dealing with other invisible boundaries anyway like text, comment, and processing instruction nodes. – Beejor Oct 03 '16 at 03:26
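
Both points in the comments above can be seen with a DOM parser. A small sketch using Python's `xml.dom.minidom`, which (as far as I can tell) keeps CDATA sections as separate nodes; other DOM implementations may merge or expose them differently.

```python
from xml.dom import minidom

doc = minidom.parseString("<doc><![CDATA[a]]]]><![CDATA[>b]]></doc>")
nodes = doc.documentElement.childNodes

# The DOM keeps one node per CDATA section, so the split is visible here.
pieces = [node.data for node in nodes]
all_cdata = all(node.nodeType == node.CDATA_SECTION_NODE for node in nodes)
print(pieces, all_cdata)

# Concatenating the pieces gives the text the author meant to store.
joined = "".join(pieces)
print(joined)
```

So code that walks child nodes has to concatenate them, while code that only asks for an element's text content usually does not notice the split.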
7

S. Lott's answer is right: you don't encode the end tag, you break it across multiple CDATA sections.

How to run across this problem in the real world: using an XML editor to create an XML document that will be fed into a content-management system, try to write an article about CDATA sections. Your ordinary trick of embedding code samples in a CDATA section will fail you here. You can imagine how I learned this.

But under most circumstances, you won't encounter this, and here's why: if you want to store (say) the text of an XML document as the content of an XML element, you'll probably use a DOM method, e.g.:

XmlElement elm = doc.CreateElement("foo");
elm.InnerText = "<![CDATA[Is this a problem?]]>";

And the DOM quite reasonably escapes the < and the >, which means that you haven't inadvertently embedded a CDATA section in your document.
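
The same behaviour is easy to reproduce outside .NET. A sketch with Python's `xml.etree.ElementTree` (a different API from the `XmlElement` above, used here only to illustrate the point): assigning the string as text makes the serializer escape it, so no CDATA section is created.

```python
import xml.etree.ElementTree as ET

elem = ET.Element("foo")
elem.text = "<![CDATA[Is this a problem?]]>"

# The serializer escapes '<' and '>' in text content.
serialized = ET.tostring(elem, encoding="unicode")
print(serialized)  # prints <foo>&lt;![CDATA[Is this a problem?]]&gt;</foo>

# Parsing it back returns the original string unchanged.
round_tripped = ET.fromstring(serialized).text
print(round_tripped == elem.text)  # prints True
```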

Oh, and this is interesting:

XmlDocument doc = new XmlDocument();

XmlElement elm = doc.CreateElement("doc");
doc.AppendChild(elm);

string data = "<![CDATA[This is an embedded CDATA section]]>";
XmlCDataSection cdata = doc.CreateCDataSection(data);
elm.AppendChild(cdata);

This is probably an idiosyncrasy of the .NET DOM, but that doesn't throw an exception. The exception gets thrown here:

Console.Write(doc.OuterXml);

I'd guess that what's happening under the hood is that the XmlDocument is using an XmlWriter to produce its output, and the XmlWriter checks for well-formedness as it writes.
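
For comparison, Python's `xml.dom.minidom` appears to behave the same way: creating the CDATA node is accepted, and the check for `]]>` only happens at serialization time. This is a sketch based on my reading of the standard library, not a statement about how .NET implements it.

```python
from xml.dom import minidom

doc = minidom.Document()
root = doc.createElement("doc")
doc.appendChild(root)

# Creating and attaching the node does not complain about the ']]>' inside.
cdata = doc.createCDATASection("<![CDATA[This is an embedded CDATA section]]>")
root.appendChild(cdata)

# Serializing is where the invalid content is detected.
try:
    doc.toxml()
    serialization_error = None
except ValueError as exc:
    serialization_error = str(exc)

print(serialization_error)
```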

Robert Rossney
  • Well, I had an almost "real world" example. I usually load Xml from Flash that contains html markup within CDATA sections. Having a way to escape it could be useful, I guess. But anyway, in that case, the CDATA content is usually valid XHTML, and so the "outer" CDATA could be avoided altogether. – Juan Pablo Califano Oct 22 '08 at 00:18
  • 2
    CDATA can nearly always be avoided altogether. I find that people who struggle with CDATA very frequently don't understand what they're really trying to do and/or how the technology they're using really works. – Robert Rossney Oct 24 '08 at 08:44
  • Oh, I should also add that the only reason that the CMS I alluded to in my answer used CDATA was that I wrote it, and I didn't understand what I was really trying to do and/or how the technology works. I didn't need to use CDATA. – Robert Rossney Oct 24 '08 at 08:48
  • If you're using .net, the preceding comment about CDATA being avoidable is spot on - just write the content as a string and the framework will do all the escaping (and unescaping on read) for you from the real world....... xmlStream.WriteStartElement("UnprocessedHtml"); xmlStream.WriteString(UnprocessedHtml); xmlStream.WriteEndElement(); – Mark Mullin Aug 08 '10 at 15:28
3

Here's another case in which ]]> needs to be escaped. Suppose we need to save a perfectly valid HTML document inside a CDATA block of an XML document and the HTML source happens to have its own CDATA block. For example:

<htmlSource><![CDATA[ 
    ... html ...
    <script type="text/javascript">
        /* <![CDATA[ */
        -- some working javascript --
        /* ]]> */
    </script>
    ... html ...
]]></htmlSource>

the commented CDATA suffix needs to be changed to:

        /* ]]]]><![CDATA[> */

since an XML parser isn't going to know how to handle javascript comment blocks
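
A short Python sketch of this case, assuming the outer document is built by string concatenation. `to_cdata` is a hypothetical helper and the `<htmlSource>` element is taken from the example above.

```python
import xml.etree.ElementTree as ET

html = """<script type="text/javascript">
/* <![CDATA[ */
var ok = (1 < 2);
/* ]]> */
</script>"""


def to_cdata(text: str) -> str:
    # Hypothetical helper: split each ']]>' so it never appears whole.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


wrapped = "<htmlSource>" + to_cdata(html) + "</htmlSource>"
print(wrapped)

# The inner '<![CDATA[' is plain text to the parser; only ']]>' needed care.
recovered = ET.fromstring(wrapped).text
print(recovered == html)
```

The inner opening token needs no treatment at all, because CDATA sections do not nest and `<![CDATA[` has no meaning inside one.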

  • This is not a special case. Simply replace `]]>` with `]]]]><![CDATA[>` still applies here. The fact that it's JavaScript, or commented is not important. – Thomas Grainger Jun 24 '16 at 11:43
1

In PHP: '<![CDATA['.implode(']]]]><![CDATA[>', explode(']]>', $string)).']]>'

1

A cleaner way in PHP:

   function safeCData($string)
   {
      return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $string) . ']]>';
   }

Don't forget to use a multibyte-safe str_replace if required (non latin1 $string):

   function mb_str_replace($search, $replace, $subject, &$count = 0)
   {
      if (!is_array($subject))
      {
         $searches = is_array($search) ? array_values($search) : array ($search);
         $replacements = is_array($replace) ? array_values($replace) : array ($replace);
         $replacements = array_pad($replacements, count($searches), '');
         foreach ($searches as $key => $search)
         {
            $parts = mb_split(preg_quote($search), $subject);
            $count += count($parts) - 1;
            $subject = implode($replacements[$key], $parts);
         }
      }
      else
      {
         foreach ($subject as $key => $value)
         {
            $subject[$key] = mb_str_replace($search, $replace, $value, $count);
         }
      }
      return $subject;
   }
Alain Tiemblo
0

I'd just like to add that it also works if you break the CDATA end tag ]]> between the ]], like this: ] ]]><![CDATA[ ]>

ex.

<![CDATA[Certain tokens like ]]]><![CDATA[]> can be difficult and <valid> but <unconventional>]]> 

However, it is the globally accepted convention to break the ]]> before the > as shown in the other answers here.

<![CDATA[Certain tokens like ]]]]><![CDATA[> can be difficult and <valid> and <conventional>]]> 
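
A quick way to confirm that both split points are equivalent to a parser, sketched with Python's `xml.etree.ElementTree` (the `<doc>` wrapper and the shortened text are only there to make complete documents):

```python
import xml.etree.ElementTree as ET

# Split before the '>' (the usual convention).
before_gt = "<doc><![CDATA[tokens like ]]]]><![CDATA[> here]]></doc>"
# Split between the two ']' characters.
between_brackets = "<doc><![CDATA[tokens like ]]]><![CDATA[]> here]]></doc>"

text_a = ET.fromstring(before_gt).text
text_b = ET.fromstring(between_brackets).text
print(text_a)
print(text_a == text_b)
```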
MrWatson
-2

See this structure:

<![CDATA[
   <![CDATA[
      <div>Hello World</div>
   ]]]]><![CDATA[>
]]>

For the inner CDATA tag(s) you must close with ]]]]><![CDATA[> instead of ]]>. Simple as that.

2Yootz