Are CDATA sections really unnecessary?

Question

This question is prompted by the rather militant refusal of developer Michael Rys to include the parsing of CDATA sections into FOR XML PATH because "There is no semantic difference in the data that you store."

I have stored nuggets of HTML in CDATA nodes and other content that requires the use of special or awkward characters. However I don't feel qualified to challenge Rys's controversial assertion because, I suppose, technically he is correct in the scenarios where I've employed CDATA for convenience.

What's really baking my noodle is that, as developers take to the internet begging for advice on how to render CDATA segments using FOR XML PATH, respondents continually direct them to use FOR XML EXPLICIT instead, the XML rendering method Rys cited as being the "query from hell".

If we can really do without CDATA in every use case that anyone can suggest, I guess we should stop moaning and reject CDATA usage henceforth. But if there are clearly defined cases where CDATA is essential, Rys already undertook, in the topmost link in this question, to bake it into FOR XML PATH going forward.

So which is it to be? Are CDATA sections really relics of the past? Or should Rys pull his finger out and allow for CDATA parsing in FOR XML PATH? And while we're at it, in the meanwhile, are there any hacks for getting FOR XML PATH to return CDATA sections?

score 3 · Answer 1 · answered Dec 01 '10 at 12:30

CDATA sections are unnecessary. They're not a "relic of the past" because they've always been unnecessary.

This does not mean they aren't useful. Look at just about any programming language or library and you can find a large number of things you could do without because they are semantically equivalent to something else, but which are useful if there's a human being sitting there having to write the stuff.

For that matter, even with programmatic production it's also handy that one could take the opposite approach and use CDATA sections for every single piece of c-data (bloaty, but it could have efficiency gains elsewhere).

FOR XML PATH does not involve a human being sitting there having to write the stuff. It's a means of producing valid XML from the results of an SQL query. (It's also not a matter of parsing CDATA sections, but of producing them, which is a different matter.)

And you can't really complain about FOR XML EXPLICIT being the alternative when you want really fine control - the reason FOR XML EXPLICIT is so nasty to use sometimes is precisely because it gives you really fine control. Indeed, consider if they first added support for CDATA sections and then added support for every other tweak and configuration option that seemed just as vital to someone else out there. How long would it take before FOR XML EXPLICIT was the automatic choice due to it being more straightforward than FOR XML PATH‽

There are four cases where CDATA sections are useful:

1. You're sitting at a keyboard typing this stuff in yourself.
2. You are dealing with a mix of different technologies with different standards, designed at different times, which will be interpreted by different parsers in different ways (e.g. JavaScript embedded in XHTML; CDATA is not 100% necessary there, but it's a nightmare to do otherwise).
3. You're trying to parse the XML with something that doesn't understand XML.
4. You're trying to use something built on a parser that allows low-level access that distinguishes between CDATA sections and other character data, and you're using that low-level access inappropriately.

Funnily enough, these four cases are also the four cases where a ban on accepting CDATA sections can make sense.

Case 1 doesn't apply here; the XML isn't human-written. Case 2 could apply here if you are doing something really crazy. Frankly, the lack of CDATA sections is the least of your worries in that situation; switch to producing simpler XML in the query and transforming it elsewhere. Case 3 could apply here, but it's not fair to complain to the SQL people if it does, when you should complain to the broken XML parser that doesn't treat &lt;example&gt; the same as <![CDATA[<example>]]>. Case 4 could apply here, but again complain to the person who wrote the buggy code, not the SQL people.
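
The Case 3 point, that escaped text and a CDATA section are the same thing to a conforming parser, can be checked with a small sketch. It uses Python's standard-library `xml.etree.ElementTree`, which is my choice for illustration and not something the answer names; the `<note>` element is made up.

```python
import xml.etree.ElementTree as ET

# Two spellings of the same character data; <note> is a made-up element name.
escaped_doc = "<note>&lt;example&gt;</note>"
cdata_doc = "<note><![CDATA[<example>]]></note>"

escaped_text = ET.fromstring(escaped_doc).text
cdata_text = ET.fromstring(cdata_doc).text

# Once parsed, both documents carry the identical string.
print(escaped_text == cdata_text)
```

A consumer written against this kind of API never learns which spelling the producer used, which is why the producer is free to pick either.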

An interesting point of view and I understand where you're coming from. To give some context I have received a request from our front end development team to rework some elements of a bespoke client facing API that presents some of our information as an XML schema. Some of the data are the kind that a grassroots kind of AJAX/HTML/CSS/ECMAScript type designer would traditionally stuff into CDATA. I am willing to say it's not possible but I just wanted to know I wasn't being unreasonable in saying that. — One Monkey, Dec 01 '10 at 14:14
It's not possible that way (you *can* use EXPLICIT), but it's also not necessary as it'll look the same to the tools AJAX type coders would use for parsing (XHR), even if it would be less pleasant if they had to write it that way in their text editors. Another way of looking at it - if you're glad that the XML standard made things convenient to you in one way, why should you complain that someone else made use of the different way the XML standard made things convenient to them? — Jon Hanna, Dec 01 '10 at 14:26
Having had a good think about this, I think that, although this is all perfectly logical in theory, reality throws a spanner in the works. Essentially, the SQL I'm writing has many nodes in it that are fine in FOR XML PATH; there are only a couple of contentious ones. The SQL needs to be readily readable by others and easily modified, and it provides a tiny but essential part of our web app's functionality. The time cost of maintaining FOR XML EXPLICIT queries on it is not justified. But you can't stop an end user writing angle brackets into a free text field, which could cause problems. — One Monkey, Dec 02 '10 at 10:35
Why would that cause problems? If it gets put into a database field then the query will escape it one way or another (i.e. either CDATA sections or `&lt;` etc. - if it did neither then *that* would be a flaw). — Jon Hanna, Dec 02 '10 at 10:58
I suppose so. The problem with healthy paranoia is that it sometimes makes you antsy in illogical ways. I know you're right but I can't help wondering... so I guess we have found CDATA's true purpose, an overanxious developer's security blanket. — One Monkey, Dec 02 '10 at 11:56
It's always worth looking for issues before they happen. In this case though, it's worth noting that CDATA sections simply don't exist at a certain level of abstraction. Now, if something reading XML that is meant to get us to that level won't parse them - that *is* an issue - but if something producing XML doesn't, it doesn't matter at the level of abstraction we should be worrying about. — Jon Hanna, Dec 02 '10 at 12:01
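
The worry in this comment thread, free text containing angle brackets, can be illustrated with a short sketch. It is not SQL Server: it uses Python's standard-library `xml.etree.ElementTree` as a stand-in XML producer, and the `comment` element and the sample input are hypothetical. The point is only that a producer which escapes character data gives a consumer back exactly what the user typed, with no CDATA section involved.

```python
import xml.etree.ElementTree as ET

# Hypothetical free-text input containing markup-like characters.
user_input = "<b>5 < 6</b> & so on"

field = ET.Element("comment")  # element name is made up for this sketch
field.text = user_input

# The serializer escapes the text rather than emitting a CDATA section.
serialized = ET.tostring(field, encoding="unicode")

# Parsing the result recovers the original string unchanged.
round_tripped = ET.fromstring(serialized).text

print(serialized)
print(round_tripped == user_input)
```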

score 2 · Accepted Answer · edited Jun 20 '20 at 09:12

CDATA sections are useful if you don't care about the semantics of the data in them (i.e. you do not need to parse it - it is simply a run of characters), and you don't wish to escape any of the XML within them.

The definition, according to the W3C:

CDATA sections may occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognized as markup.

From wikipedia:

New authors of XML documents often misunderstand the purpose of a CDATA section, mistakenly believing that its purpose is to "protect" data from being treated as ordinary character data during processing. Some APIs for working with XML documents do offer options for independent access to CDATA sections, but such options exist above and beyond the normal requirements of XML processing systems, and still do not change the implicit meaning of the data. Character data is character data, regardless of whether it is expressed via a CDATA section or ordinary markup.
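
As a rough illustration of the "independent access" mentioned in the quote, here is a sketch using Python's standard-library `xml.dom.minidom` (my choice of API, not one the quoted text names; `<note>` is a made-up element). The DOM reports a different node type for each spelling, yet the character data is the same.

```python
from xml.dom import Node, minidom

escaped = minidom.parseString("<note>&lt;example&gt;</note>")
cdata = minidom.parseString("<note><![CDATA[<example>]]></note>")

escaped_child = escaped.documentElement.firstChild
cdata_child = cdata.documentElement.firstChild

# The DOM exposes the lexical difference as a node type...
is_text = escaped_child.nodeType == Node.TEXT_NODE
is_cdata = cdata_child.nodeType == Node.CDATA_SECTION_NODE

# ...but the character data itself is identical.
same_data = escaped_child.data == cdata_child.data

print(is_text, is_cdata, same_data)
```

Code that branches on the node type instead of reading the data is relying on a detail the XML specification does not treat as meaningful.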

CDATA sections are useful for writing XML code as text data within an XML document. For example, if one wishes to typeset a book with XSL explaining the use of an XML application, the XML markup to appear in the book itself will be written in the source file in a CDATA section. However, a CDATA section cannot contain the string "]]>" and therefore it is not possible for a CDATA section to contain nested CDATA sections. The preferred approach to using CDATA sections for encoding text that contains the triad "]]>" is to use multiple CDATA sections by splitting each occurrence of the triad just before the ">". For example, to encode "]]>" one would write:

<![CDATA[]]]]><![CDATA[>]]>
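
The splitting rule described in the quote can be sketched as a small helper. The function name `to_cdata` and the `<x>` wrapper element are hypothetical; parsing uses Python's standard-library `xml.etree.ElementTree` only to confirm that the split sections read back as a single string.

```python
import xml.etree.ElementTree as ET


def to_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting every "]]>" just before the ">"."""
    # Each "]]>" becomes "]]" (end of one section) + "]]>" terminator,
    # followed by a new section that starts with the ">".
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


sample = "a]]>b"
wrapped = to_cdata(sample)

# Adjacent CDATA sections are delivered as one run of character data.
parsed = ET.fromstring("<x>" + wrapped + "</x>").text

print(wrapped)
print(parsed == sample)
```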

@One Monkey - I would add that in _most_ cases, Michael Rys is right. `CDATA` sections are normally mis-used. If you need to query it, it shouldn't be in `CDATA`. — Oded, Dec 01 '10 at 11:49
This is a genuine question I'm happy to go with whatever is best. I'm genuinely curious to know what people think or what they can come up with. — One Monkey, Dec 01 '10 at 11:53
@One Monkey - I don't doubt the question or your intentions. My answer was given in good faith... — Oded, Dec 01 '10 at 11:54

score 1 · Answer 3 · answered Feb 22 '13 at 22:50

It is interesting to see how someone can just throw away a very valuable piece of the standard with such a whimsical approach. Not everyone is using XML for a few hundred characters of HTML or a list of items for a drop-down.

Some of us are actually using XML to exchange data, very complex data like a CCD, CDA or CDR. These are all standard document formats in the healthcare arena and are becoming more and more prominent with ObamaCare. Part of the structure of these documents contains attachments, things like DICOM images, PDFs and other binary data that should not be read by the parser, which is the reason the CDATA definition exists.

Why should I pay the overhead of the parser reading a 3 megabyte DICOM image embedded in a CCD document? Why should I be forced to separate the document when it came in the original data and is part of the XML standard? And I want to be able to locate and recover the document and its contents with XML.

It bewilders me why you all would support parsing data that is intended not to be parsed by the engine. If the engine sees CDATA, it should ignore it; it is very simple. And the continued argument that some do not need it is irrelevant. It is part of the standard and the standard should be maintained. If they would like to add a "feature", as it has been called, then they should support the default behavior with an option.

Please stop parsing CDATA and ignore it.

score 0 · Answer 4 · answered Dec 01 '10 at 12:11

You are absolutely right, CDATA sections are essential in many scenarios; they're part of the XML standard and should be supported by every XML manipulation tool/method. But the thing is that MS usually doesn't care... you know, the "640kB should be enough for everyone" kind of approach.

Edit: About FOR XML EXPLICIT - this is THE best method for generating precisely formatted XML data. Yes, the syntax is kinda painful to look at and confusing, but once you use it a few times, you'll admire its beauty and power.

Pavel Urbančík

They most certainly should not "be supported by every XML manipulation tool/method" when such tools are producing rather than parsing. This is like saying that we shouldn't be allowed to write C# programs that don't use reflection, since it's in the standard, or write internet software that doesn't do broadcasts, since that's in the TCP/IP standard. The **parser** must process CDATA, the producer is free to do so or not as suits. Indeed the fact that the parser must process both CDATA and other character content is a promise given by the XML standard that means the producer is free in this way – Jon Hanna Dec 01 '10 at 12:38
