PCDATA vs CDATA in XML DTD

Question

In an XML DTD, when defining an element we use #PCDATA to say that the element can contain any parseable text. When defining an attribute, we use CDATA to say that its value can be any character data.

A CDATA section, as used in XML, is something the XML parser does not parse (it is a multi-character escape sequence). To be consistent, an attribute declared as CDATA should not be parsed either. But it is!

So why couldn't PCDATA have been used in place of CDATA for declaring attributes?

Update: it has been kept this way for backward compatibility with SGML. What is the reasoning behind this naming in SGML?

Possible duplicate of http://stackoverflow.com/questions/918450/difference-between-pcdata-and-cdata-in-dtd – Mathias Müller, Dec 09 '13 at 15:42
This one is based on the conclusion of the question you mention. – nikel, Dec 09 '13 at 16:24
How are you using CDATA for an attribute? This should not be possible. http://stackoverflow.com/q/359280/231316 – Chris Haas, Dec 09 '13 at 16:38
I meant the case where CDATA is used to define the type of an attribute, not the CDATA section. – nikel, Dec 09 '13 at 16:48

Javier · Answer 1 · 2013-12-23T15:09:19.570

When used as the declared value of an attribute, CDATA refers to the actual value of the attribute (character data), not to the context in which it is parsed. When parsing element content, on the other hand, we need a distinction between character data with no markup (CDATA) and parsed character data where delimiters are expected (PCDATA).

At first glance this seems arbitrary, but it is not (see here and here).

In SGML, an attribute value specification may either be quoted (attribute value literal) or unquoted (attribute value).

attribute value specification = attribute value literal | attribute value

When the attribute value is unquoted, only NAME characters are allowed, and that may be further restricted for some declared values such as NUMBER.

The content of an attribute value literal, on the other hand, is a sequence of replaceable character data surrounded by LIT/LITA delimiters (double and single quotes, respectively, in the reference concrete syntax).

attribute value literal =
   ( LIT , replaceable character data *, LIT) | 
   ( LITA , replaceable character data *, LITA)

Replaceable character data is "like CDATA except that entity references and character references are recognized" (Goldfarb, the SGML Handbook).

It follows that the replacement of entity references in attribute value literals does not depend on the declared value of the attribute. Therefore, if you have <!ENTITY foo "bar"> and <elem attr="&foo;">, the entity reference &foo; will be parsed in the context of replaceable character data (LIT recognition mode), yielding <elem attr=bar>. It does not matter whether attr is declared as CDATA, NAME or anything else.
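The same behaviour can be observed in XML. Below is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` (a non-validating, expat-based parser). The helper `attr_value` is hypothetical, and `NMTOKEN` stands in for SGML's `NAME`, which XML does not have.

```python
import xml.etree.ElementTree as ET


def attr_value(declared_type: str) -> str:
    """Parse <elem attr="&foo;"/> with attr declared as the given type.

    Hypothetical helper for illustration; only the declared type varies.
    """
    doc = f"""<!DOCTYPE elem [
  <!ELEMENT elem EMPTY>
  <!ATTLIST elem attr {declared_type} #IMPLIED>
  <!ENTITY foo "bar">
]>
<elem attr="&foo;"/>"""
    return ET.fromstring(doc).get("attr")


# The entity reference is replaced whatever the declared type is.
cdata_value = attr_value("CDATA")
nmtoken_value = attr_value("NMTOKEN")
print(cdata_value, nmtoken_value)
```

In both cases the application sees the replacement text of `foo`, not the literal `&foo;`.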

Update

There is no need to say that entities in an attribute have to be parsed, because all attribute types have the same parsing rules: if the attribute value starts with a quote (LIT), then entities are recognized (replaceable character data) and the value ends when a matching end-quote is found.

Here CDATA means that a valid attribute value may contain arbitrary character data after entities are expanded. Had the attribute been declared as NUMBER, it would have been required to contain only numeric characters (or entities that expand to numeric characters).

In the example above, the CDATA attribute with value "&foo;" is equivalent to "bar", in the same way that a NUMBER attribute with value "&#48;" is equivalent to "0" (even though the sequence "&#48;" contains characters other than numeric).

My point is that since CDATA already has a meaning in the context of parsing, why not use a new name for attribute definitions? This second use of CDATA (in attribute definitions) seems to conflict with its use in the first situation (element content: CDATA sections are not parsed). – nikel, Dec 23 '13 at 13:29
I understand your point. The question of why the standard uses the same keyword in both places should be put to the SGML authors. We, mere mortals, can only elaborate on how this choice is consistent with other uses of CDATA. – Javier, Dec 23 '13 at 15:12
That's the reason I asked; maybe there's something I don't know that explains the consistency. – nikel, Dec 25 '13 at 10:20
IMHO the consistency of such naming is explained in my answer above. – Javier, Dec 26 '13 at 15:00
I understand that CDATA means something different for attributes. The question was about using the same word in both places. Perhaps it would have been best if CDATA sections had instead been called NPCDATA sections (non-parsed character data sections). – nikel, Jan 02 '14 at 17:59
@nikel - If you are going to suggest a term replacement, then `NPCDATA` is too much typing, as `NP` is implied by the lack of a `P`. :) Better to suggest that the attribute declaration use the keyword `RCDATA` to stand for 'Replaceable character data' as Goldfarb defines it. But still, it is way too late to suggest even this to the SGML people who could have done something about it. – Jesse Chisholm, Aug 07 '15 at 20:42

Daniel Haley · Answer 2 · 2014-01-02T18:20:10.937

A CDATA section, like you would use in an element, is different from the CDATA attribute type.

The parsing that you are most likely observing (like entity references being resolved) is from attribute-value normalization.
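To make the difference concrete, here is a minimal sketch, again assuming Python's standard `xml.etree.ElementTree`. The document, the entity `foo` and the values are made up for illustration. The attribute is declared as CDATA and still goes through attribute-value normalization, while the CDATA section is passed through untouched.

```python
import xml.etree.ElementTree as ET

# "\t" below becomes a literal tab character inside the attribute value.
doc = """<!DOCTYPE elem [
  <!ELEMENT elem (#PCDATA)>
  <!ATTLIST elem attr CDATA #IMPLIED>
  <!ENTITY foo "bar">
]>
<elem attr="&foo;\tbaz"><![CDATA[&foo; <b>]]></elem>"""

root = ET.fromstring(doc)

# CDATA attribute type: the entity reference is replaced and the literal
# tab is normalized to a single space.
attr = root.get("attr")

# CDATA section: no entity or markup recognition, the text is kept as is.
text = root.text

print(repr(attr))
print(repr(text))
```

So the keyword is the same, but only the CDATA section suppresses recognition of `&` and `<`.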

edited Jan 02 '14 at 18:20

answered Dec 09 '13 at 16:40


This seems kind of ambiguous to me. The way this CDATA attribute type works is like the PCDATA type for element definition in a DTD. Why was the same name CDATA used, would not PCDATA have been better? – nikel Dec 09 '13 at 16:50
@nikel - I don't know why `CDATA` was used for attributes instead of `#PCDATA`. If there are differences, I'm not sure what they are. – Daniel Haley Dec 09 '13 at 17:37
I think of `PCDATA` as something that modifies the document's actual structure whereas `CDATA` is arbitrary text. Using this definition I think attributes as `CDATA` makes sense. Attributes and sections have different rules for escaping things within their `CDATA` but they both ultimately represent a string that doesn't change the structure (except for existing in the first place). – Chris Haas Dec 09 '13 at 17:48
What exactly do you mean by "changing the actual structure of a document"? – nikel Dec 10 '13 at 15:16
@nikel - Please add another answer instead of editing mine. Your edit is a completely different answer. – Daniel Haley Jan 02 '14 at 18:22
@DanielHaley - re: attributes with `#PCDATA` - this would imply you were allowed to have full markup (child elements) in an attribute value, and that is not allowed. I agree it was strange of the SGML authors to choose `CDATA` (instead of `RCDATA` for `Replaceable CDATA` which is how Goldfarb defined it) for attribute declarations. But they did, and we must now live with it. Fortunately, it only hurts your brain for about a day, then you figure it out and move on, living with a bit of context sensitive ambiguity in your life. :) – Jesse Chisholm Aug 07 '15 at 20:48
@nikel - re: Chris Haas' comment about `changing the structure` -- I am guessing he means `adding more nodes to the DOM`. An attribute is a node, but its value cannot add any other nodes. A CDATA section is a node, but its contents cannot add any other nodes. Whereas PCDATA is capable of adding an arbitrary number of nodes to the DOM. – Jesse Chisholm Aug 07 '15 at 20:54
