If we have the following XML element:

<x>a &lt; b</x>

and another one:

<y>a<![CDATA[ < ]]>b</y>

Do both elements x and y have the value of a < b? Is the second example valid, common, recommended or something like that?

AFAIK y has three child nodes (PCDATA a, CDATA < and PCDATA b), and some libraries parse it exactly like that. On the other hand, https://pugixml.org/ for one returns only a as the value of y (helper function).

Doncho Gunchev

1 Answer

There is a fundamental difference between the two:

CDATA means Character Data, while PCDATA means Parsed Character Data, which already hints at why parsers may behave differently, depending on their conformance level.

A CDATA section is a strict, pure escape of everything between the <![CDATA[ and ]]> delimiters. Nothing written in between is supposed to be interpreted as markup by the XML processor. A conforming XML parser does not parse the content; it passes it through, as is, to whatever application requested the XML (which is then free to process it itself). This is why we can place almost any character data in there without the XML becoming malformed; the only sequence that cannot appear is ]]>, which ends the section.
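To illustrate the pass-through behaviour, here is a minimal sketch using Python's standard `xml.etree.ElementTree` (my choice of tool, not something the question uses; the sample element is made up). Markup-like characters inside the CDATA section reach the application as literal text:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: markup-like characters inside a CDATA section.
cdata_elem = ET.fromstring("<x><![CDATA[<b>&amp;</b>]]></x>")

# The content arrives as literal text: no <b> child element is created
# and &amp; is not resolved to &.
print(cdata_elem.text)   # the raw characters between the CDATA delimiters
print(len(cdata_elem))   # number of child elements
```

Other parsers may expose the section as a separate node type instead of plain text, but the characters themselves should be the same.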

&lt; is an entity reference, more specifically a reference to one of XML's five predefined entities. Entities are 'placeholders' or 'markers' that get substituted by content. This means that an entity reference is part of PCDATA (Parsed Character Data): the XML parser parses it, resolves it, and substitutes its replacement text, here the character <.
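As a quick check of the two forms from the question, the following sketch parses both with Python's standard `xml.etree.ElementTree` (an assumption on my side; any conforming parser should agree on the characters). This particular API merges text and CDATA into a single string:

```python
import xml.etree.ElementTree as ET

# The two elements from the question.
x = ET.fromstring("<x>a &lt; b</x>")
y = ET.fromstring("<y>a<![CDATA[ < ]]>b</y>")

# ElementTree resolves the entity reference and folds the CDATA
# section into the element's text, so both carry the same characters.
print(repr(x.text))
print(repr(y.text))
print(x.text == y.text)
```

Whether a library hands you one string or several nodes is an API design choice; the character content is identical either way.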

As for the value of the data, we may need to know more about the application that requests the XML. Within the domain of XML processing tools (XSD, XSLT, XPath, XQuery, etc.), it should come out, in both cases, as one of the XPath types text(), xs:string or xs:untypedAtomic, depending on which function you used to access it. For example:

let $t := <xml>Text <![CDATA[test]]> bla.</xml>
return $t/data() instance of xs:untypedAtomic

let $t := <xml>Text <![CDATA[test]]> bla.</xml>
return $t/string() instance of xs:string

let $t := <xml>Text <![CDATA[test]]> bla.</xml>
return $t/text() instance of text()

all result in true.

For any application that is not working with the XML Data Model, however, the result should simply be the text that was between the element tags.
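For a DOM-style API that keeps CDATA sections as separate nodes, "the text between the tags" has to be assembled by hand. Below is a minimal sketch with Python's standard `xml.dom.minidom`, used here only as a stand-in for such a library; `text_value` is a hypothetical helper name, not part of any library:

```python
from xml.dom import minidom
from xml.dom.minidom import Node

def text_value(element):
    """Join all direct text and CDATA children into one string (hypothetical helper)."""
    parts = []
    for node in element.childNodes:
        if node.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
            parts.append(node.data)
    return "".join(parts)

doc = minidom.parseString("<y>a<![CDATA[ < ]]>b</y>")
y_elem = doc.documentElement

# minidom keeps the CDATA section as its own node between two text nodes.
print([node.nodeName for node in y_elem.childNodes])

# Taking only the first child loses most of the content ...
print(repr(y_elem.firstChild.data))

# ... while joining all text-like children gives the full value.
print(repr(text_value(y_elem)))
```

The helper only looks at direct children; mixed content with nested elements would need a recursive walk, which is left out here.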

There is some interesting note here and a whole thread concerning this, and related, topics.

amix
  • I get the difference between `CDATA` and `PCDATA`, my question is more about how to handle this - should I write a helper function that joins all the `CDATA` and `PCDATA` children of an element to a big string or just use the first one (like pugixml, which would return `a` as value of `y`) or reject the element as "invalid" or "unsupported" or something. – Doncho Gunchev Nov 01 '20 at 14:42
  • For anything outside of the XML domain, both express the same: `a < b`. Since I do not understand what your desired result is (what you do with the data you receive from the XML), I cannot recommend anything. Though, in ordinary cases I would just join the string values. – amix Nov 02 '20 at 09:41