How should an Atom feed parser handle the following line of XML in a feed:
<title type="html"><![CDATA[Johnson &amp; Johnson]]></title>
For the sake of the discussion, let's assume that the originally intended text was in fact Johnson & Johnson. I came across an online discussion about this issue, and there seemed to be two different opinions:
1. Opinion #1 claims that this content is double-encoded: the text "Johnson & Johnson" has been entity-escaped and then encoded again by being wrapped in a CDATA section. Its proponent states that a well-behaved XML parser will return `Johnson &amp; Johnson`, because this is how the XML spec says CDATA-encoded data should be handled.
2. Opinion #2 claims that the Atom spec takes precedence. Its proponent states that the CDATA section acts as a passthrough: `Johnson &amp; Johnson` comes out as `Johnson &amp; Johnson`. If this were just an XML document, it would end there. However, because it is Atom, we must then look at the Atom spec to determine the proper behavior. The Atom spec states that any element with `type="html"` contains entity-escaped HTML. Therefore, we should be free to decode it.
Which of these is factually correct? Should a proper Atom XML parser produce `Johnson &amp; Johnson` or `Johnson & Johnson` given this particular situation?
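To make the two candidate outputs concrete, here is a minimal sketch in Python using only the standard library. It assumes the CDATA section contains the literal text `Johnson &amp; Johnson`; the function name `read_title` and the one-element sample feed are hypothetical, made up for illustration. It does not settle which opinion is right; it only shows what each layer yields.

```python
import html
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"

# Hypothetical one-element feed wrapping the title line from the question.
FEED = (
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    '<title type="html"><![CDATA[Johnson &amp; Johnson]]></title>'
    '</feed>'
)


def read_title(feed_xml):
    """Return (text as seen by the XML parser, text after the type="html" decode)."""
    root = ET.fromstring(feed_xml)
    title = root.find(f"{{{ATOM_NS}}}title")

    # XML layer: inside CDATA, "&amp;" is five literal characters,
    # so the parser hands it back unchanged.
    xml_text = title.text or ""

    # Atom layer (Opinion #2's reading): type="html" marks the content as
    # escaped HTML, so entity references are decoded one more time.
    if title.get("type") == "html":
        return xml_text, html.unescape(xml_text)
    return xml_text, xml_text


xml_level, atom_level = read_title(FEED)
print(xml_level)   # Johnson &amp; Johnson
print(atom_level)  # Johnson & Johnson
```

Note that `html.unescape` only decodes character references; it does not strip or sanitize any HTML markup, which a real feed reader would also have to deal with.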
`title`. This question revolves around whether or not the CDATA section should circumvent the typical unescaping that they do when parsing the data into its unescaped form. – mmcdole Apr 22 '16 at 21:51