How should an Atom feed parser handle the following line of XML in a feed:

<title type="html"><![CDATA[Johnson &amp; Johnson]]></title>

For the sake of the discussion, let's assume that the originally intended text was in fact Johnson & Johnson. I came across an online discussion of this issue, and there seemed to be two different opinions:

1. Opinion #1 claims that this content is double-encoded: the text "Johnson & Johnson" has been entity-escaped and then encoded again by being wrapped in a CDATA section. He states that a well-behaved XML parser will return Johnson &amp; Johnson, because this is how the XML spec says CDATA-encoded data should be handled.

2. Opinion #2 claims that the Atom spec takes precedence. He states that the CDATA acts as a passthrough: Johnson &amp; Johnson comes out as Johnson &amp; Johnson. If this were just an XML document, it would end there. However, because it is Atom, we must then look at the Atom spec to determine the proper behavior. The Atom spec states that any element with type="html" contains entity-escaped HTML. Therefore, we should be free to decode it.

Which of these is factually correct? Should a proper Atom XML parser produce Johnson & Johnson or Johnson &amp; Johnson in this particular situation?

mmcdole

3 Answers

Both opinions are correct:

  • The title encoded as text is Johnson & Johnson.
  • The title encoded as HTML is Johnson &amp; Johnson
  • The title encoded as HTML in XML is <![CDATA[Johnson &amp; Johnson]]>
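
The three layers above can be sketched in Python. This is only an illustration using the standard library's `html` and `xml.etree.ElementTree` modules; the variable names are made up for the example, and `html.unescape` stands in for "treat the value as HTML" (it only decodes character references, it does not strip tags or sanitize anything).

```python
import html
import xml.etree.ElementTree as ET

plain_text = "Johnson & Johnson"

# Layer 1: escape the plain text so that it is valid HTML.
as_html = html.escape(plain_text, quote=False)

# Layer 2: wrap the HTML in a CDATA section so it can sit inside XML.
as_xml = f'<title type="html"><![CDATA[{as_html}]]></title>'

# Undo the layers in reverse order.
element = ET.fromstring(as_xml)
html_value = element.text               # XML layer removed: still HTML
text_value = html.unescape(html_value)  # HTML layer removed: plain text

print(as_html)
print(html_value)
print(text_value)
```

Under this reading, the XML parser hands back the HTML form, and whatever consumes the HTML gets the plain text, so each opinion describes a different layer.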
Alf Eaton

CDATA is character data: the parser does not interpret anything between <![CDATA[ and ]]> as markup. It has to work that way, since XML cannot otherwise handle a bare & in character data. Therefore, there is no "double encoding": any parser skips to the closing ]]>, ignoring anything in between. I've not come across a parser that allows actual nesting (embedded full CDATA open and close markers).
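
One way to check this is to compare a CDATA section with the equivalent entity-escaped markup. The sketch below assumes Python's standard `xml.etree.ElementTree`; the two sample strings are constructed for the example and are not taken from a real feed.

```python
import xml.etree.ElementTree as ET

# The same character data written two ways at the XML level.
cdata_form = '<title type="html"><![CDATA[Johnson &amp; Johnson]]></title>'
escaped_form = '<title type="html">Johnson &amp;amp; Johnson</title>'

# Inside CDATA, "&amp;" is left as five literal characters.
cdata_text = ET.fromstring(cdata_form).text

# Outside CDATA, the parser resolves "&amp;" to "&", leaving "&amp;".
escaped_text = ET.fromstring(escaped_form).text

print(cdata_text)
print(escaped_text)
print(cdata_text == escaped_text)
```

Both forms yield the same text node, which suggests that CDATA is just an alternative way of writing the XML-level escaping rather than an extra layer of it.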

Mike

Content between CDATA markers is not parsed for entities or tags, so the parsed value of the text node is Johnson &amp; Johnson.

Note that the attribute says type="html", so it should then be parsed as HTML.

For example, if you were expressing this as a web page, you might write:

<h1>Johnson &amp; Johnson</h1>

If it had said type="text", then you would have needed to encode the plain text as HTML, which would have given you:

<h1>Johnson &amp;amp; Johnson</h1>
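
The two cases above could be handled along these lines. `render_title_as_h1` is a hypothetical helper written for this example, not part of any feed library; it takes the text node as returned by the XML parser plus the `type` attribute, and it deliberately skips sanitizing the HTML, which real code would need to do.

```python
import html


def render_title_as_h1(value: str, content_type: str = "text") -> str:
    """Wrap a parsed Atom title in an <h1>, respecting its type attribute."""
    if content_type == "html":
        # The value is already HTML source, so insert it unchanged.
        inner = value
    elif content_type == "text":
        # The value is plain text, so it must be escaped to become HTML.
        inner = html.escape(value, quote=False)
    else:
        # type="xhtml" and anything else is out of scope for this sketch.
        raise ValueError(f"unsupported content type: {content_type!r}")
    return f"<h1>{inner}</h1>"


parsed_value = "Johnson &amp; Johnson"  # text node after XML parsing

print(render_title_as_h1(parsed_value, "html"))
print(render_title_as_h1(parsed_value, "text"))
```

With type="html" a browser would display Johnson & Johnson, while with type="text" it would display the literal characters Johnson &amp; Johnson.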
Quentin
  • Your statement "so it should then be parsed as HTML" is where the ambiguity comes in for me. If the Atom library, typically for `type="html"` elements would html-decode its element's text if there was no CDATA, are you stating that it also should html-decode the element text that is wrapped in a CDATA? – mmcdole Apr 22 '16 at 20:52
  • *If the Atom library, typically for type="html"* — The Atom library shouldn't, the HTML renderer should. – Quentin Apr 22 '16 at 21:06
  • Why shouldn't the feed parser libraries take it upon themselves to unescape elements marked as `type="html"` when the spec clearly states that they are escaped as such? When you ask Python's `feedparser` library or PHP's `Simplepie` library to return the `Title` from a feed whose title is defined as `Example &lt;p&gt;title&lt;/p&gt;`, they will all return `Example <p>title</p>`. This question revolves around whether or not the CDATA section should circumvent the unescaping that they typically do when parsing the data into its unescaped form.
    – mmcdole Apr 22 '16 at 21:51
  • @mmcdole — In that example that HTML is just encoded using entities instead of using CDATA. So yes, CDATA should be treated as CDATA. – Quentin Apr 22 '16 at 22:03