I have an XSL that I use to render RSS feeds. I want to detect whether the <description> element of an item starts with <![CDATA[ - if so, the <description> content should not be rendered. If it doesn't start with <![CDATA[ then it can be rendered.

But I can't seem to match <![CDATA.

Here's an example RSS feed:

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="pretty-feed.xsl"?>
<rss version="2.0">
  <channel>
    <title>My Blog</title>
    <link>http://example.com/</link>
    <description>My Blog description</description>
    <item>
       <title>My Blog Post</title>
       <link>http://example.com/2002/09/01/my-post/</link>
       <description>Content of the post.</description>
    </item>
  </channel>
</rss>

And here's part of my pretty-feed.xsl file, showing the relevant part:

<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
  <xsl:output method="html" version="1.0" encoding="UTF-8" indent="yes" />
  <xsl:template match="/">
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head></head>
      <body>
        <xsl:for-each select="/rss/channel/item">
          <xsl:if test="not(starts-with(normalize-space(description), '&lt;![CDATA['))">
            <p>
              <xsl:value-of select="description" />
            </p>
          </xsl:if>
        </xsl:for-each>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>

This always renders the <description>, whatever it starts with. So I guess the <![CDATA[ isn't "seen" by the XSL as characters? Is there a way I can detect whether it exists or not?

Phil Gyford
  • No, XSL doesn’t see it as those characters since it’s part of the XML structure. As far as I know there’s no way for XSL to determine if there’s CDATA there since the text is extracted beforehand and it’s given as is to anything wanting the contents of the node – Sami Kuhmonen Oct 23 '21 at 14:51
  • Ah, just as I feared. Thanks all the same @SamiKuhmonen. – Phil Gyford Oct 23 '21 at 15:15
  • What is the actual problem you are trying to solve by this? – michael.hor257k Oct 23 '21 at 15:39
  • Some tree models like DOM allow you to distinguish plain text nodes and CDATA section nodes but XPath/XSLT operate on an XDM tree that only has text nodes, independent of the lexical markup. – Martin Honnen Oct 23 '21 at 16:22
  • @michael.hor257k I was hoping to come up with one `.xsl` file that could be used to render feeds and display their `<description>` if it's plaintext, but not if it contains CDATA (which *often* indicates it contains an entire blog post etc). – Phil Gyford Oct 23 '21 at 16:36
  • I am not sure what the significance of containing another blog is, but I would think it could be detected by looking at the actual content of `description`? Hard to tell for sure without an example that actually contains one. Perhaps all you need is to look for the presence of the `<` character. – michael.hor257k Oct 23 '21 at 16:58
  • Which XSLT processor do you use, how do you use it? – Martin Honnen Oct 23 '21 at 17:26
  • @michael.hor257k I only wish to display a brief piece of plaintext. Sometimes weblogs' RSS feeds use `description` to contain that, but some other weblogs fill it with an entire blog post, including HTML, wrapped in the CDATA markers. Both are valid, but I only want to display one of them. – Phil Gyford Oct 24 '21 at 09:10
  • @MartinHonnen I have no idea what processor I'm using, sorry. I'm viewing an RSS feed in my browser, and that feed links to an XSL file as shown above, in order to display the RSS feed in a more "friendly" manner. – Phil Gyford Oct 24 '21 at 09:11
  • Inside the browser you usually deal with an XSLT 1.0 processor (Transformiix for Mozilla, libxslt for Chrome and Edge), so I asked because your XSLT code says `version="3.0"`, which seemed to suggest you might run outside of the normal browser-based, `<?xml-stylesheet?>` triggered XSLT. – Martin Honnen Oct 24 '21 at 09:22
  • @PhilGyford If this is an issue of which feed you are viewing, you could branch by the channel's title or link. If you don't have a list (or a blacklist) of channels, then you must find some other property to go by - e.g. the length of the text in `description` or (as I already suggested) some recognizable string. -- P.S. See here how to identify your processor: https://stackoverflow.com/a/25245033/3016153 – michael.hor257k Oct 24 '21 at 11:10
  • @MartinHonnen Ah, thanks - I'm adapting this file from one someone else wrote, for the same purpose, and they set it as `version="3.0"`. I assume then I should change this to `version="1.0"`? – Phil Gyford Oct 25 '21 at 08:55
  • @michael.hor257k This is a file that other people can use to "prettify" their own RSS feeds, so *I* don't know what format they will be in. As you suggest, checking for the length of the `description` seems like the best idea. – Phil Gyford Oct 25 '21 at 08:57
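
Martin Honnen's point about DOM versus XDM can be illustrated outside XSLT. The following is only a sketch, assuming a Python preprocessing step with the standard library's `xml.dom.minidom` rather than the browser's XSLT processor; `first_child_kind` and the sample strings are hypothetical names made up for the example.

```python
from xml.dom import minidom, Node


def first_child_kind(xml_text):
    """Report whether the root element's first child is a CDATA section or plain text."""
    node = minidom.parseString(xml_text).documentElement.firstChild
    if node.nodeType == Node.CDATA_SECTION_NODE:
        return "cdata"
    if node.nodeType == Node.TEXT_NODE:
        return "text"
    return "other"


# Sample inputs (assumptions, modelled on the feed in the question).
plain = "<description>Content of the post.</description>"
wrapped = "<description><![CDATA[<p>Entire post</p>]]></description>"

print(first_child_kind(plain))
print(first_child_kind(wrapped))
```

This distinction exists only in the DOM tree. Once the same document is handed to an XSLT processor, both cases arrive as ordinary text nodes, so this approach does not help a stylesheet loaded by the browser.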

1 Answer

You can't. In the XML data model used by XSLT, CDATA is treated as irrelevant, rather like the whitespace between attributes within a start tag. The elements <z>&amp;&amp;&amp;</z> and <z><![CDATA[&&&]]></z> are considered 100% equivalent; there is no distinction between them, and both simply have a string value of "&&&".
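
The equivalence can be checked with any parser that, like XSLT, exposes only string values. A minimal sketch using Python's standard `xml.etree.ElementTree` (chosen here purely for illustration, it is not what the browser uses):

```python
import xml.etree.ElementTree as ET

# The two elements from the paragraph above.
escaped = ET.fromstring("<z>&amp;&amp;&amp;</z>")
cdata = ET.fromstring("<z><![CDATA[&&&]]></z>")

# Both parse to the same text; the CDATA markers never reach the application.
print(escaped.text == cdata.text)               # True
print(cdata.text.startswith("<![CDATA["))       # False
```

This is the same reason the `starts-with(..., '&lt;![CDATA[')` test in the question can never match: the markers are consumed by the parser before the stylesheet sees the text.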

If the document design is using CDATA tags to convey information, then it needs to be redesigned.

Michael Kay