12

When I come across a broken RSS feed, the usual reason it's all blown to pieces is that line 23 says "Sanford & Sons."

The most confusing part is that if you convert the & into &amp;, all is well, even though the replacement still contains the problem character.

Why does RSS fail at rendering the ampersand (&) character by default?

Makoto
Sampson

8 Answers

14

When the parser sees a 'raw' &, it expects one of the valid escape sequences that begin with & (such as '&amp;'). When it finds an invalid sequence instead, it throws an error. That's all there is to it.
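
A minimal sketch of that behavior, using Python's standard xml.etree.ElementTree as a stand-in for whatever parser a feed reader happens to use. The `parses` helper is a hypothetical name for illustration only:

```python
import xml.etree.ElementTree as ET


def parses(xml_text: str) -> bool:
    """Return True if xml_text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


raw = "<title>Sanford & Sons</title>"          # bare ampersand
escaped = "<title>Sanford &amp; Sons</title>"  # ampersand written as a reference

print(parses(raw))                  # False: "& Sons" does not form a valid reference
print(parses(escaped))              # True
print(ET.fromstring(escaped).text)  # Sanford & Sons
```

The escaped form "still contains" an ampersand only in the source text; after parsing, the reader sees a single literal & again.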

HoldOffHunger
Mitch Wheat
  • 7
    ..because of the XML specification – Ed S. Jun 23 '09 at 00:52
  • 3
    This is no different than asking why you can't use raw < and > in XML text – ironfroggy Jun 23 '09 at 00:56
  • 1
    There's really no further answer for you to seek. – defines Jun 23 '09 at 00:56
  • 1
    @Dustin, well, there is actually. Why they decided to do that rather than fall back on interpreting the & if it's not followed by other expected chars - but I don't expect anybody here to know the insights on those questions. – Sampson Jun 23 '09 at 01:49
  • 1
    @Jonathan: Because trying to cope with invalid input is a *bad idea*. It leads to more people *writing* bad input, which leads to incompatibility as each parser does it differently because bad input isn't part of the standard or it wouldn't *be* bad input. – Zan Lynx Apr 25 '11 at 03:30
6

Because RSS is an XML-based format, and in XML the ampersand (&) signifies the start of an entity reference. The parser is expecting something else there.

You could argue that it should be smart enough to know that the ampersand in "Sanford & Sons" is just an ampersand. But what about when you really want to show an ampersand followed by text? Is "&pc;" some custom (also invalid) entity, or should the parser interpret that as an ampersand too? What about "&amp;amp;"?
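
To make those two cases concrete, here is a small sketch with Python's xml.etree.ElementTree (one conforming parser among many, chosen only for illustration). The variable names are hypothetical:

```python
import xml.etree.ElementTree as ET

# "&amp;amp;" is unambiguous: the parser undoes exactly one level of escaping.
double = ET.fromstring("<title>&amp;amp;</title>")
print(double.text)  # &amp;

# "&pc;" looks like a reference, but no such entity is declared,
# so the parser rejects the document instead of guessing.
try:
    ET.fromstring("<title>&pc; repair</title>")
    undefined_ok = True
except ET.ParseError as err:
    undefined_ok = False
    print("rejected:", err)
```

Because the rules are strict, every parser gives the same answer for both inputs, which is the point of not guessing.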

Joel Coehoorn
4

Because it must be escaped in XML syntax. Same reason here.

http://myst-technology.com/public/item/11878

Ed S.
3

The & is a remnant of XML's roots in SGML. There the &...; syntax is used to escape all kinds of things, even whole documents to embed. Therefore, if you want to use a literal "&", you have to escape it. It is the same as using quotes inside strings in any programming language.

There is no point in having XML do error correction along the lines of "if no letter follows, output a literal &", because that would break the SGML syntax that XML is, as said, based on.

Most browsers do this for HTML because they decided it is better for users to see something than an SGML parse error. But that opens a Pandora's box of which browser applies which kind of error correction. Look at the HTML5 spec and you'll see what it takes to really define error handling. It's a lot of text.

One special case: You can include a literal "&" in XML/RSS, if you enclose it in a so-called "CDATA" section. That will look like the following:

<item> <![CDATA[ Smith & Wesson ]]> </item>
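
As a quick check of the CDATA form, this sketch parses the same line with Python's xml.etree.ElementTree (used here only as an example parser; the variable names are hypothetical):

```python
import xml.etree.ElementTree as ET

item = ET.fromstring("<item> <![CDATA[ Smith & Wesson ]]> </item>")

# Inside CDATA the & is taken literally; strip() drops the padding spaces.
cdata_text = item.text.strip()
print(cdata_text)  # Smith & Wesson

# ElementTree does not keep the CDATA wrapper when writing the element
# back out; it falls back to ordinary escaping instead.
reserialized = ET.tostring(item, encoding="unicode")
print(reserialized)
```

One caveat: a CDATA section ends at the first "]]>", so text containing that sequence cannot be wrapped this way without splitting it.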

Cheers,

Boldewyn
2

Because RSS is XML, and XML demands certain characters be escaped, such as the ampersand.
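
If the feed is generated by an XML library rather than by string concatenation, the escaping usually happens automatically. A minimal sketch with Python's xml.etree.ElementTree, where the element names are just an illustrative RSS-like fragment:

```python
import xml.etree.ElementTree as ET

item = ET.Element("item")
title = ET.SubElement(item, "title")
title.text = "Sanford & Sons"  # plain text, no manual escaping

# The serializer escapes the ampersand on the way out.
feed_fragment = ET.tostring(item, encoding="unicode")
print(feed_fragment)  # <item><title>Sanford &amp; Sons</title></item>
```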

Svend
1

This depends highly on the RSS client, but most likely it's attempting to XML-decode the contents (in your example, "Sanford & Sons"). When that happens, & indicates the start of an escaped character. If you don't write &amp;, the decoder will try to use the next few characters to complete the escape sequence, and it will most likely fail.

Randolpho
0

Not sure if this helps, but when I needed to solve this problem I used the numeric entity reference for an ampersand, which is &#38;. Running this through the W3C validator passed, so I guess it's OK to use.
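
A short sketch comparing the named reference with the decimal (&#38;) and hexadecimal (&#x26;) character references for the ampersand, using Python's xml.etree.ElementTree as an example parser. The `decoded` list is a hypothetical name:

```python
import xml.etree.ElementTree as ET

references = ("&amp;", "&#38;", "&#x26;")

# All three spellings should decode to the same literal ampersand.
decoded = [
    ET.fromstring(f"<title>Sanford {ref} Sons</title>").text
    for ref in references
]

for ref, text in zip(references, decoded):
    print(ref, "->", text)
```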

Cheers

slarti42uk
0

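In PHP, you can solve this problem with html_entity_decode() (Source: PHP.net), like so...

$xml_line =
             '<description>' .
             str_replace(
                 // '&' must come first so the entities added for < and > are not escaped again.
                 ['&', '<', '>'],
                 ['&amp;', '&lt;', '&gt;'],
                 // Turn HTML-only entities (e.g. &nbsp;) into plain characters first.
                 html_entity_decode($description)
             ) .
             '</description>';

Don't forget that after decoding you'll need to swap &, < and > back to their XML entities so that they don't break the XML.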

If you find the equivalent of html_entity_decode() for whatever language you are using, you'll be on your way.
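
In Python, for example, the counterpart is html.unescape(), and xml.sax.saxutils.escape() handles the swap back to XML entities. A minimal sketch, where `to_xml_text` is a hypothetical helper name:

```python
import html
from xml.sax.saxutils import escape


def to_xml_text(description: str) -> str:
    """Decode HTML entities, then escape &, < and > for use as XML text."""
    return escape(html.unescape(description))


print(to_xml_text("Sanford &amp; Sons &copy; <b>TV</b>"))
# Sanford &amp; Sons © &lt;b&gt;TV&lt;/b&gt;
```

Decoding first means HTML-only entities such as &copy; become plain characters, and input that is already escaped does not end up double-escaped.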

HoldOffHunger