Why can't RSS handle the ampersand?

Question

When I come across a broken RSS feed, the usual reason it's all blown to pieces is that line 23 says "Sanford & Sons."

The most confusing thing is the fact that if you convert the & into &amp;, all is well, even though your alternative still contains the problem character.

Why does RSS fail at rendering the ampersand (&) character by default?

score 14 · Accepted Answer · edited Dec 15 '21 at 01:04

When a 'raw' & is seen, the parser looks for one of the valid escape sequences that begin with & (such as '&amp;'). When it finds an invalid sequence instead, it throws an error. That's all there is to it.
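
A minimal sketch of that behavior, using Python's standard-library xml.etree.ElementTree as a stand-in for whatever parser an RSS reader uses. The helper name try_parse and the <title> element are assumptions made for illustration:

```python
import xml.etree.ElementTree as ET


def try_parse(xml_text):
    """Return the element's text, or the parser's error message if parsing fails."""
    try:
        return ET.fromstring(xml_text).text
    except ET.ParseError as err:
        return f"ParseError: {err}"


# A raw ampersand is not followed by a valid escape sequence, so parsing fails.
raw = try_parse("<title>Sanford & Sons</title>")

# The escaped form parses, and the reader gets a plain ampersand back.
escaped = try_parse("<title>Sanford &amp; Sons</title>")

print(raw)      # an error message; the exact wording depends on the parser
print(escaped)  # Sanford & Sons
```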

HoldOffHunger

answered Jun 23 '09 at 00:49

Mitch Wheat

7

..because of the XML specification – Ed S. Jun 23 '09 at 00:52
3

This is no different than asking why you can't use raw < and > in XML text – ironfroggy Jun 23 '09 at 00:56
1

There's really no further answer for you to seek. – defines Jun 23 '09 at 00:56
1

@Dustin, well, there is actually. Why they decided to do that rather than fall back on interpreting the & if it's not followed by other expected chars - but I don't expect anybody here to know the insights on those questions. – Sampson Jun 23 '09 at 01:49
1

@Jonathan: Because trying to cope with invalid input is a *bad idea*. It leads to more people *writing* bad input, which leads to incompatibility as each parser does it differently because bad input isn't part of the standard or it wouldn't *be* bad input. – Zan Lynx Apr 25 '11 at 03:30

Joel Coehoorn · Answer 2 · 2011-04-25T14:10:00.827

Because RSS is an XML-based format, and in XML the ampersand (&) signifies the start of an entity reference. The parser is expecting something else there.

You could argue that it should be smart enough to know that the ampersand in "Sanford & Sons" is just an ampersand. But what about when you really want to show an ampersand followed by text? Is "&pc;" some custom (also invalid) entity, or should it be interpreted as an ampersand as well? What about "&amp;"?
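
To make the ambiguity concrete, here is a small sketch with Python's standard-library xml.etree.ElementTree. The helper parses and the sample titles are hypothetical; the point is only that a strict parser rejects the undeclared entity rather than guessing:

```python
import xml.etree.ElementTree as ET


def parses(xml_text):
    """Return True if the text is well-formed XML for this parser, else False."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# "&pc;" has the shape of an entity reference, but no entity named "pc" is declared.
undefined_entity_ok = parses("<title>&pc; repair</title>")

# "&amp;amp;" is decoded exactly once, so the reader sees the literal text "&amp;".
double_escaped = ET.fromstring("<title>&amp;amp;</title>").text

print(undefined_entity_ok)  # False
print(double_escaped)       # &amp;
```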

score 4 · Answer 3 · answered Jun 23 '09 at 00:49

Because it must be escaped in XML syntax. Same reason here.

http://myst-technology.com/public/item/11878

Ed S.

1

This link is broken now – 8bitjunkie Jun 04 '20 at 17:49

score 3 · Answer 4 · answered Jun 23 '09 at 19:58

The & is a remnant of XML's roots in SGML. There the &...; syntax is used to escape all kinds of things, even whole documents to embed. Therefore, if you want to use a literal "&", you have to escape it. It is the same as using quotes inside strings in any programming language.

There is no use in letting XML do error correction along the lines of "if no letter follows, output a literal &", because that would break the SGML syntax that XML is based on.

Most browsers do this for HTML because their makers decided that it is better for users to see something than an SGML parse error. But this opens a whole new Pandora's box of which browser does what kind of error correction. Look at the HTML5 spec and you'll see what it means to really define error handling. It's a lot of text.

One special case: You can include a literal "&" in XML/RSS, if you enclose it in a so-called "CDATA" section. That will look like the following:
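
A minimal sketch of such a section, embedded in a Python string and parsed with the standard library's xml.etree.ElementTree. The <description> element and its text are illustrative assumptions:

```python
import xml.etree.ElementTree as ET

# Hypothetical item description; only the <![CDATA[ ... ]]> wrapper matters here.
xml_text = "<description><![CDATA[Sanford & Sons]]></description>"

# Inside CDATA the parser treats & as ordinary character data.
description = ET.fromstring(xml_text).text

print(description)  # Sanford & Sons
```

The one sequence a CDATA section cannot contain is ]]>, because that is what ends it.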

Cheers,

score 2 · Answer 5 · answered Jun 23 '09 at 01:02

Because RSS is XML, and XML demands certain characters be escaped, such as the ampersand.

Svend

score 1 · Answer 6 · answered Jun 23 '09 at 00:52

This depends highly on the RSS client, but most likely it's attempting to XML-decode the contents (in your example, "Sanford & Sons"). When that happens, & indicates the start of an escaped character. If you don't write it as &amp;, the decoder will try to use the next few characters to complete the escape sequence, and odds are that it will fail.

score 0 · Answer 7 · answered Aug 24 '10 at 15:49

Not sure if this helps, but when I needed to solve this problem I used the numeric entity reference for an ampersand, which is &#38;. Running this through the W3C validator passed, so I guess it's OK to use this.
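
As a quick check of the numeric form, this sketch parses the decimal reference &#38; and its hexadecimal twin &#x26; with Python's standard-library xml.etree.ElementTree. The <title> element is just an example:

```python
import xml.etree.ElementTree as ET

# Decimal and hexadecimal character references for the same code point (U+0026).
decimal_ref = ET.fromstring("<title>Sanford &#38; Sons</title>").text
hex_ref = ET.fromstring("<title>Sanford &#x26; Sons</title>").text

print(decimal_ref)  # Sanford & Sons
print(hex_ref)      # Sanford & Sons
```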

Cheers

slarti42uk

HoldOffHunger · Answer 8 · 2021-12-15T17:13:35.867

In PHP, you can solve this problem with html_entity_decode() (Source: PHP.net), like so...

$xml_line =
             '<description>' .
             str_replace(
                 // '&' comes first so the '&' in '&lt;' and '&gt;' is not escaped again
                 ['&', '<', '>',],
                 ['&amp;', '&lt;', '&gt;',],
                 html_entity_decode($description)
             ) .
             '</description>';

Don't forget that you'll need to swap &, < and > back to their escaped equivalents after decoding, so that they don't break the DOM XML.

If you find the equivalent of html_entity_decode() for whatever language you are using, you'll be on your way.
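
For Python, one rough equivalent is html.unescape() followed by xml.sax.saxutils.escape(), which re-escapes &, < and >. This is a sketch under that assumption; the function name to_xml_text and the sample description are hypothetical:

```python
import html
from xml.sax.saxutils import escape


def to_xml_text(description):
    """Decode HTML entities, then re-escape only the characters XML reserves."""
    decoded = html.unescape(description)  # e.g. "&eacute;" becomes "é"
    return escape(decoded)                # "&", "<", ">" become XML escapes


xml_line = (
    "<description>"
    + to_xml_text("Sanford &amp; Sons &lt;caf&eacute;&gt;")
    + "</description>"
)

print(xml_line)
```

Decoding first matters because HTML-only entities such as &eacute; are not defined in plain XML, while the second step makes sure no raw ampersand is left behind.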
