I am trying to create an RSS feed that validates with the W3C validator. I keep getting errors for URLs that contain characters such as £, ` or – (an en dash).

Here is one of the URLs:

http://www.example.co.uk/news/2012/april/stamp-rationing-–-why-the-royal-mail-are-ripping-you-off

Here is the error:

This feed does not validate. line 14, column 119: link must be a full and valid URL: http://www.example.co.uk/news/2012/april/stamp-rationing-–-why-the-royal-mail-are-ripping-you-off [help] ... –-why-the-royal-mail-are-ripping-you-off

I have tried replacing the symbols with escape sequences, but this doesn't work. Here are the replacements I have been using:

Text = Text.Replace("-", "&#45");
Text = Text.Replace("£", "%C2%A");
Text = Text.Replace("`", "%60");
Text = Text.Replace("’", "%60");

Does anyone have any idea how to solve this problem? Here is another link that is causing me problems:

http://www.example.co.uk/news/2012/march/for-sale-3-bed-detached-london-home-£15,000

Error:

This feed does not validate. line 14, column 106: link must be a full and valid URL: http://www.example.co.uk/news/2012/march/for-sale-3-bed-detached-london-home-£15,000 [help] ... -sale-3-bed-detached-london-home-£15,000

Funky

3 Answers

You will need to URL-encode the URLs before writing them into the RSS feed:

var encoded = HttpUtility.UrlEncode(aUrl);

Note that the encoded URLs will not be usable directly, because reserved characters such as : and / get encoded as well.

If you only need the values to be valid XML, use SecurityElement.Escape instead.

var escaped = SecurityElement.Escape(aUrl);
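
One way to keep the URL usable while still encoding the problem characters is to percent-encode each path segment on its own. The following is only a sketch, not part of the original answer: the FeedLinkEncoder class and EncodePathSegments method are hypothetical names, and it assumes the input is an unencoded absolute URL with no query string or fragment.

```csharp
using System;
using System.Linq;

public static class FeedLinkEncoder
{
    // Hypothetical helper: percent-encodes every path segment separately so
    // that the scheme, the host and the "/" separators are left untouched.
    // Assumes the URL is not already encoded and has no query or fragment.
    public static string EncodePathSegments(string url)
    {
        // Locate the first "/" after "scheme://host"; that is where the path starts.
        int schemeEnd = url.IndexOf("://", StringComparison.Ordinal);
        int pathStart = schemeEnd < 0 ? -1 : url.IndexOf('/', schemeEnd + 3);
        if (pathStart < 0)
        {
            return url; // nothing after the host, so nothing to encode
        }

        string prefix = url.Substring(0, pathStart);
        string[] segments = url.Substring(pathStart).Split('/');

        // Uri.EscapeDataString emits UTF-8 percent-escapes for non-ASCII characters.
        return prefix + string.Join("/", segments.Select(s => Uri.EscapeDataString(s)));
    }
}
```

With this sketch, the en dash in the first URL from the question becomes %E2%80%93 and the £ in the second becomes %C2%A3, while http://www.example.co.uk/ and the slashes between segments stay as they are. Running it on a URL that is already encoded would encode the % signs a second time.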
Oded

I'm building an API for my system, and I use the following to normalize field values. Try filtering the value like this in PHP:

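$value = preg_replace('/[^a-z]/i', '', $value); // keeps ASCII letters only
// The /e modifier was removed in PHP 7, so the entity step uses a callback.
$value = preg_replace_callback('/[^\x09\x0A\x0D\x20-\x7F]/',
    function ($m) { return '&#' . ord($m[0]) . ';'; }, $value);
$value = htmlentities($value, ENT_NOQUOTES, 'UTF-8', false); // no double-encoding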
Ivo Pereira
  • Hi, that looks great, but unfortunately I don't develop in PHP; I use C# – Funky Jan 22 '13 at 10:44
  • For the preg_replace part I think you can do something with this: http://stackoverflow.com/questions/166855/c-sharp-preg-replace ; and for htmlentities you maybe would like to check this :) http://stackoverflow.com/questions/1891134/convert-special-chars-to-html-entities-without-changing-tags-and-parameters – Ivo Pereira Jan 22 '13 at 10:50

The answer is either to use UTF-8 encoding or to convert non-ASCII characters to XML entities.

  • UTF-8 encoding: Make sure the document is output in UTF-8, and includes the relevant encoding headers.

    See also UTF-8 encoding xml in PHP

  • Entity encoding: Convert all non-ASCII characters to XML entities.

    XML entities look like this: &#163; (that one is for the £ sign). Most programming languages will either do this automatically for you as you generate the XML document, or provide standard functions for doing it. You didn't specify the language you're using, but the above should help you find the appropriate API functions.

One thing you should not do is generate XML manually (i.e. output tags and attributes as strings) or string-replace the entities by hand. Use the proper APIs instead. Generating XML (or any other standard data format) manually is always likely to end in problems like this, and it does seem a bit crazy to do it the hard way when the tools to do it properly are right there in front of you.
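
Since the asker mentions C# elsewhere on this page, here is a minimal sketch of what "using the proper APIs" could look like with XmlWriter from System.Xml. It is an illustration under assumptions, not the asker's actual code: the FeedWriter class, the BuildFeed method and the channel title, link and description are placeholders.

```csharp
using System.IO;
using System.Text;
using System.Xml;

public static class FeedWriter
{
    // Hypothetical helper: writes a one-item RSS 2.0 document as UTF-8.
    // XmlWriter escapes &, < and > in element text, so no manual
    // string replacement is needed.
    public static string BuildFeed(string itemTitle, string itemLink)
    {
        var settings = new XmlWriterSettings
        {
            Encoding = new UTF8Encoding(false), // UTF-8 without a byte order mark
            Indent = true
        };

        using (var stream = new MemoryStream())
        {
            using (XmlWriter writer = XmlWriter.Create(stream, settings))
            {
                writer.WriteStartDocument(); // emits the XML declaration with the encoding
                writer.WriteStartElement("rss");
                writer.WriteAttributeString("version", "2.0");
                writer.WriteStartElement("channel");
                writer.WriteElementString("title", "Example feed");             // placeholder
                writer.WriteElementString("link", "http://www.example.co.uk/"); // placeholder
                writer.WriteElementString("description", "Example channel");    // placeholder
                writer.WriteStartElement("item");
                writer.WriteElementString("title", itemTitle);
                writer.WriteElementString("link", itemLink);
                writer.WriteEndElement(); // item
                writer.WriteEndElement(); // channel
                writer.WriteEndElement(); // rss
                writer.WriteEndDocument();
            }

            // The writer is disposed (and flushed) here; decode the bytes it produced.
            return Encoding.UTF8.GetString(stream.ToArray());
        }
    }
}
```

The writer only takes care of XML well-formedness and the encoding declaration. A link containing £ or an en dash still has to be percent-encoded before it is passed in, because the validator checks that the link is a valid URL independently of the XML escaping.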

SDC