3

I'm trying to write a simple RSS feed reader in C# using the XmlReader class. The problem I've run into is that some feeds use, from what I understand, HTML representations of some characters, such as &#146; for an apostrophe, in the title/description. In fact, a couple of newspapers I was looking at had some articles with just a regular old single quote used as an apostrophe and some where it was replaced with the numeric code 146. I've considered doing string replacements before displaying the title/description, but I'd really rather avoid kludging and find a proper solution, if there is one, that will also work for other characters that use a similar format. Any suggestions would be very much appreciated.

Egor
  • Possible duplicate of http://stackoverflow.com/questions/122641/how-can-i-decode-html-characters-in-c – DaveShaw Jun 27 '11 at 20:23
  • For example, the Globe and Mail feed http://www.theglobeandmail.com/pages/rss/ almost always has at least one article with a &#145; or &#146; code in the title. Note that I see them when viewing the feed page with just my browser (IE9). – Egor Jun 27 '11 at 23:36
  • @Egor : have you managed to get it working? Which solution did you use? – sll Oct 31 '11 at 14:18

3 Answers

0

You can use HttpUtility.HtmlDecode

Bas
  • 1
    That doesn't decode numeric character references, which is what the poster is asking. – wsanville Jun 27 '11 at 20:28
  • I tried HtmlDecode, but it seems to strip the characters out of the string completely rather than replace them with an apostrophe. It's still a significant improvement, as I'd rather show "wont" than "won&#146;t", so I'll go with this if the other suggestions don't work out. Thank you, helpful post. – Egor Jun 27 '11 at 21:47
0

Are you using the built-in classes in the System.ServiceModel.Syndication namespace to read the feeds?

If not, try this; I believe it should automatically solve issues like the one you've described:

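// ms is the Stream that holds the downloaded feed XML
XmlReader reader = XmlReader.Create(ms);
// Configure XmlReader reader ...
// Create a new Syndication Feed (Load detects RSS 2.0 or Atom 1.0)
SyndicationFeed feed = SyndicationFeed.Load(reader);
// Wrap the feed in the formatter that matches the requested format
SyndicationFeedFormatter formatter;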

switch (format)
{
    case FeedFormat.Atom:
        formatter = new Atom10FeedFormatter(feed);
        break;

    default:
    case FeedFormat.Rss:
        formatter = new Rss20FeedFormatter(feed);
        break;
}

foreach (SyndicationItem item in formatter.Feed.Items)
{
     yield return item;
}
sll
  • 1
    This is actually really useful. I wasn't aware of this namespace, and I think I will, in fact, use it. It will simplify my code and make it more flexible, thank you for the suggestion. Unfortunately, it doesn't solve the issue at hand, as it seems to display the text as-is with the same old numeric codes such as &#146;. – Egor Jun 27 '11 at 22:45
0

According to the Unicode spec, 146 (0x92) is not an apostrophe; it is the C1 control character "PRIVATE USE TWO", which has no visible glyph.

You probably have some editors pasting content from Word (with smart quotes enabled), which is giving you content in a different encoding (Windows-1252).

You should try to specify the correct encoding ("Windows-1252" or code page 1252), and it may work.
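If changing the encoding of the reader is not an option (the numeric references are part of the text, not raw bytes), one possible workaround is to decode the references first and then remap any resulting C1 control characters (U+0080 to U+009F) to the characters Windows-1252 assigns to those byte values. The following is only a sketch under that assumption: the class name FeedText and the method name Decode are hypothetical, and it uses WebUtility.HtmlDecode from System.Net (.NET 4.0 or later) rather than HttpUtility.

```csharp
using System.Net;
using System.Text;

// Hypothetical helper, not part of any library.
public static class FeedText
{
    // Windows-1252 characters for bytes 0x80..0x9F, in order.
    // '\0' marks the five byte values that Windows-1252 leaves undefined.
    private const string Cp1252High =
        "\u20AC\0\u201A\u0192\u201E\u2026\u2020\u2021" +
        "\u02C6\u2030\u0160\u2039\u0152\0\u017D\0" +
        "\0\u2018\u2019\u201C\u201D\u2022\u2013\u2014" +
        "\u02DC\u2122\u0161\u203A\u0153\0\u017E\u0178";

    public static string Decode(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return input;
        }

        // Resolve entities and numeric references; "&#146;" becomes U+0092.
        string decoded = WebUtility.HtmlDecode(input);

        var sb = new StringBuilder(decoded.Length);
        foreach (char c in decoded)
        {
            // Look up C1 control characters in the Windows-1252 table.
            char mapped = (c >= '\u0080' && c <= '\u009F')
                ? Cp1252High[c - 0x80]
                : '\0';

            // Keep the original character when there is no mapping.
            sb.Append(mapped != '\0' ? mapped : c);
        }
        return sb.ToString();
    }
}
```

With SyndicationFeed, this could be applied as FeedText.Decode(item.Title.Text) before display. An explicit table is used here instead of Encoding.GetEncoding(1252) so the sketch does not depend on which code pages the runtime has registered.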

brianary