LINQ to XML: how to ignore HTML code?

Question

I am using XElement (LINQ to XML) to parse an RSS feed.

RSS example:

    <item>
      <title>Waterfront Ice Skating</title>
      <link>http://www.eventfinder.co.nz/2011/sep/wellington/wellington-waterfront-ice-skating?utm_medium=rss</link>
      <description>&lt;p&gt;An ice skating rink in Wellington for a limited time only! 

Enjoy the magic of the New Zealand winter at an outdoor skating experience with all the fun and atmosphere of New York&amp;#039;s Rockefeller Centre or Central Park, ...&lt;/p&gt;&lt;p&gt;Wellington | Friday, 30 September 2011 - Sunday, 30 October 2011&lt;/p&gt;</description>
      <content:encoded><![CDATA[Today, Wellington Waterfront<br/>Wellington]]></content:encoded>
      <guid isPermalink="false">108703</guid>
      <pubDate>2011-09-30T10:00:00Z</pubDate>
      <enclosure url="http://s1.eventfinder.co.nz/uploads/events/transformed/190501-108703-13.jpg" length="5000" type="image/jpeg"></enclosure>
    </item>

It's all working fine, but the description element has a lot of HTML markup that I need to remove.

Description:

    <description>&lt;p&gt;An ice skating rink in Wellington for a limited time only!

    Enjoy the magic of the New Zealand winter at an outdoor skating experience with all the fun and atmosphere of New York&amp;#039;s Rockefeller Centre or Central Park, ...&lt;/p&gt;&lt;p&gt;Wellington | Friday, 30 September 2011 - Sunday, 30 October 2011&lt;/p&gt;</description>

Could anyone assist with this?

What do you mean by "ignore HTML code"? Do you want to extract the text only? – KV Prajapati, Oct 15 '11 at 08:21
@AVD Yes, I would like to extract the text only, and ignore the markup. – Rhys, Oct 15 '11 at 08:29
Take a look at this link: http://www.dotnetperls.com/remove-html-tags – KV Prajapati, Oct 15 '11 at 08:34

tazyDevel · Accepted Answer · 2011-10-15T10:28:17.423

If it is an RSS feed, why not use System.ServiceModel.Syndication? SyndicationFeed, in combination with an XmlReader, will deal with your XML-encoding issues:

    // Requires the System.Xml and System.ServiceModel.Syndication namespaces.
    using (XmlReader reader = XmlReader.Create(@"C:\Users\justMe\myXml.xml"))
    {
        SyndicationFeed myFeed = SyndicationFeed.Load(reader);
        ...
    }

Then remove the HTML tags with a regex, as suggested by @nemesv, or use something like this:

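    // Extension methods must be declared in a static class.
    public static class StringExtensions
    {
        public static string StripHTML(this string htmlText)
        {
            // Remove anything that looks like a tag, then decode entities such as &#039;.
            var reg = new Regex("<[^>]+>");
            return HttpUtility.HtmlDecode(reg.Replace(htmlText, string.Empty));
        }
    }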

score 1 · Answer 2 · edited May 23 '17 at 12:09

First you should HTML-decode the content of the description with System.Net.HttpUtility.HtmlDecode. This turns the encoded &lt;p&gt; into <p>. Then you can remove the HTML tags with a regex (see the question "Using C# regular expressions to remove HTML tags") or with an HTML parsing library.

answered Oct 15 '11 at 08:34

nemesv

No, the content is XML-encoded, not HTML-encoded. Just reading XElement.Value will do; HtmlDecode could go wrong. – H H Oct 15 '11 at 09:03
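
To make the steps discussed in this thread concrete, here is a minimal sketch, assuming the item has already been loaded as an XElement. The class name DescriptionCleaner and the method name ToPlainText are hypothetical and chosen only for illustration. Reading the element's value undoes the XML escaping, a regex removes the tags, and WebUtility.HtmlDecode resolves HTML entities such as &#039; that are still present afterwards.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
public static class DescriptionCleaner
{
    // Matches anything that looks like an HTML tag, e.g. <p> or </p>.
    private static readonly Regex TagPattern = new Regex("<[^>]+>");

    public static string ToPlainText(XElement item)
    {
        // The string conversion returns the element's Value (or null if the
        // element is missing). The XML layer is already decoded at this point:
        // "&lt;p&gt;" has become "<p>" and "&amp;#039;" has become "&#039;".
        string html = (string)item.Element("description") ?? string.Empty;

        // Strip the tags first, so decoded entities cannot be mistaken for markup.
        string withoutTags = TagPattern.Replace(html, string.Empty);

        // Resolve the remaining HTML entities and trim surrounding whitespace.
        return WebUtility.HtmlDecode(withoutTags).Trim();
    }
}
```

Because tags are replaced with an empty string, the text of adjacent paragraphs runs together; replacing them with a single space instead is one way to keep the paragraphs apart. A regex is only a rough filter, so an HTML parsing library is likely the safer choice for markup you do not control.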
