How to remove character entity numbers from text

Question

When I get data from the feed and run the content through my regexes, I still have character entity numbers (&o#8230;, &o#8211;, &o#8220, etc.; I added the "o" so they would not be converted to characters when this post is displayed) in my content text. The sad thing is that these are also in the source of the feed's content. Is there a regex for that? I tried one myself, but with no success: &#[0-9]{4};

My code:

protected override void OnNavigatedTo(System.Windows.Navigation.NavigationEventArgs e)
    {
      
        try
        {        
                          
            SyndicationItem sItem = IsolatedStorageSettings.ApplicationSettings["postovi"] as SyndicationItem; // stores the item the user chose to display
            List<string> CC_List = IsolatedStorageSettings.ApplicationSettings["ContentList"] as List<string>; // titles and content are pulled from the feed and put in this list

            PageTitle.Text = sItem.Title.Text; 
            PageTitle.FontSize = 40;

            foreach (var item in CC_List)
            {
                int i;
              
                if (item == PageTitle.Text)
                {
                    i = CC_List.IndexOf(item, 0); // index of the title in the list
                    String content = CC_List[i + 1];
                    content = Regex.Replace(content, @"(?<startTag><\s*script[^>]*>)(?<content>[\s\S]*?)(?<endTag><\s*/script[^>]*>)", string.Empty);
                    Match link = Regex.Match(content, @"(?<=<img\s+[^>]*?src=(?<q>['""]))(?<url>.+?)(?=\k<q>)", RegexOptions.Singleline);
                    content = Regex.Replace(content, @"(?></?\w+)(?>(?:[^>'""]+|'[^']*'|""[^""]*"")*)>", string.Empty);
                    content = Regex.Replace(content, "&nbsp;", string.Empty);
                    Uri uri = new Uri(link.Value);
                    slika_clanak.Source = ImageFromUri(link.Value); // gets image
                    content = Regex.Replace(content, @"<p>.*</p>", string.Empty);
                    
                    clanak_textblock.Text = content.Trim(); // reads article text and puts it on screen
                                            
                }
              
            }

You might try http://htmlagilitypack.codeplex.com/. – Derek Beattie Apr 27 '12 at 12:37

score 2 · Answer 1 · answered Apr 27 '12 at 12:35

Have you tried the HttpUtility.HtmlDecode method? It is included as standard in System.Net, though I can't say for certain whether it is available on WP7 as well.
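
As a rough illustration of the decoding approach suggested here, the sketch below wraps the call in a hypothetical helper (the EntityDecoding class and DecodeEntities method names are illustrative and not from the original post). It uses System.Net.WebUtility.HtmlDecode, which is an assumption made so the example compiles on desktop .NET; on WP7 the call would presumably be HttpUtility.HtmlDecode instead.

```csharp
using System;
using System.Net;

// Hypothetical helper, not part of the asker's code.
public static class EntityDecoding
{
    // Converts named and numeric HTML entities (for example &amp; or &#8230;)
    // into the characters they stand for.
    public static string DecodeEntities(string content)
    {
        if (string.IsNullOrEmpty(content))
        {
            return content;
        }

        // Assumption: desktop .NET. On WP7, try HttpUtility.HtmlDecode(content).
        return WebUtility.HtmlDecode(content);
    }
}
```

Calling EntityDecoding.DecodeEntities(content) after the tag-stripping regexes, and before assigning clanak_textblock.Text, should replace the entity numbers with real characters instead of leaving them in the text.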

Styxxy


Based on http://stackoverflow.com/questions/2573290/httputility-urlencode-in-windows-phone-7 it looks like System.Net.HttpUtility is where it's located. To my knowledge, the System.Web namespace typically isn't used in Silverlight/WP7, but it's good that they included what they realized we would need. – w0rd-driven Apr 27 '12 at 12:45

score 0 · Answer 2 · answered Apr 27 '12 at 14:18

Despite my comment, I realized a second option could be the Html Agility Pack, which has a WP7.5 binary found here. You may run into the issue posted here on SO and echoed by this post http://htmlagilitypack.codeplex.com/discussions/282469, where you have to include certain libraries for compilation. The reason I mention it is that there's a very beefy HtmlEntity class that builds a dictionary of all the entities. You may not be able to use DeEntitize() directly, but you can study how it works to build something that strips everything out if you need to.

I personally wouldn't want to work out the regex by hand; I would use something like this that is already built for me, then loop through everything I thought was relevant. Of course, this is the phone, so you may be better off stripping on a case-by-case basis, but that becomes difficult if a feed is constantly changing and you don't have enough sample data to build from.
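
For the case-by-case route, here is a minimal sketch of handling only the decimal entity numbers from the question (such as &#8230;) with a regex. The NumericEntities class and its Decode and Strip methods are hypothetical names made up for this example. Named entities like &nbsp; and hexadecimal references are deliberately not covered, and the pattern assumes every reference ends with a semicolon.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper for decimal character references only.
public static class NumericEntities
{
    // Matches references such as &#8230; (1 to 7 ASCII digits, semicolon required).
    private static readonly Regex Pattern = new Regex(@"&#([0-9]{1,7});");

    // Replaces each reference with the character it names.
    // References outside the valid Unicode range are left untouched.
    public static string Decode(string content)
    {
        return Pattern.Replace(content, m =>
        {
            int codePoint = int.Parse(m.Groups[1].Value, CultureInfo.InvariantCulture);
            bool valid = codePoint > 0
                && codePoint <= 0x10FFFF
                && (codePoint < 0xD800 || codePoint > 0xDFFF);
            return valid ? char.ConvertFromUtf32(codePoint) : m.Value;
        });
    }

    // Removes the references entirely instead of converting them.
    public static string Strip(string content)
    {
        return Pattern.Replace(content, string.Empty);
    }
}
```

Decode keeps the punctuation the feed intended (ellipses, dashes, curly quotes), while Strip simply drops it, so which one fits depends on whether the article text should keep those characters.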

w0rd-driven


Installed the Agility Pack; will give it a shot. But I'm a noob at it, so any guidance on how to parse content using HAP would be nice. – Goran303 Apr 27 '12 at 19:41
I get this error: Error 1 The type 'System.Xml.XPath.IXPathNavigable' is defined in an assembly that is not referenced. You must add a reference to assembly 'System.Xml.XPath, Version=2.0.5.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35'. This is the only HAP code I have written so far: HtmlDocument doc = new HtmlDocument(); doc.LoadHtml(content); – Goran303 Apr 27 '12 at 20:21
You need to add that reference. It's somewhere like C:/Program Files/Microsoft SDKS/Silverlight/7.1/bin. Just search your filesystem for System.Xml.XPath.dll – William Melani Apr 28 '12 at 02:09
Yeah, found it, thanks. Still, my feed isn't 100% HTML, so I doubt HAP will help much or be able to reformat special signs like ’(; minus ( – Goran303 Apr 28 '12 at 11:22
HAP's strong suit I believe is that it doesn't need strict HTML. The entities dictionary seems to cover everything. It's massive. The hard part for you will likely be taking their internal functions in that code and getting it to strip things out as I believe their functions just convert to/from entities. You may also be introducing something the framework could utilize and I apologize for my lack of intimate knowledge of it. – w0rd-driven Apr 30 '12 at 14:19
