I am trying to extract text from HTML files, such as this page:

https://artkapakistan.wordpress.com/2013/01/08/debunking-the-myth-of-the-artist/

I use HtmlAgilityPack to get the inner HTML of the element with the entry-content class and then remove the HTML tags. There seems to be an encoding problem, because I am getting strange characters: â€™ and Â, to be exact. From what I found online, the first is a curly single quote and the second is a non-breaking space. I tried using regex to replace the single and double quotes, with no success.

  s1 = Regex.Replace(s1, "’|‘", "'");
  s1 = Regex.Replace(s1, "“|”", "\"");

But I am unable to get them replaced. There seems to be some issue with encoding. I am not well versed in regex and string replacements. Can you help me solve this issue? I have searched for 'fixing unicode issues in c#' with no success. I would be grateful for any help.
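
For context, this pattern of characters is what typically appears when UTF-8 bytes are decoded as Windows-1252. The sketch below reproduces that effect. The class and method names are hypothetical, and it assumes .NET Core 3.0 or later, where code page 1252 has to be registered through CodePagesEncodingProvider before it can be used.

```csharp
using System.Text;

// Hypothetical helper, not part of the question's code.
public static class MojibakeDemo
{
    private static readonly Encoding Windows1252 = CreateWindows1252();

    private static Encoding CreateWindows1252()
    {
        // Assumption: .NET Core 3.0+ / .NET 5+, where legacy code pages
        // must be registered before Encoding.GetEncoding(1252) works.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252);
    }

    // Encodes the text as UTF-8, then decodes those bytes with the wrong
    // code page, which is how the garbled characters come about.
    public static string Garble(string text)
    {
        return Windows1252.GetString(Encoding.UTF8.GetBytes(text));
    }

    // Reverses Garble: re-encodes as Windows-1252 and decodes as UTF-8.
    public static string Repair(string garbled)
    {
        return Encoding.UTF8.GetString(Windows1252.GetBytes(garbled));
    }
}
```

Calling `MojibakeDemo.Garble("\u2019")` gives the three characters â€™, and a non-breaking space (`"\u00A0"`) becomes Â followed by a non-breaking space. `Repair` is mainly useful as a diagnostic; decoding the page with the right encoding in the first place avoids the problem.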

EDIT: The following is how I retrieve the inner HTML and text.

        text = document.DocumentNode.SelectSingleNode(postBodyClass).InnerHtml;
        text = RemoveHTMLTags(text);
        text = RemoveHTMLPunctuation(text);

        public static string RemoveHTMLPunctuation(string input)
        {
            string s1 = input;
            s1 = System.Net.WebUtility.HtmlDecode(s1);
            //replace html left right single double quotation marks
            s1 = Regex.Replace(s1, "â€¦", "…");
            s1 = Regex.Replace(s1, "â€™", "'");
            s1 = Regex.Replace(s1, "â€œ|â€", "\"");
            //replace unicode right and left quotation marks with straight quotation
            string s2 = s1.Replace("&ldquo;", "\x201c");
            string s3 = s2.Replace("&rsquo;", "\x2019");
            string s4 = s3.Replace("&rdquo;", "\x201d");
            string s5 = s4.Replace("&hellip;", "\x2026");
            string s6 = s5.Replace("&nbsp;", "");
            s6 = s6.Replace("&laquo;", "");
            string s7 = s6.Replace("&quot;", "\"");
            string s8 = s7.Replace("&amp;", "&");
            s8 = Regex.Replace(s8, "&[a-z]+;", "");
            s8 = Regex.Replace(s8, "'", "'");
            //remove non breaking space
            s8 = Regex.Replace(s8, " |Â", "");
            //add missing spaces after punctuation marks
            //s8 = Regex.Replace(s8, "([\\.\\?,;:])(\\w+)", "$1 $2");
            return s8;
        }
        public static string RemoveHTMLTags(string input)
        {
            string s1 = input;
            //remove script tag and everything within.
            s1 = Regex.Replace(s1, "\\<script\\s*[^><]+\\>[^><]*\\</\\s*script\\>", "");
            s1 = Regex.Replace(s1, "\\<\\s*br\\s*/*\\s*\\>", Environment.NewLine);
            //add new line for div p or li tag
            s1 = Regex.Replace(s1, "\\<\\s*/(div|p|li)\\s*\\s*\\>", Environment.NewLine);
            s1 = Regex.Replace(s1, "\\>=", "");
            string s2 = Regex.Replace(s1, "&ldquo;", "\x201c");
            string s3 = Regex.Replace(s2, "\\<[Aa]([^><]+|\\s*)\\>.*\\</\\s*[Aa]\\s*\\>", "");
            string s4 = Regex.Replace(s3, "\\<[^<>]+\\>", "");
            string s5 = Regex.Replace(s4, "\\|", "");
            //replace multiple lines with 1 line
            s5 = Regex.Replace(s5, "(\\r\\n|\\r|\\n){2,}", Environment.NewLine);
            //any annoying text put it here to replace from post text
            //s5 = Regex.Replace(s5, "Copyright (c) 2008 Saadia Malik", "");
            s5 = s5.Trim();
            return s5;
        }
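
One way to make the replacements independent of how the source file itself is saved is to decode the entities first and then match the typographic characters by their Unicode escapes. This is a minimal sketch, assuming the text has already been decoded with the correct encoding. `PunctuationNormalizer` is a hypothetical name, and unlike the code above it maps a non-breaking space to a regular space rather than removing it.

```csharp
using System.Net;
using System.Text;

// Hypothetical helper, not part of the question's code.
public static class PunctuationNormalizer
{
    public static string Normalize(string input)
    {
        // Turn entities such as &rsquo; or &nbsp; into real characters first.
        string decoded = WebUtility.HtmlDecode(input);
        var sb = new StringBuilder(decoded.Length);
        foreach (char c in decoded)
        {
            switch (c)
            {
                case '\u2018': // left single quotation mark
                case '\u2019': // right single quotation mark
                    sb.Append('\'');
                    break;
                case '\u201C': // left double quotation mark
                case '\u201D': // right double quotation mark
                    sb.Append('"');
                    break;
                case '\u2026': // horizontal ellipsis
                    sb.Append("...");
                    break;
                case '\u00A0': // non-breaking space
                    sb.Append(' ');
                    break;
                default:
                    sb.Append(c);
                    break;
            }
        }
        return sb.ToString();
    }
}
```

Because the characters are written as `\u` escapes, the patterns cannot be corrupted by the editor or compiler reading the source file in a different encoding.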
Shakir
  • Please show how you read the HTML into `s1`. – Dmitri Trofimov Apr 15 '16 at 14:37
  • Looks like you can [select the encoding](http://stackoverflow.com/questions/3452343/c-sharp-and-htmlagilitypack-encoding-problem). I would just try some different ones until the weird characters go away. Or follow the link given by the answer in my link where the guy shows you how to detect encodings in the html header. – Quantic Apr 15 '16 at 14:39
  • [HttpUtility.HtmlDecode](https://msdn.microsoft.com/sv-se/library/7c5fyk1k(v=vs.110).aspx). – Visual Vincent Apr 15 '16 at 14:45
  • Have you tried using `HtmlAgilityPack.HtmlEntity.DeEntitize(string)`? Also is there a specific reason you read the inner _Html_ instead of inner _Text_? – LocEngineer Apr 15 '16 at 15:08
  • The inner text does not preserve lines and paragraphs. Past experience with HtmlAgilityPack has shown that it works better to take the inner HTML and then clean it. Above is how I do it. In my last attempt I tried to replace these nonsense characters, with no success. They do get replaced, but double quotes become like this: â" and double dash like this: â"“ – Shakir Apr 15 '16 at 16:54
  • Updating to a new version of HtmlAgilityPack and forcing the encoding fixed my problem. – Shakir Apr 16 '16 at 09:02
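
Forcing the encoding can be sketched roughly as follows. This is not necessarily what was done here: it assumes the page is served as UTF-8, uses a hypothetical `PostLoader` class, and guesses an XPath for the entry-content class, since the value of `postBodyClass` is not shown in the question.

```csharp
using System.Net;
using System.Text;
using HtmlAgilityPack;

// Hypothetical helper, not part of the question's code.
public static class PostLoader
{
    public static string LoadEntryContentHtml(string url)
    {
        using (var client = new WebClient())
        {
            // Download raw bytes so that no default encoding is applied.
            byte[] raw = client.DownloadData(url);

            // Assumption: the page is UTF-8.
            string html = Encoding.UTF8.GetString(raw);

            var document = new HtmlDocument();
            document.LoadHtml(html);

            // Assumed XPath; the real selector is whatever postBodyClass holds.
            HtmlNode node = document.DocumentNode.SelectSingleNode(
                "//div[contains(@class, 'entry-content')]");
            return node == null ? string.Empty : node.InnerHtml;
        }
    }
}
```

Decoding the bytes explicitly keeps the choice of encoding in one visible place. `HtmlDocument.Load(Stream, Encoding)` is another way to pass the encoding if the HTML is read from a stream or file.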
