Convert a webpage into plain text?

Question

I am trying to convert a webpage into plain text. When the page contains a table, I get the td and tr tags in the output too. If I strip the table tags with a regex, I lose some of the content.

Here is my code:

```csharp
string s = Regex.Replace(htmldoc, "<script.*?</script>", "", RegexOptions.Singleline | RegexOptions.IgnoreCase);
s = Regex.Replace(s, "<!--.*?-->", "", RegexOptions.Singleline | RegexOptions.IgnoreCase);
s = Regex.Replace(s, "<style.*?style>", "", RegexOptions.Singleline | RegexOptions.IgnoreCase);
s = Regex.Replace(s, "<a.*?a>", "", RegexOptions.Singleline | RegexOptions.IgnoreCase);
s = Regex.Replace(s, "<img.*?img>", "", RegexOptions.Singleline | RegexOptions.IgnoreCase);
s = Regex.Replace(s, "<table.*?table>", "", RegexOptions.Singleline | RegexOptions.IgnoreCase);
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(s);
s = doc.DocumentNode.SelectSingleNode("//body").InnerText.Trim();
```

How can I get the contents of a table without also getting the td and tr tags?

I'll add the obligatory warning that it's not wise to use [regex to parse XML/HTML](http://stackoverflow.com/questions/2400623/if-youre-not-supposed-to-use-regular-expressions-to-parse-html-then-how-are-htm). Your problem shows this well: the tree structure of HTML table tags makes them hard to handle with regex. — Matt D, Jul 08 '11 at 16:53
Possible duplicate: http://stackoverflow.com/questions/731649/how-can-i-convert-html-to-text-in-c — Anderson Green, May 10 '13 at 20:13

score 1 · Accepted Answer · edited May 23 '17 at 11:51

If you are using HTML Agility Pack to parse the table, you don't need to remove the HTML tags with your regex first. There are some good examples of parsing tables with HTML Agility Pack here on SO, for example "HTML Agility pack - parsing tables".
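
As a rough illustration of that approach (a sketch, not the code from the linked answer), the snippet below lets HTML Agility Pack select table cells with XPath and reads their `InnerText`, so no regex is involved. The class and method names (`TableText`, `LeafCellTexts`) are made up for this example. To cope with nested tables of unknown depth, it only takes cells that do not themselves contain another table; the trade-off is that text sitting directly in a cell alongside a nested table is skipped.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, named for illustration only.
public static class TableText
{
    // Returns the trimmed text of every td/th that has no nested table inside it,
    // so each piece of text is reported once however deeply tables are nested.
    public static List<string> LeafCellTexts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection cells = doc.DocumentNode.SelectNodes(
            "//td[not(.//table)] | //th[not(.//table)]");
        if (cells == null)
        {
            return result;
        }

        foreach (HtmlNode cell in cells)
        {
            // InnerText drops the markup; DeEntitize turns &amp; etc. back into characters.
            string text = HtmlEntity.DeEntitize(cell.InnerText).Trim();
            if (text.Length > 0)
            {
                result.Add(text);
            }
        }
        return result;
    }
}
```

Calling `TableText.LeafCellTexts(htmldoc)` on the downloaded page and joining the result with spaces or newlines gives the table content as plain text, without any td or tr tags.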

Answered by Tim, Jul 08 '11 at 15:17; edited by Community, May 23 '17 at 11:51

score 1 · Answer 2 · answered Jul 08 '11 at 15:19

You can use the body's InnerText:

```csharp
using HtmlAgilityPack;

string html = @"
<html>
    <title>title</title>
    <body>
           <h1> The wheel.</h1>
           Stop reinventing the wheel ! Use powerful APIs 
           for manipulating html docs !
           <h3> I am fine </h3>
           <img src=""da_wheel_in_my_mind.png""/>
    </body>
</html>";

// Parse the markup and read the text of everything under <body>.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string text = doc.DocumentNode.SelectSingleNode("//body").InnerText;
```

Next, you may want to collapse spaces and new lines:

```csharp
text = Regex.Replace(text, @"\s+", " ").Trim();
```

Note, however, that although this works here, InnerText simply drops the tags: markup such as `hello<br>world` or `hello<i>world</i>` both come out as `helloworld`, so the line break in the first case is lost. This is hard to solve in general, because how text is displayed is often determined by the CSS, not just by the markup.
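
One possible workaround, sketched below under some assumptions, is to walk the node tree yourself and add a space around elements that normally break the text flow. The list of such elements is an assumption of this sketch rather than anything defined by HTML Agility Pack, and the names `PlainText` and `FromHtml` are hypothetical. Because it ignores CSS, it cannot reproduce the rendered layout exactly, but it keeps words on either side of `<br>`, `<td>` and similar tags apart, including in nested tables.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

// Hypothetical helper, named for illustration only.
public static class PlainText
{
    // Elements assumed to separate text; adjust this list to taste.
    private static readonly HashSet<string> Breaking = new HashSet<string>
    {
        "br", "p", "div", "table", "tr", "td", "th", "li",
        "h1", "h2", "h3", "h4", "h5", "h6"
    };

    public static string FromHtml(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Fall back to the whole document when there is no <body>.
        HtmlNode root = doc.DocumentNode.SelectSingleNode("//body") ?? doc.DocumentNode;

        var sb = new StringBuilder();
        Append(root, sb);

        // Collapse runs of whitespace, as in the answer above.
        return Regex.Replace(sb.ToString(), @"\s+", " ").Trim();
    }

    private static void Append(HtmlNode node, StringBuilder sb)
    {
        if (node.NodeType == HtmlNodeType.Comment)
        {
            return; // skip <!-- ... -->
        }
        if (node.NodeType == HtmlNodeType.Text)
        {
            sb.Append(HtmlEntity.DeEntitize(node.InnerText));
            return;
        }
        if (node.Name == "script" || node.Name == "style")
        {
            return; // their content is not visible text
        }

        bool breaks = Breaking.Contains(node.Name);
        if (breaks) sb.Append(' ');
        foreach (HtmlNode child in node.ChildNodes)
        {
            Append(child, sb); // recursion handles nesting of any depth
        }
        if (breaks) sb.Append(' ');
    }
}
```

Inline elements such as `<i>` are deliberately left out of the list, so `hello<i>world</i>` still becomes `helloworld`, which matches how a browser would render it.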

I already did that, but my question is how to parse a table. If a table contains another table, and you don't know how many tables it contains, how do you get the inner text? — Ajit Hegde, Jul 08 '11 at 15:40
There is something I don't understand. How do you load your `htmldoc` variable? — Stephan, Jul 08 '11 at 16:00
`WebClient wb = new WebClient(); htmldoc = wb.DownloadString(querry);` — Ajit Hegde, Jul 08 '11 at 18:34
