11

I am writing an application that crawls a group of my web pages. Rather than storing each page's entire source, I'd like to extract just the content and store it as plain text in a database. The content will be consumed by other applications rather than read by users, so it doesn't need to be perfectly human-readable.

At first I considered regular expressions, but I have no control over the validity of the web pages, so there is a good chance that no regular expression would reliably extract the content.

If I have the source code within a string, how can I turn that string of source code into just the content in C#?

Mike B
  • Define "just the content"... all the html is content, so you could just store the html. Do you mean "just the text, no markup"? or what? – Marc Gravell Jan 10 '10 at 18:51
  • Why don't you parse them as XML? That way you can read the nodes and take just the content. However, I am not sure whether an XML parser can read self-closing tags. – Madi D. Jan 10 '10 at 18:53
  • XML supports self-closing tags, but unfortunately many so-called HTML documents contain malformed tags. – Eilon Jan 10 '10 at 18:53
  • Pretty much "just the text", although I would disagree that the HTML is content as for me it only serves as structure and it would be meaningless to store it. – Mike B Jan 10 '10 at 18:56
  • @EnderMB - in that case, I've added an example using HTML Agility Pack – Marc Gravell Jan 10 '10 at 18:58

4 Answers

22

It isn't 100% clear what you want, but I'm assuming you want the text minus markup; so:

using System.Net;
using System.Text;
using HtmlAgilityPack;

string html;
// obtain some arbitrary html....
using (var client = new WebClient()) {
    html = client.DownloadString("http://stackoverflow.com/questions/2038104");
}
// use the html agility pack: http://www.codeplex.com/htmlagilitypack
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
StringBuilder sb = new StringBuilder();
// by default SelectNodes returns null (not an empty collection) when nothing matches
var textNodes = doc.DocumentNode.SelectNodes("//text()");
if (textNodes != null) {
    foreach (HtmlTextNode node in textNodes) {
        sb.AppendLine(node.Text);
    }
}
string final = sb.ToString();
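
The `//text()` query also returns the text inside `<script>` and `<style>` elements, so pages with inline JavaScript or CSS yield that code as "content". A minimal sketch of one way to avoid this, assuming the HtmlAgilityPack NuGet package is referenced; the class and method names (`HtmlTextExtractor`, `ExtractText`) are hypothetical and not part of the library:

```csharp
using System;
using System.Linq;
using System.Text;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class HtmlTextExtractor
{
    // Hypothetical helper: returns the visible text of an HTML string, one text node per line.
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Copy the matches into a list first, so that removing nodes
        // does not modify the tree while Descendants() is being enumerated.
        var unwanted = doc.DocumentNode.Descendants()
            .Where(n => n.Name == "script" || n.Name == "style"
                     || n.NodeType == HtmlNodeType.Comment)
            .ToList();
        foreach (var node in unwanted)
        {
            node.Remove();
        }

        var sb = new StringBuilder();
        foreach (var node in doc.DocumentNode.Descendants()
                     .Where(n => n.NodeType == HtmlNodeType.Text))
        {
            // DeEntitize turns entities such as &amp; back into characters.
            string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (text.Length > 0)
            {
                sb.AppendLine(text);
            }
        }
        return sb.ToString();
    }
}
```

Calling `HtmlTextExtractor.ExtractText(html)` on the downloaded string should then give the text without script or style bodies. Skipping whitespace-only nodes is a design choice for storage; drop the `Trim()` check if the original spacing matters.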
Marc Gravell
  • When I use this code to parse the Google homepage for text, all I get is tons of JavaScript. Any way to avoid that? – Win Coder Aug 21 '13 at 09:39
  • @WinCoder: this is how you remove JavaScript and CSS from the content of the page: http://stackoverflow.com/questions/13441470/htmlagilitypack-remove-script-and-style – user1892410 Jun 04 '15 at 20:22
8

Please, please do not parse HTML yourself! You cannot use just a standard regex to parse HTML - it's not possible.

There are tons of free libraries out there. One of the best free ones in the world of .NET is the HTML Agility Pack.

HTML Agility Pack also copes with malformed documents, which a regex or a strict parser such as an XML parser will almost never do.
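
To make the regex problem concrete, here is a small sketch using only the standard library. The class name `NaiveStripDemo` and the sample inputs are made up for illustration; the pattern is the kind of "match anything between angle brackets" expression people usually reach for:

```csharp
using System;
using System.Text.RegularExpressions;

public static class NaiveStripDemo
{
    // Removes anything that looks like a tag, as a simple regex approach would.
    public static string Strip(string html)
    {
        return Regex.Replace(html, "<.*?>", string.Empty);
    }
}
```

With this helper, `NaiveStripDemo.Strip("<a title=\"a > b\">link</a>")` returns ` b">link`, because the `>` inside the attribute value ends the match too early. `NaiveStripDemo.Strip("<p>Hi</p><script>alert(1);</script>")` returns `Hialert(1);`, because a script body is not a tag and so survives as if it were text. A real parser handles both cases.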

Eilon
3

The function below removes HTML tags, scripts, and CSS styles from an HTML string and converts it to plain text.

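// Requires: using System.Text.RegularExpressions;
private string GetPlainTextFromHtml(string htmlString)
{
    string htmlTagPattern = "<.*?>";
    // Remove <script> and <style> blocks together with their contents.
    var regexCss = new Regex(@"(<script(.+?)</script>)|(<style(.+?)</style>)", RegexOptions.Singleline | RegexOptions.IgnoreCase);
    htmlString = regexCss.Replace(htmlString, string.Empty);
    // Remove the remaining tags; Singleline lets a tag span several lines.
    htmlString = Regex.Replace(htmlString, htmlTagPattern, string.Empty, RegexOptions.Singleline);
    // Drop lines that contain only whitespace.
    htmlString = Regex.Replace(htmlString, @"^\s+$[\r\n]*", "", RegexOptions.Multiline);
    // Replace with a space so that adjacent words are not fused together.
    htmlString = htmlString.Replace("&nbsp;", " ");

    return htmlString;
}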
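
A function like this handles only `&nbsp;`; other entities such as `&amp;` or `&#233;` stay encoded in the stored text. A minimal sketch of a follow-up step, assuming the input has already had its tags stripped; `PlainTextCleanup` and `DecodeAndCollapse` are hypothetical names, while `WebUtility.HtmlDecode` is part of `System.Net`:

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class PlainTextCleanup
{
    // Hypothetical helper: run it on the output of a tag-stripping step.
    public static string DecodeAndCollapse(string text)
    {
        // HtmlDecode handles named and numeric entities (&amp;, &#233;, &nbsp;, ...).
        string decoded = WebUtility.HtmlDecode(text);
        // \s also matches the no-break space that &nbsp; decodes to,
        // so every run of whitespace becomes a single ordinary space.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

For example, `PlainTextCleanup.DecodeAndCollapse("Caf&#233;&nbsp;&amp;\n\n  bar ")` returns `Café & bar`. Decoding after the tags are removed, not before, keeps escaped markup such as `&lt;b&gt;` as literal text instead of turning it into a tag that would then be stripped.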
alin0509
0

I wrote code to strip out the raw text from markup and present it in my article Convert HTML to Text. The code presented is pretty simple and lightweight.

I also wrote a lightweight HTML parser and have posted it on GitHub as HTML Monkey. This would be a more complete solution, and extracting only the text from the parsed markup would be a simple task. I'm still working on this project and am looking for feedback on how it works.

Jonathan Wood