10

I am looking for some open source framework or algorithm to extract article text contents from any HTML page by cleaning the HTML code, removing garbage stuff, similar to what Pocket (aka Read It Later) software does.

Pocket official webpage: http://getpocket.com/

A similar question already exists: "How to extract text contents from html like Read it later or InstaPaper Iphone app?", but my requirement is a bit different. I want to clean the HTML and extract the main content together with its images, preserving the fonts and styles (CSS).

Furqan Safdar

2 Answers

17

I would recommend NReadability, together with HtmlAgilityPack.

After NReadability has transcoded the page, the main text is always in the div with id readInner.

//** replace this with any url **
string url = "http://www.bbc.co.uk/news/world-asia-19457334";

var t = new NReadability.NReadabilityWebTranscoder();
bool b;
string page = t.Transcode(url, out b);

if (b)
{
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(page);

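    // SelectSingleNode returns null when nothing matches (e.g. a page without an og:image tag),
    // so each lookup is null-guarded instead of dereferenced directly.
    var title = doc.DocumentNode.SelectSingleNode("//title")?.InnerText;
    var imgUrl = doc.DocumentNode.SelectSingleNode("//meta[@property='og:image']")?.GetAttributeValue("content", "");
    var mainText = doc.DocumentNode.SelectSingleNode("//div[@id='readInner']")?.InnerText;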
}
L.B
  • sorry, but why use NReadability if you use HtmlAgilityPack? – Rafael Herscovici Sep 02 '12 at 19:49
  • http://stackoverflow.com/questions/4182594/grab-all-text-from-html-with-html-agility-pack – Rafael Herscovici Sep 02 '12 at 19:50
  • The "documentation" for nReadability seems to suggest it is simply a pretty printer for HTML (a function that the HAP also has). – Oded Sep 02 '12 at 19:54
  • @Oded, Would you test the above code before commenting please. I used it before and know what it does. It really does what OP wants (*"cleaning the html code"*) – L.B Sep 02 '12 at 19:57
  • No offence meant - there is a lack of documentation about the library and all _I_ can see in the site suggests it is a pretty printer. I also seem to miss where the OP asks for "cleaning the html code" in the question. – Oded Sep 02 '12 at 19:58
  • This usage of `Transcode` is obsolete (a warning is issued), and the following should be used instead: `var transcoder = new NReadabilityWebTranscoder(); transcoder.Transcode(new WebTranscodingInput(yourUrl));` – OfirD Dec 28 '20 at 23:01
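Following up on the last comment, here is a minimal sketch of the non-obsolete overload. It assumes that the returned `WebTranscodingResult` exposes `ContentExtracted`, `ExtractedContent`, `TitleExtracted` and `ExtractedTitle`; check these names against the NReadability version you have installed. The URL is the one from the answer and is only a placeholder.

```csharp
using System;
using NReadability;

class TranscodeExample
{
    static void Main()
    {
        // Placeholder URL taken from the answer; replace with any article page.
        string url = "http://www.bbc.co.uk/news/world-asia-19457334";

        var transcoder = new NReadabilityWebTranscoder();

        // Overload suggested in the comment above; it downloads and transcodes the page.
        WebTranscodingResult result = transcoder.Transcode(new WebTranscodingInput(url));

        // Assumption: these property names match the installed NReadability version.
        if (result.ContentExtracted)
        {
            Console.WriteLine(result.TitleExtracted ? result.ExtractedTitle : "(no title found)");
            Console.WriteLine(result.ExtractedContent.Length + " characters of cleaned HTML");
        }
    }
}
```

The cleaned HTML in `ExtractedContent` can then be loaded into HtmlAgilityPack with `LoadHtml`, exactly as the answer does with `page`.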
2

Use the HTML Agility Pack - it is an open source HTML parser for .NET.

What is exactly the Html Agility Pack (HAP)?

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).

You can use this to query HTML and extract whatever data you wish.
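As a rough illustration of such a query, the sketch below pulls the plain text and the image URLs out of one container element. The class name `ArticleQuery`, its method names, and the container XPath passed in are hypothetical and chosen only for this example; HAP itself does not decide which element holds the article, so that XPath has to come from you or from a tool such as NReadability.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper for illustration; not part of the Html Agility Pack.
static class ArticleQuery
{
    // Returns the decoded text of the first node matching containerXPath,
    // or an empty string when nothing matches.
    public static string Text(string html, string containerXPath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var container = doc.DocumentNode.SelectSingleNode(containerXPath);
        return container == null
            ? string.Empty
            : HtmlEntity.DeEntitize(container.InnerText).Trim();
    }

    // Returns the src attribute of every <img> inside the first node matching containerXPath.
    public static List<string> ImageSources(string html, string containerXPath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var container = doc.DocumentNode.SelectSingleNode(containerXPath);
        if (container == null) return new List<string>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var images = container.SelectNodes(".//img[@src]");
        if (images == null) return new List<string>();

        return images.Select(img => img.GetAttributeValue("src", "")).ToList();
    }
}
```

For example, `ArticleQuery.ImageSources(html, "//div[@id='readInner']")` would list the images inside the div that NReadability produces. Preserving the original CSS is a separate problem that this sketch does not address.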

Oded
  • HAP is cool but I prefer ScrapySharp, which is built on top of HTML Agility Pack. It adds support for CSS selectors to HAP. http://nuget.org/packages/ScrapySharp – petro.sidlovskyy Sep 02 '12 at 19:43
  • I already tried to use it but I couldn't obtain the desired results with it. Can you guide me on how to extract article-style content (i.e. text, including sample code with its code style, and images)? – Furqan Safdar Sep 02 '12 at 19:45
  • @petro.sidlovskyy - Nice one! Didn't know about it and will probably start using it when I next need to scrape HTML. – Oded Sep 02 '12 at 19:45
  • @furqan.safdar - "couldn't obtain desired results" is not very descriptive. You need to give a better definition than that. – Oded Sep 02 '12 at 19:46
  • Actually, what I am trying to do is generate an article (a PDF document, using iText and iTextXmlWorker) by cleaning the HTML code. I just need to capture the main text contents, images, and sample code (if any) with the original CSS styles. I hope it is clear now. – Furqan Safdar Sep 02 '12 at 19:55
  • @furqan.safdar - That was already clear. What isn't clear is what problems you were having with that. – Oded Sep 02 '12 at 19:57
  • @Oded - I might not be familiar with its proper usage for my requirement. Can you please suggest some sample code so I can verify it? – Furqan Safdar Sep 02 '12 at 20:03
  • @furqan.safdar - The source download of the HAP comes with a bunch of sample projects that show proper usage. – Oded Sep 02 '12 at 20:05
  • @Oded - OK, I will try it again later. – Furqan Safdar Sep 02 '12 at 20:15