Extracting main text from a page using PHP

Question

I want to make something like readability, which extracts only the article text from any page and removes everything else...

I am using file_get_contents to get a webpage and this works fine.

After I get that, how can I extract out just the main article text using PHP?

Are there any plugins or is there a way to do it?

What do you mean by "main article"? How do you identify which one is the main one? — Aurelio De Rosa, Dec 30 '11 at 18:46
Define "main article text". What criteria do you have to extract that specific text? — Pekka, Dec 30 '11 at 18:46
Try [HTML Purifier](http://htmlpurifier.org/). (Or [Simple HTML DOM](http://simplehtmldom.sourceforge.net/), or heck maybe just [DOM](http://php.net/manual/en/book.dom.php).) — Jared Farrish, Dec 30 '11 at 18:48
@Aurelio and pekka, main article, I mean the main text content, like what readability does...I want to extract that... — David19801, Dec 30 '11 at 18:51

score 2 · Accepted Answer · edited May 23 '17 at 12:07

There are many libraries that help you parse HTML, and more than a few questions on SO that cover them (such as this one), but that's not your biggest problem.

Your issue is going to be how to determine what exactly is the main article. You could potentially determine what element has the most <p> tags as children, but there's no reason I can't make a CMS that doesn't use <p> tags at all.
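To make the `<p>`-counting idea concrete, here is a minimal sketch using PHP's built-in DOM extension. The function name `guessMainText` is made up for illustration, and the heuristic is only the naive one described above: it picks whichever element has the most direct `<p>` children, so it will fail on pages that don't use `<p>` tags or that put more paragraphs in a sidebar or comment section than in the article.

```php
<?php
// Hypothetical helper (not from any library): returns the text of the
// element that has the most direct <p> children, or null if none is found.
function guessMainText(string $html): ?string
{
    if (trim($html) === '') {
        return null; // DOMDocument::loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $counts = []; // node path => number of direct <p> children
    $nodes  = []; // node path => the parent element itself
    foreach ($doc->getElementsByTagName('p') as $p) {
        $parent = $p->parentNode;
        if ($parent === null) {
            continue;
        }
        $path = $parent->getNodePath();
        $counts[$path] = ($counts[$path] ?? 0) + 1;
        $nodes[$path]  = $parent;
    }

    if ($counts === []) {
        return null; // no <p> tags at all, the heuristic has nothing to go on
    }

    arsort($counts); // highest count first, keys preserved
    $bestPath = array_key_first($counts);

    // textContent concatenates all descendant text without adding separators.
    return trim($nodes[$bestPath]->textContent);
}
```

You would call it on whatever `file_get_contents` returned, e.g. `$text = guessMainText(file_get_contents($url));`, and treat a `null` result as "couldn't guess".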

score 1 · Answer 2 · answered Dec 30 '11 at 18:53


There are HTML parsers to help with the actual transformation of the content.

The question, as others have stated, is determining which parts are the content. In the absence of globally adopted, purely semantic markup (wouldn't it be wonderful?), you're going to go through a lot of trial and error to support various content from various sites. Depending on how many sites you want to support and how often their markup changes, that road can get pretty long.

Scraping data isn't as brown-and-serve as people wish it was.

answered Dec 30 '11 at 18:53 by David


@David19801: Well, that depends on how advanced of a system this needs to be. In the question you say "any page" which is a pretty broad range. In most cases you can get away with: 1) Store the whole page by default. 2) Develop some pattern matching algorithm to "guess" what parts are useful content. 3) Refine that algorithm against false matches over time as you gather more sample data. 4) For often-scraped sites, develop site-specific filters. You'll never hit the 100% mark. But you may hit the 80% mark, then 80% of the remainder, then 80% of _that_ remainder, and so on in Zeno's fashion. – David Dec 30 '11 at 19:02
I will read more about it today and I think this is solvable for my case (large text blocks in 90% empty pages). Thanks. – David19801 Dec 30 '11 at 19:06
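The "site-specific filters, generic guess otherwise" approach from the comment above could be sketched roughly like this. Everything here is an assumption for illustration: the function name `extractArticle`, the shape of the `$siteFilters` array (host name mapped to an XPath expression), and the idea of passing the generic guesser in as a callable. Only `DOMDocument`, `DOMXPath`, and `parse_url` are real PHP APIs.

```php
<?php
// Hypothetical sketch: try a per-site XPath rule first, then fall back to a
// generic heuristic supplied by the caller.
function extractArticle(string $url, string $html, array $siteFilters, callable $fallback): ?string
{
    if (trim($html) === '') {
        return null; // nothing to parse
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate invalid markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $host = parse_url($url, PHP_URL_HOST);
    $expression = is_string($host) ? ($siteFilters[strtolower($host)] ?? null) : null;

    if ($expression !== null) {
        $xpath = new DOMXPath($doc);
        $matches = $xpath->query($expression);
        if ($matches !== false && $matches->length > 0) {
            // Use the first matching container for this site.
            return trim($matches->item(0)->textContent);
        }
    }

    // No rule for this host, or the rule no longer matches the page:
    // hand the parsed document to the generic guesser.
    return $fallback($doc);
}
```

A usage example, with made-up hosts and rules:

```php
$filters = ['news.example.com' => '//div[@id="story"]'];
$generic = function (DOMDocument $doc): ?string {
    $body = $doc->getElementsByTagName('body')->item(0);
    return $body === null ? null : trim($body->textContent);
};
$text = extractArticle($url, file_get_contents($url), $filters, $generic);
```

Keeping the rules in a plain array means a filter for an often-scraped site can be added without touching the generic code, and a stale rule degrades to the guess instead of returning nothing.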
