I want to make something like Readability, which extracts only the article text from any page and removes everything else.

I am using file_get_contents to get a webpage and this works fine.

After I get that, how can I extract out just the main article text using PHP?

Are there any plugins or is there a way to do it?

David19801

2 Answers

There are many libraries that help you parse HTML, and more than a few questions on SO that cover them (such as this one), but that's not your biggest problem.

Your issue is going to be determining which element actually holds the main article. You could pick the element that has the most <p> tags as children, but nothing stops me from building a CMS that doesn't use <p> tags at all.
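As a minimal sketch of the "most <p> children" heuristic, assuming PHP's bundled DOM extension is available: the function name `extractByParagraphCount` is hypothetical, and the code will return nothing useful on pages that avoid <p> tags, which is exactly the weakness described above.

```php
<?php
// Hypothetical helper: returns the text of the <p> children of whichever
// element has the most direct <p> children. A sketch, not a robust extractor.
function extractByParagraphCount(string $html): string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Count <p> elements per parent, keyed by the parent's unique XPath.
    $counts = [];
    $parents = [];
    foreach ($doc->getElementsByTagName('p') as $p) {
        $parent = $p->parentNode;
        $key = $parent->getNodePath();
        $counts[$key] = ($counts[$key] ?? 0) + 1;
        $parents[$key] = $parent;
    }

    // Pick the parent with the highest count (first one wins on a tie).
    $bestKey = null;
    foreach ($counts as $key => $count) {
        if ($bestKey === null || $count > $counts[$bestKey]) {
            $bestKey = $key;
        }
    }
    if ($bestKey === null) {
        return ''; // No <p> tags at all: the heuristic has nothing to go on.
    }

    // Collect the text of that parent's direct <p> children.
    $paragraphs = [];
    foreach ($parents[$bestKey]->childNodes as $child) {
        if ($child->nodeName === 'p') {
            $text = trim($child->textContent);
            if ($text !== '') {
                $paragraphs[] = $text;
            }
        }
    }
    return implode("\n\n", $paragraphs);
}
```

You would feed it the string returned by file_get_contents, for example `echo extractByParagraphCount(file_get_contents($url));`, where `$url` is whatever page you are fetching.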

Community
Jeff Lambert

There are HTML parsers to help with the actual transformation of the content.

The question, as others have stated, is determining which parts are the content. In the absence of globally adopted, purely semantic markup (wouldn't it be wonderful?), you're going to go through a lot of trial and error to support various content from various sites. Depending on how many sites you want to support and how often their markup changes, that road can get pretty long.

Scraping data isn't as brown-and-serve as people wish it was.

David
  • @David19801: Well, that depends on how advanced a system this needs to be. In the question you say "any page", which is a pretty broad range. In most cases you can get away with: 1) Store the whole page by default. 2) Develop some pattern-matching algorithm to "guess" what parts are useful content. 3) Refine that algorithm against false matches over time as you gather more sample data. 4) For often-scraped sites, develop site-specific filters. You'll never hit the 100% mark. But you may hit the 80% mark, then 80% of the remainder, then 80% of _that_ remainder, and so on in Zeno's fashion. – David Dec 30 '11 at 19:02
  • I will read more about it today and I think this is solvable for my case (large text blocks in 90% empty pages). Thanks. – David19801 Dec 30 '11 at 19:06
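One possible reading of the steps in David's comment, sketched in PHP with the bundled DOM extension. Everything named here is an assumption made for illustration: the functions `loadDom`, `normalizeText`, `guessMainText` and `extractArticle`, the host-to-XPath filter map, and the 200-character threshold. The generic guess scores each container by the text in its direct text nodes and direct <p> children, so it also copes with markup that uses no <p> tags. Step 3, refining against false matches, is the part that stays manual.

```php
<?php
// Parse possibly invalid HTML without emitting warnings.
function loadDom(string $html): DOMDocument
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
}

// Collapse runs of whitespace and trim.
function normalizeText(string $text): string
{
    return trim((string) preg_replace('/\s+/', ' ', $text));
}

// Step 2 (generic guess): choose the container whose direct text nodes and
// direct <p> children hold the most text. Returns '' below $minLength.
function guessMainText(DOMDocument $doc, int $minLength): string
{
    $xpath = new DOMXPath($doc);
    $best = '';
    foreach ($xpath->query('//div | //article | //section | //td') as $node) {
        $parts = [];
        foreach ($node->childNodes as $child) {
            if ($child->nodeType === XML_TEXT_NODE || $child->nodeName === 'p') {
                $text = normalizeText($child->textContent);
                if ($text !== '') {
                    $parts[] = $text;
                }
            }
        }
        $candidate = implode("\n\n", $parts);
        if (strlen($candidate) > strlen($best)) {
            $best = $candidate;
        }
    }
    return strlen($best) >= $minLength ? $best : '';
}

// $siteFilters is a hypothetical map of host name => XPath of the article node.
function extractArticle(string $html, string $host, array $siteFilters = [], int $minLength = 200): string
{
    $doc = loadDom($html);

    // Step 4: a site-specific filter wins when one exists and matches.
    if (isset($siteFilters[$host])) {
        $nodes = (new DOMXPath($doc))->query($siteFilters[$host]);
        if ($nodes !== false && $nodes->length > 0) {
            return normalizeText($nodes->item(0)->textContent);
        }
    }

    // Step 2: otherwise fall back to the generic guess.
    $guess = guessMainText($doc, $minLength);
    if ($guess !== '') {
        return $guess;
    }

    // Step 1: when nothing convincing is found, keep the whole page's text.
    $root = $doc->documentElement;
    return $root === null ? '' : normalizeText($root->textContent);
}
```

Keeping the raw HTML around as well means you can re-run a refined guess later without fetching the page again.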