There are many HTML text extraction tools out there, mostly for Java or Python; the one I come across most often is boilerpipe. There are also a few APIs, and some seem to work pretty well. Does anyone know of a library in PHP that does this?

Bill
  • Define "html text extraction". Are you looking for [DOM](http://php.net/manual/en/book.dom.php)? – DaveRandom Jul 07 '12 at 22:35
  • No. You know how iOS has a "Reader" mode that takes out all of the junk on a site, like ads and navigation, and just shows the content so it's easier to read? That's what I mean. – Bill Jul 08 '12 at 19:00

2 Answers

You could try phpQuery:

http://code.google.com/p/phpquery/

Austin

DOMDocument is a class available in PHP, provided you have libxml support, that can parse HTML documents and lets you iterate over their nodes or issue XPath queries to find specific nodes in the DOM tree. This is the ideal method.
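As a rough illustration of the DOMDocument and XPath approach, here is a minimal sketch. The sample markup, the `content` id, and the helper name `extractParagraphs` are assumptions made up for this example; a real page would need its own XPath expression.

```php
<?php
// Hypothetical sample page; the markup and the "content" id are assumptions.
$html = <<<HTML
<html><body>
  <div id="nav"><a href="/">Home</a></div>
  <div id="content"><p>First paragraph.</p><p>Second paragraph.</p></div>
  <div id="ads">Buy now!</div>
</body></html>
HTML;

// Hypothetical helper: returns the trimmed text of every node matching $xpathExpr.
function extractParagraphs(string $html, string $xpathExpr): array
{
    $doc = new DOMDocument();

    // Real-world HTML is often malformed; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $texts = [];
    foreach ($xpath->query($xpathExpr) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

// Select only the paragraphs inside the assumed content container.
$paragraphs = extractParagraphs($html, '//div[@id="content"]//p');
print_r($paragraphs);
```

Note that this only works when you already know where the content lives in the page; it does not detect the main content automatically the way a reader mode does.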

Or, if the text is simple enough and uniform, you can use preg_match() to extract it from the data using regular expressions.
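For the simple, uniform case, a sketch with preg_match() might look like the following. The HTML fragment is a made-up assumption, and a pattern like this will break on nested or irregular markup.

```php
<?php
// Hypothetical fragment with a simple, predictable structure (an assumption).
$html = '<html><head><title>Example Page</title></head><body><h1>Hello</h1></body></html>';

// Capture the text between <title> and </title>.
// "i" ignores case; "s" lets "." match newlines; ".*?" is non-greedy.
if (preg_match('/<title>(.*?)<\/title>/is', $html, $matches) === 1) {
    $title = trim($matches[1]);
} else {
    $title = null; // no match found
}

echo $title, PHP_EOL;
```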

drew010
  • Oooh, living dangerously there. You can get crucified for suggesting that here, you know (you know what I'm talking about). How long before a standard link appears...? – DaveRandom Jul 07 '12 at 22:36
  • @DaveRandom :) Yeah I know what you mean. I try to be pragmatic about this kind of thing since sometimes it may work just as well. – drew010 Jul 08 '12 at 02:28