4

I have some random HTML layouts that contain important text I would like to extract. I cannot just strip_tags() as that will leave a bunch of extra junk from the sidebar/footer/header/etc.

I found a method built in Python and I was wondering if there is anything like this in PHP.

The concept is rather simple: use information about the density of text vs. HTML code to work out if a line of text is worth outputting. (This isn’t a novel idea, but it works!) The basic process works as follows:

  1. Parse the HTML code and keep track of the number of bytes processed.
  2. Store the text output on a per-line, or per-paragraph basis.
  3. Associate with each text line the number of bytes of HTML required to describe it.
  4. Compute the text density of each line by calculating the ratio of text to bytes.
  5. Then decide if the line is part of the content by using a neural network.

You can get pretty good results just by checking if the line’s density is above a fixed threshold (or the average), but the system makes fewer mistakes if you use machine learning - not to mention that it’s easier to implement!
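A minimal PHP sketch of the fixed-threshold variant of this idea (no machine learning); the 0.5 cutoff and the per-line splitting are assumptions for illustration, not part of the original algorithm:

// Rough sketch: keep lines whose ratio of visible text to raw HTML bytes
// exceeds a fixed threshold. The 0.5 cutoff is an arbitrary illustration value.
function extract_dense_lines($html, $threshold = 0.5)
{
    $kept = array();
    foreach (explode("\n", $html) as $line) {
        $htmlBytes = strlen($line);
        if ($htmlBytes === 0) {
            continue;
        }
        $text = trim(strip_tags($line));
        // Density = bytes of visible text / bytes of HTML needed to describe it.
        if (strlen($text) > 0 && strlen($text) / $htmlBytes >= $threshold) {
            $kept[] = $text;
        }
    }
    return implode("\n", $kept);
}

echo extract_dense_lines(file_get_contents('http://blog.example.com/post'));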

Update: I started a bounty for an answer that could pull main content from a random HTML template. Since I can't share the documents I will be using - just pick any random blog sites and try to extract the body text from the layout. Remember that the header, sidebar(s), and footer may contain text also. See the link above for ideas.

Xeoncross
  • What do you mean by "extract" - extract with full intact HTML (like ``), or text only? – Pekka Mar 18 '11 at 19:15
  • I would not reimplement this. Either use the python module directly `$text = exec("python -m ...")` or use an online service http://boilerpipe-web.appspot.com/ – mario Mar 18 '11 at 19:16
  • @Pekka, I would rather have the markup elements (like code blocks or object embeds) along with text - but just plain text is fine also. @mario - Thanks! That looks like a good start - but I really need something that I can run locally and I would rather not add Java to my server apps if possible. – Xeoncross Mar 18 '11 at 19:37
  • *(related)* [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) to work with the markup. For the density and stuff you'd have to find some additional tool. – Gordon Mar 22 '11 at 08:11
  • You're probably looking for something like the Readability algorithm; see this question for more info and implementations: http://stackoverflow.com/questions/1146934/create-great-parser-extract-relevant-text-from-html-blogs – Richard M Mar 24 '11 at 02:46
  • If you only want to look at the "main" content and exclude sidebars, headers and navigation blocks etc., you need to provide some more specific requirements beyond: _"just pick any random blog sites and try to extract the body text from the layout"_ (if you want a good answer that is...) – ridgerunner Mar 24 '11 at 16:35
  • @Richard, that is the best resource I've seen yet. @ridgerunner I don't know the layout of the documents I need to process (not willing to go through them all), otherwise I would just use XPath. – Xeoncross Mar 25 '11 at 03:04
  • @Xeoncross: updated once again and added extras. ;) Test it. – Luca Filosofi Mar 29 '11 at 11:16
  • What solution have you used? I'm trying to make that kind of scraper using php but I'm not sure yet which is the best alternative. – gabitzish Sep 04 '12 at 07:32

5 Answers

5
  • phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on jQuery JavaScript Library.

UPDATE 2

  1. many blogs are built on a CMS;
  2. the HTML structure of blogs is almost always the same;
  3. avoid common selectors like #sidebar, #header, #footer, #comments, etc.;
  4. avoid any widget by tag name (script, iframe);
  5. strip well-known boilerplate text with patterns like the following (a short preg_replace sketch follows this list):
    1. /\d+\scomment(?:[s])/im
    2. /(read the rest|read more).*/im
    3. /(?:.*(?:by|post|submitt?)(?:ed)?.*\s(at|am|pm))/im
    4. /[^a-z0-9]+/im
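A small sketch of how those cleanup patterns could be applied once a candidate block of text has been extracted; the last pattern from the list above is left out here because, applied as-is with preg_replace, it would also strip all whitespace and punctuation:

// Assumption for illustration: $text already holds the extracted candidate block.
$noise = array(
    '/\d+\scomment(?:[s])/im',
    '/(read the rest|read more).*/im',
    '/(?:.*(?:by|post|submitt?)(?:ed)?.*\s(at|am|pm))/im',
);
$text = preg_replace($noise, '', $text);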

search for well-known classes and IDs:

  • typepad.com .entry-content
  • wordpress.org .post-entry .entry .post
  • movabletype.com .post
  • blogger.com .post-body .entry-content
  • drupal.com .content
  • tumblr.com .post
  • squarespace.com .journal-entry-text
  • expressionengine.com .entry
  • gawker.com .post-body

  • Ref: The blog platforms of choice among the top 100 blogs


$selectors = '.post-body, .post, .journal-entry-text, .entry-content, .content';
$doc = phpQuery::newDocumentFile('http://blog.com')->find($selectors)->children('p,div');

or search based on a common HTML structure that looks like this:

<div>
<h1|h2|h3|h4|a />
<p|div />
</div>

$doc = phpQuery::newDocumentFile('http://blog.com')->find('h1,h2,h3,h4')->parent()->children('p,div');
Luca Filosofi
  • Awesome, I probably won't be using this - but I started another bounty to give you some credit for your work since I'm sure others will be able to use it. – Xeoncross Mar 28 '11 at 16:35
3

DOMDocument can be used to parse HTML documents, which can then be queried through PHP.
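For example, a minimal sketch using DOMDocument with DOMXPath; silencing libxml errors lets it cope with real-world markup, and the "keep only longer paragraphs" rule is just an assumed heuristic:

// Load the page, tolerating malformed HTML, then pull out the longer <p> blocks.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML(file_get_contents('http://blog.example.com/post'));
$xpath = new DOMXPath($doc);

$paragraphs = array();
foreach ($xpath->query('//p') as $p) {
    $text = trim($p->textContent);
    if (strlen($text) > 80) {   // assumed heuristic: body text tends to be long
        $paragraphs[] = $text;
    }
}
echo implode("\n\n", $paragraphs);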


Pedro
  • Yes, I'm currently experimenting with both it and regex and having some good luck. If you disable `E_WARNING` errors in PHP and load content in via `loadHTML()` you can even parse invalid HTML pretty well. – Xeoncross Mar 21 '11 at 22:15
  • After building both a regex HTML parser (75% success rate) and a DOM parser (90% success rate), I'm going to have to award this basic answer the bounty if no one provides an example of some way to parse HTML. For anyone who cares, it's worth noting that parsing HTML with my ~10 regex rules is 10x faster than using PHP DOM. However, PHP DOM uses 25% less RAM due to all the extra matches arrays I had to create with the preg functions. – Xeoncross Mar 24 '11 at 15:39
  • Could you please provide us the DOM parser solution you coded? – Alp Mar 27 '11 at 19:58
2

I worked on a similar project a while back. It's not as complex as the Python script but it will do a good job. Check out the PHP Simple HTML DOM Parser:

http://simplehtmldom.sourceforge.net/
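A short sketch of what walking the tree with this library might look like; the length cutoff is just an assumed heuristic along the lines of what the comments below discuss:

// Requires simple_html_dom.php from the project linked above.
include 'simple_html_dom.php';

$html = file_get_html('http://blog.example.com/post');
$kept = array();
foreach ($html->find('p') as $node) {
    $text = trim($node->plaintext);
    if (strlen($text) > 100) {   // assumed heuristic: keep the longer text blocks
        $kept[] = $text;
    }
}
echo implode("\n\n", $kept);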

Cogicero
  • That is just a plain DOM parser which requires you to know the layout to find what you need. These are arbitrary HTML files I'm working with so their structure is often very different. – Xeoncross Mar 18 '11 at 19:34
  • @Xeon not necessarily: You could walk through each element and check its `textNode` value (or whatever the name of the text node is in simpleHTMLDOM). If it matches your search pattern, pull out the whole element including children. That is the only way I can think of... however there are alternatives to SimpleHTMLDOM, see http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662 – Pekka Mar 18 '11 at 19:40
  • Thanks Pekka. True, Xeoncross, you can walk through the entire document, fetch the element with its children, and even possibly run recursive parses. – Cogicero Mar 18 '11 at 19:44
  • Great list of alternatives from the SO link, Pekka. :) – Cogicero Mar 19 '11 at 20:17
  • Fetching an element and its children is not as easy as it sounds. I would be happy to award this as an answer though if a working example could be provided. – Xeoncross Mar 20 '11 at 20:35
  • Ok Xeoncross. Can we have a sample of the HTML file to create and test the working example? – Cogicero Mar 21 '11 at 12:46
  • @Cogicero Since I can't share the documents I will be using - just pick any random site and try to extract the body text from the layout. My documents often have very different DOM schemas, so any sites you want to try will be fine. Or if you outline what to look for when diving into the DOM I can try to build my own. – Xeoncross Mar 21 '11 at 15:51
1

Depending on your HTML structure, and whether you have IDs or classes in place, you can get a little more involved and use preg_match() to specifically grab any information between a certain start and end tag. This means you should know how to write regular expressions.
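For instance, a rough sketch assuming the page wraps its body text in a div with a known id (the id and the URL are hypothetical, and nested divs will break the simple non-greedy match):

// Assumes the target page uses <div id="content"> ... </div>; the id is hypothetical.
$html = file_get_contents('http://blog.example.com/post');
if (preg_match('~<div id="content">(.*?)</div>~is', $html, $matches)) {
    echo trim(strip_tags($matches[1]));
}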

You can also look into a browser emulation PHP class. I've done this for page scraping and it works well enough, depending on how well-formed the DOM is. I personally like SimpleBrowser:
http://www.simpletest.org/api/SimpleTest/WebTester/SimpleBrowser.html
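If I recall the SimpleTest API correctly, fetching a page with it looks roughly like this (treat the include path and method names as an assumption and check the linked docs):

// Sketch based on the SimpleTest docs linked above; verify the API before relying on it.
require_once 'simpletest/browser.php';

$browser = new SimpleBrowser();
$browser->get('http://blog.example.com/post');
$html = $browser->getContent();   // raw HTML, ready to hand to a DOM or regex parser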

Jamie Taniguchi
  • Browser written in PHP eh? Interesting idea. As for the regex, the problem I encountered was that, while searching for text is easy on some documents with a continuous layout, other documents that have a bunch of junk between sections are harder to catch. – Xeoncross Mar 23 '11 at 15:44
  • If you're parsing dynamically changing documents, any method you use will never catch all the data you want, so you'll have to keep tweaking on a per-document basis. If you can find a common denominator between all documents, such as an id of #content, that'll make things much easier. Using regular expressions with preg_match can get tedious to write and tweak. SimpleBrowser will let you find any element and traverse its children, making it a lot easier to tweak so long as you know the document's DOM. The more specific the better, but you can target an element even if it doesn't have a class or id. – Jamie Taniguchi Mar 23 '11 at 19:56
  • Yes, there are a lot of differences between documents. Here are a couple of similarities I discovered though. All documents have the main content before the comments (if comments exist). All content is generally a large portion of the parent div (when parsing with DOM). The start of the text content generally has a higher ratio of text to HTML (when parsing with regex), though there can be samples, video embeds, and code in between the sections of text. – Xeoncross Mar 24 '11 at 15:34
1

I have developed an HTML parser and filter PHP package that can be used for that purpose.

It consists of a set of classes that can be chained together to perform a series of parsing, filtering and transformation operations on HTML/XML code.

It was meant to deal with real-world pages, so it can handle malformed tags and data structures and preserve as much of the original document as possible.

One of the filter classes it comes with can do DTD validation. Another can discard insecure HTML tags and CSS to prevent XSS attacks. Another can simply extract all document links.

All those filter classes are optional. You can chain them together the way you want, if you need any at all.

So, to solve your problem, I do not think a ready-made solution for this already exists in PHP, but a special filter class could be developed for it. Take a look at the package. It is thoroughly documented.

If you need help, just check my profile and mail me, and I may even develop a filter that does exactly what you need, possibly inspired by solutions that exist for other languages.

mlemos