PHP HTML DOM: How to select all visible/readable text?

Question

I'm trying to scrape websites, modify all visible text (meaning links, paragraphs, headlines, etc.) while keeping the HTML structure, and then render the 'new' page afterwards.

Basically I want to scramble all readable text without destroying the design/functionality.

I tried it with Zend_Dom_Query, but how do I select just the text?

    $dom = new Zend_Dom_Query($html);
    $results = $dom->query( ??? );

Or is there another/better way of doing this?

Thanks a lot in advance.

Example

Input:

<html>
  <head>....</head>
  <body>

    <div>
      <h1>Headline</h1>
      <h2>Subheadline</h2>
      <p>Some text</p>
      <a href="...">
        A Link 
        <img src="..." />
        <span style="display:none">additional text</span>
      </a>  
    </div>

  </body>
</html>

Output:

<html>
  <head>....</head>
  <body>

    <div>
      <h1>Hinladee</h1>
      <h2>Suialebdhne</h2>
      <p>Smoe txet</p>
      <a href="...">
        A Lnik 
        <img src="..." />
        <span style="display:none">anodiaditl txet</span>
      </a>  
    </div>

  </body>
</html>

Sorry if my description wasn't clear enough. The website layout and the HTML structure shouldn't be affected. It doesn't matter whether an element is visibility:hidden or display:none. I've updated my post with an example. – Mayko, Jul 06 '11 at 21:27
@Mayko the deleted answer by Yoshi had it right. Try the XPath expression `//text()` to get all the DOMText nodes in the document. – Gordon, Jul 06 '11 at 21:42

score 1 · Answer 1 · answered Jul 06 '11 at 07:38

You can try this service: http://www.alchemyapi.com/api/text/ - its API provides easy-to-use mechanisms to extract page text and title information from any web page. That is the simple way. Another option is http://www.alchemyapi.com/api/scrape/

silex

Sorry mate, I don't want to rely on an API. – Mayko Jul 06 '11 at 21:43
No problem, look at this similar "standalone" project: https://github.com/feelinglucky/php-readability – silex Jul 07 '11 at 06:41

score 0 · Accepted Answer · answered Jul 07 '11 at 22:38

Solution:

Thanks to @Yoshi and @Gordon. This is exactly what I was looking for:

    $dom = new Zend_Dom_Query($html);
    $results = $dom->query("//text()");
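
For readers without Zend Framework, the same `//text()` idea can be sketched with PHP's bundled DOM extension (`DOMDocument` and `DOMXPath`). This is not the code used in the answer above, only a minimal illustration under some assumptions: the function names `scrambleWord` and `scrambleHtml` are made up for this example, text inside `<script>` and `<style>` is skipped so the page keeps working, the input is assumed to be ASCII or to declare its charset (otherwise `loadHTML()` may misread it), and the scramble keeps each word's first and last letter, as in the example output in the question.

```php
<?php
// Hypothetical helper: shuffle the inner letters of one word,
// keeping the first and last letter in place.
function scrambleWord(string $word): string
{
    // Split into UTF-8 characters so multibyte letters stay intact.
    $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
    if (count($chars) < 4) {
        return $word; // fewer than two inner letters: nothing to shuffle
    }
    $inner = array_slice($chars, 1, -1);
    shuffle($inner);
    return $chars[0] . implode('', $inner) . $chars[count($chars) - 1];
}

// Hypothetical helper: scramble every text node of an HTML document
// while leaving tags and attributes untouched.
function scrambleHtml(string $html): string
{
    $doc = new DOMDocument();
    // Scraped markup is rarely valid; keep parser warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // All text nodes, except those inside <script> or <style>.
    $query = '//text()[not(ancestor::script) and not(ancestor::style)]';
    foreach ($xpath->query($query) as $node) {
        // Replace each run of letters with its scrambled version.
        $node->nodeValue = preg_replace_callback(
            '/\p{L}+/u',
            function ($m) {
                return scrambleWord($m[0]);
            },
            $node->nodeValue
        );
    }
    return $doc->saveHTML();
}
```

Calling `echo scrambleHtml($html);` would then print the page with the same tags and attributes but shuffled words. Because the plain `//text()` query does not look at CSS, hidden elements such as the `display:none` span are scrambled too, which matches the requirement stated in the comments.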

Mayko
