
I'm trying to scrape websites, modify all readable text (links, paragraphs, headlines, etc.) while keeping the HTML structure, and then render the 'new' page afterwards.

Basically, I want to scramble all readable text without destroying the design or functionality.

I tried it with Zend_Dom_Query, but how do I select just the text?

    $dom = new Zend_Dom_Query($html);
    $results = $dom->query( ??? );

Or is there another/better way of doing this?

Thanks a lot in advance.


Example

Input:

    <html>
      <head>....</head>
      <body>

        <div>
          <h1>Headline</h1>
          <h2>Subheadline</h2>
          <p>Some text</p>
          <a href="...">
            A Link
            <img src="..." />
            <span style="display:none">additional text</span>
          </a>
        </div>

      </body>
    </html>

Output:

    <html>
      <head>....</head>
      <body>

        <div>
          <h1>Hinladee</h1>
          <h2>Suialebdhne</h2>
          <p>Smoe txet</p>
          <a href="...">
            A Lnik
            <img src="..." />
            <span style="display:none">anodiaditl txet</span>
          </a>
        </div>

      </body>
    </html>
Mayko

  • Sorry if my description wasn't clear enough. The website layout and the HTML structure shouldn't be affected. It doesn't matter whether an element is visibility:hidden or display:none. I've updated my post with an example. – Mayko Jul 06 '11 at 21:27
  • @Mayko the deleted answer by Yoshi had the answer. Try `//text()` as the XPath expression to get all the DOMText nodes in the document. – Gordon Jul 06 '11 at 21:42

2 Answers


You can try this service: http://www.alchemyapi.com/api/text/. Its API provides easy-to-use mechanisms to extract page text and title information from any web page, so it's a simple option. Another option is http://www.alchemyapi.com/api/scrape/

silex

Solution:

Thanks to @Yoshi and @Gordon. This is exactly what I was looking for:

    $dom = new Zend_Dom_Query($html);
    $results = $dom->query("//text()");
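
For anyone who wants to see the whole round trip, here is a rough sketch of how the `//text()` selection could be used to scramble the text and write the page back out. It is not the code I ended up with: it swaps in PHP's built-in DOMDocument and DOMXPath instead of Zend_Dom_Query, and the function names `scrambleWord` and `scrambleHtml` are made up for illustration. It also assumes that text inside `<script>` and `<style>` should be left alone, and that the input is ASCII or declares its charset, since `loadHTML` may otherwise misread non-ASCII characters.

```php
<?php
// Hypothetical helper (not from the original post): shuffles the inner
// letters of a word and keeps the first and last letter in place.
function scrambleWord(string $word): string
{
    // Split into UTF-8 characters so multibyte letters stay intact.
    $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
    if (count($chars) < 4) {
        return $word; // fewer than two inner letters, nothing to reorder
    }
    $first = array_shift($chars);
    $last = array_pop($chars);
    shuffle($chars);
    return $first . implode('', $chars) . $last;
}

// Hypothetical helper: scrambles every text node and returns the new HTML.
function scrambleHtml(string $html): string
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid; collect parser warnings silently.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Assumption: script and style contents are code, not readable text.
    $nodes = $xpath->query('//text()[not(ancestor::script) and not(ancestor::style)]');
    foreach ($nodes as $node) {
        // Only runs of letters are scrambled; whitespace and punctuation stay put.
        $node->data = preg_replace_callback(
            '/\p{L}+/u',
            function ($m) {
                return scrambleWord($m[0]);
            },
            $node->data
        );
    }
    return $doc->saveHTML();
}
```

Calling `echo scrambleHtml($html);` should then print the page with the same tags and attributes but scrambled text. Note that `saveHTML()` serializes the parsed document, so the output may differ from the input in small ways (for example an added doctype or normalized tags).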
Mayko