PHP - Finding the largest body of text on an external web page

Question

Is there a way, using PHP, to identify the largest body of text on an external website, extract it, and strip it of its tags?

The idea is that this technique could extract the information without adverts, sidebars, headers, footers and widgets. It would run as a cron job at off-peak times, so load time would not be an issue.

If you are specializing at one website, you can find what XPATH the interesting content is at - and parse it using some DOM parser. Extracting text with some sort of regex and strip_tags is doomed to failure. — MightyPork, Feb 02 '14 at 13:03
If you know the document is well formed, you can use things like the dom or xml parser to break it and find what you want, but otherwise, it's hopeless. — PatomaS, Feb 02 '14 at 13:04
Possible duplicate of http://stackoverflow.com/questions/3652657/what-algorithm-does-readability-use-for-extracting-text-from-urls — Prasanth, Feb 02 '14 at 13:22
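
For the single-site case mentioned in the first comment, a minimal sketch with PHP's built-in DOMDocument and DOMXPath might look like the following. The helper name extractByXPath, the sample markup, and the id "content" are assumptions made up for illustration; a real page will need its own XPath expression.

```php
<?php
// Hypothetical helper: returns the plain text of the first node matching
// $xpathQuery, or null if nothing matches.
function extractByXPath(string $html, string $xpathQuery): ?string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely well formed, so collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($xpathQuery);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    // textContent joins all descendant text nodes, so the tags are already gone.
    return trim(preg_replace('/\s+/', ' ', $nodes->item(0)->textContent));
}

// Assumed page layout: the article lives in <div id="content">.
$html = '<html><body><div id="sidebar">Ads</div>'
      . '<div id="content"><p>Main <b>article</b> text.</p></div></body></html>';
echo extractByXPath($html, '//div[@id="content"]'); // prints "Main article text."
```

Because the text comes from the parsed tree, no regex or strip_tags call is needed to remove the markup.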

Dan Cundy · Answer 1 · 2014-02-02T13:28:34.123

I don't have a complete answer or code snippet for you, but you should consider researching "screen/web scraping" to capture the data, then using "regular expressions" to count characters, strip tags and so on. Using both of these, you should be able to achieve your end goal.

Here is a start taken from www.jacobward.co.uk. This will allow you to capture a web page in a variable.

<?php
// Defining the basic cURL function
function curl($url) {
    $ch = curl_init();                              // Initialising cURL
    curl_setopt($ch, CURLOPT_URL, $url);            // Setting cURL's URL option with the $url passed into the function
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // Return the webpage data instead of printing it
    $data = curl_exec($ch);                         // Executing the request; $data is false if it fails
    curl_close($ch);                                // Closing cURL
    return $data;                                   // Returning the data from the function
}

// Scrape http://www.example.com and store the result in $scraped_website
$scraped_website = curl("http://www.example.com");

?>
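
Once the page is in a variable, one possible next step is sketched below. It is a simplified heuristic, not the algorithm used by Readability or any other library: it assumes the main content is marked up as <p> elements and picks the parent element whose direct paragraphs hold the most text. The helper name largestTextBlock and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: returns the text of the element whose direct <p>
// children contain the most text, or null if the page has no paragraphs.
function largestTextBlock(string $html): ?string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate malformed markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $best = null;
    $bestLength = 0;
    // Every element with at least one <p> as a direct child is a candidate.
    foreach ($xpath->query('//*[p]') as $candidate) {
        $paragraphs = [];
        foreach ($xpath->query('p', $candidate) as $p) {
            // textContent drops the tags; collapse runs of whitespace.
            $text = trim(preg_replace('/\s+/', ' ', $p->textContent));
            if ($text !== '') {
                $paragraphs[] = $text;
            }
        }
        // Score the candidate by the total length (in bytes) of its paragraphs.
        $length = array_sum(array_map('strlen', $paragraphs));
        if ($length > $bestLength) {
            $bestLength = $length;
            $best = implode("\n\n", $paragraphs);
        }
    }
    return $best;
}

// Assumed sample page: navigation, an article and a footer.
$html = '<html><body>'
      . '<div id="nav"><p>Home</p><p>About</p></div>'
      . '<div id="post"><p>First paragraph of the article.</p>'
      . '<p>Second paragraph with <a href="#">a link</a>.</p></div>'
      . '<div id="footer"><p>Copyright</p></div>'
      . '</body></html>';
echo largestTextBlock($html);
```

In practice the HTML would come from the curl() function above rather than a literal string; since curl_exec() returns false on failure, check the result before passing it on. Pages that do not wrap their text in <p> elements would need a different scoring rule.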

Web/Screen Scraping Wikipedia

Regular Expressions Webcheatsheet

I have tried this before, found it extremely complicated, and did not succeed. Good luck.
