
The Firefox app on Android and Safari on the iPad can show only the main content of a page through "Reader Mode". How can I recognize only the main content of an HTML page with PHP?

I need to detect the main news content with PHP, the way Firefox or Safari does.

For example, I fetch a news page from bbcsite.com/news/123 with this code:

<?php
    $html = file_get_contents('http://bbcsite.com/news/123');
?>

Then I want to show only the main news text, without ads and other clutter, as Firefox and Safari do.

I found fivefilters.org. That site can extract the content!

Thank you.

Milad Ghiravani
  • Please clarify your specific problem or add additional details to highlight exactly what you need. As it's currently written, it’s hard to tell exactly what you're asking. – Kermit Jul 18 '13 at 20:30

5 Answers


A new PHP library named PHP Goose seems to do a very good job at this too. It's pretty easy to use and is Composer friendly.

Here's a usage example taken from the project's readme:

use Goose\Client as GooseClient;

$goose = new GooseClient();
$article = $goose->extractContent('http://url.to/article');

$title = $article->getTitle();
$metaDescription = $article->getMetaDescription();
$metaKeywords = $article->getMetaKeywords();
$canonicalLink = $article->getCanonicalLink();
$domain = $article->getDomain();
$tags = $article->getTags();
$links = $article->getLinks();
$movies = $article->getMovies();
$articleText = $article->getCleanedArticleText();
$entities = $article->getPopularWords();
$image = $article->getTopImage();
$allImages = $article->getAllImages();
jhuet
  • How can I get the image src from the image object? – Gowri May 11 '18 at 09:15
  • The manual isn't very complete, unfortunately. I did find an issue on the GitHub page that explains how to achieve this, though: `$image->getImageSrc();` https://github.com/scotteh/php-goose/issues/77#issue-296188474 – jhuet May 12 '18 at 18:42
  • I'm sorry, but I haven't used it. You'll have to refer to the documentation link I sent you. If you still have a problem after following it, you might want to ask the author of the library. – jhuet May 25 '18 at 15:21

Readability.php works pretty well, but I've found you get better results if you fetch the HTML with cURL and spoof the user agent. You can also have cURL follow redirects in case the URL you are trying to reach gives you the runaround. Here is what I'm using now, slightly modified from another post (PHP Curl following redirects). Hope you find it useful.

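function getData($url) {
    // Undo URL encoding and HTML-escaped ampersands left over from copied links.
    $url = str_replace('&amp;', '&', urldecode(trim($url)));
    $timeout = 5; // seconds
    // Temporary cookie jar, so cookies set along a redirect chain are kept.
    $cookie = tempnam(sys_get_temp_dir(), 'CURLCOOKIE');
    $ch = curl_init();
    // Send a browser-like user agent; some sites reject cURL's default.
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1');
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects...
    curl_setopt($ch, CURLOPT_MAXREDIRS, 10);          // ...but at most 10 of them
    curl_setopt($ch, CURLOPT_ENCODING, '');           // accept any supported compression
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
    $content = curl_exec($ch); // false on failure
    curl_close($ch);
    return $content;
}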

Implementation:

$url = 'http://';
//$html = file_get_contents($url);
$html = getData($url);

if (function_exists('tidy_parse_string')) {
    $tidy = tidy_parse_string($html, array(), 'UTF8');
    $tidy->cleanRepair();
    $html = $tidy->value;
}

$readability = new Readability($html, $url);

//...
Community

There is no such built-in function in PHP. I am afraid you will have to parse and analyse the HTML document yourself. You will probably need to use some XML parser; the SimpleXML library is a good candidate.

I am not familiar with the "Reader Mode" feature you are referring to, but a good starting point would probably be removing all <img> elements. The actual "cleaning" algorithm it uses is certainly not trivial, and it seems to be implemented in JavaScript as a call to a third-party, closed-source service.
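
As one possible starting point for the do-it-yourself parsing described above, here is a minimal sketch of a text-density heuristic. It is not the algorithm Firefox or Safari uses. extractMainText() is a hypothetical helper, not part of any library, and the 25-character threshold and the list of tags to discard are arbitrary guesses. The sketch uses DOMDocument rather than SimpleXML, on the assumption that real-world HTML is often not well-formed XML.

```php
<?php
// Hypothetical helper: guess the main content block of a page by finding
// the container whose <p> children hold the most non-link text.
function extractMainText(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);           // tolerate sloppy markup
    // The XML encoding hint makes libxml read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    // Drop elements that rarely carry article text (assumed list).
    $junk = $xpath->query('//script | //style | //img | //nav | //header | //footer | //aside | //form');
    foreach (iterator_to_array($junk) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    // Score every parent of a <p> by the plain (non-link) text it contains.
    $scores = [];
    $containers = [];
    foreach ($xpath->query('//p') as $p) {
        $total = strlen(trim($p->textContent));
        $linked = 0;
        foreach ($xpath->query('.//a', $p) as $a) {
            $linked += strlen(trim($a->textContent));
        }
        $score = $total - $linked;
        if ($score < 25) {
            continue;                           // too short: menus, ads, bylines
        }
        $key = $p->parentNode->getNodePath();
        $scores[$key] = ($scores[$key] ?? 0) + $score;
        $containers[$key] = $p->parentNode;
    }
    if (!$scores) {
        return '';
    }
    arsort($scores);                            // highest score first
    $best = $containers[array_key_first($scores)];

    // Return the paragraphs of the winning container as plain text.
    $paragraphs = [];
    foreach ($xpath->query('./p', $best) as $p) {
        $text = trim(preg_replace('/\s+/', ' ', $p->textContent));
        if ($text !== '') {
            $paragraphs[] = $text;
        }
    }
    return implode("\n\n", $paragraphs);
}

$sample = '<html><body><div id="menu"><p><a href="/">Home</a></p></div>'
    . '<div id="story"><p>The first paragraph of the story is long enough to count.</p>'
    . '<p>The second paragraph also carries plain article text for the reader.</p></div>'
    . '<div class="ad"><p>Buy now!</p></div></body></html>';
echo extractMainText($sample), "\n";
```

On the sample page this keeps the two story paragraphs and drops the menu link and the short ad. Real pages will need more tuning, which is why a maintained library is usually the better choice.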

RandomSeed

Hooray!!!

I found this source code:

1) Create Readability.php

2) Create JSLikeHTMLElement.php

3) Create index.php with this code:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
    <head>
        <title>!</title>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
    </head>
<body dir="rtl">
<?php
include_once 'Readability.php';


// get latest Medialens alert 
// (change this URL to whatever you'd like to test)
$url = 'http://';
$html = file_get_contents($url);

// Note: PHP Readability expects UTF-8 encoded content.
// If your content is not UTF-8 encoded, convert it 
// first before passing it to PHP Readability. 
// Both iconv() and mb_convert_encoding() can do this.

// If we've got Tidy, let's clean up input.
// This step is highly recommended - PHP's default HTML parser
// often doesn't do a great job and results in strange output.
if (function_exists('tidy_parse_string')) {
    $tidy = tidy_parse_string($html, array(), 'UTF8');
    $tidy->cleanRepair();
    $html = $tidy->value;
}

// give it to Readability
$readability = new Readability($html, $url);
// print debug output? 
// useful to compare against Arc90's original JS version - 
// simply click the bookmarklet with FireBug's console window open
$readability->debug = false;
// convert links to footnotes?
$readability->convertLinksToFootnotes = true;
// process it
$result = $readability->init();
// does it look like we found what we wanted?
if ($result) {
    echo "== Title =====================================\n";
    echo $readability->getTitle()->textContent, "\n\n";
    echo "== Body ======================================\n";
    $content = $readability->getContent()->innerHTML;
    // if we've got Tidy, let's clean it up for output
    if (function_exists('tidy_parse_string')) {
        $tidy = tidy_parse_string($content, array('indent'=>true, 'show-body-only' => true), 'UTF8');
        $tidy->cleanRepair();
        $content = $tidy->value;
    }
    echo $content;
} else {
    echo 'Looks like we couldn\'t find the content. :(';
}
?>
</body>
</html>

In the line $url = 'http://'; set the URL of the page you want to process.

Thank you ;)
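
The comments in the script above say that PHP Readability expects UTF-8 input and that iconv() can do the conversion. A minimal sketch of that step follows. toUtf8() is a hypothetical helper, and the ISO-8859-1 fallback is only an assumption; use the charset the site really serves.

```php
<?php
// Hypothetical helper: return $html as UTF-8, converting only when needed.
// $fallback is an assumed default for pages that declare no charset.
function toUtf8(string $html, string $fallback = 'ISO-8859-1'): string
{
    // preg_match() with the /u flag fails on invalid UTF-8,
    // so a successful match means the input is already UTF-8.
    if (preg_match('//u', $html) === 1) {
        return $html;
    }

    // Look for a declared charset, e.g. <meta charset="windows-1252">.
    $charset = $fallback;
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w-]+)/i', $html, $m)) {
        $charset = $m[1];
    }

    $converted = @iconv($charset, 'UTF-8', $html);
    if ($converted === false) {
        // Unknown or wrong declared charset: retry with the fallback guess.
        $converted = @iconv($fallback, 'UTF-8', $html);
    }
    // If both attempts fail, hand back the input unchanged.
    return $converted === false ? $html : $converted;
}
```

It would be called right after the page is fetched, for example $html = toUtf8(file_get_contents($url)); before the Tidy step.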

Milad Ghiravani

This displays the whole content. If you want more information, search Google for regular expressions and for how to get the value between tags in an HTML file. I will show you why with a demo :)

First off, when you use the file_get_contents function you get the file as HTML code, which a browser would display as a page. Look at this code:

$html = file_get_contents('http://coder-dz.com');
preg_match_all('/<li>(.*?)<\/li>/s', $html, $matches);
foreach ($matches[1] as $mytitle) {
    echo $mytitle . "<br/>";
}

So what did I do here? I fetched the content of my website, which runs WordPress. The titles are inside HTML li tags, so I used a regular expression to get the values between those tags.

I hope you get my point, because I'm not good at English. If you have any questions, feel free to ask me.
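
The regular expression above only matches a bare <li> tag, so an item written as <li class="..."> is missed. As a sketch of an alternative, the same values can be read through PHP's DOM extension. listItemTexts() is a hypothetical helper, and the sample markup is made up for illustration.

```php
<?php
// Hypothetical helper: return the text of every <li> in $html,
// including items whose tag carries attributes or nested markup.
function listItemTexts(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // silence warnings about sloppy markup
    // The XML encoding hint makes libxml read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    $texts = [];
    foreach ($doc->getElementsByTagName('li') as $li) {
        $texts[] = trim($li->textContent);
    }
    return $texts;
}

$html = '<ul><li>First title</li>'
    . '<li class="sticky"><a href="/2">Second title</a></li></ul>';
foreach (listItemTexts($html) as $title) {
    echo $title, "<br/>\n";
}
```

This still depends on knowing which tag holds the titles on a given site, so it shares the limitation of the regex approach: it does not find the main content automatically.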

Walid Naceri
  • Thank you! But Safari and Firefox recognize the main news content automatically. There's no defined standard tag for main content, so we can't use this code for all sites. – Milad Ghiravani Jul 18 '13 at 20:54
  • Yes, you are right, but you can use it for most websites. For example, say a website called news.com posts news and the titles are inside an HTML tag, and you only want the titles rather than the whole content. Then you can use this technique to get the values between the tags. Twitter, for example, has another way to get tweets, and so on. – Walid Naceri Jul 20 '13 at 15:32