43

Notice how Google News has sources on the bottom of each article excerpt.

The Guardian - ABC News - Reuters - Bloomberg

I'm trying to imitate that.

For example, upon submitting the URL http://www.washingtontimes.com/news/2010/dec/3/debt-panel-fails-test-vote/ I want to return The Washington Times

How is this possible with PHP?

Noob
  • Google news probably manages a look up table for known domains, and perhaps analyzes the HTML for unknown ones. A lookup table should be trivial to implement, so I've submitted an answer that does the latter. – Matthew Dec 03 '10 at 19:18

10 Answers

66

My answer is expanding on @AI W's answer of using the title of the page. Below is the code to accomplish what he said.

<?php

function get_title($url){
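  $str = @file_get_contents($url); // false if the page cannot be fetched
  if($str !== false && strlen($str)>0){
    $str = trim(preg_replace('/\s+/', ' ', $str)); // supports line breaks inside <title>
    // ignore case, allow attributes on <title>, and match non-greedily
    if(preg_match('/<title\b[^>]*>(.*?)<\/title>/i',$str,$title)){
      return trim($title[1]);
    }
  }
  return null; // no page or no <title> found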
}
//Example:
echo get_title("http://www.washingtontimes.com/");

?>

OUTPUT

Washington Times - Politics, Breaking News, US and World News

As you can see, it is not exactly what Google is using, so this leads me to believe that they get a URL's hostname and match it to their own list.

http://www.washingtontimes.com/ => The Washington Times
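
A hostname-to-name list like the one guessed at above could be sketched roughly as follows. The table contents and the function name `source_name` are assumptions made for illustration, not anything Google is known to use. Hosts missing from the table fall back to the bare hostname.

```php
<?php

// Hypothetical lookup table: hostname => display name.
$sources = [
    'washingtontimes.com' => 'The Washington Times',
    'reuters.com'         => 'Reuters',
];

// Hypothetical helper: maps a URL to a display name via its hostname.
function source_name(string $url, array $sources): ?string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null; // not an absolute URL
    }
    $host = strtolower($host);
    // Drop a leading "www." so both forms share one table entry.
    if (strpos($host, 'www.') === 0) {
        $host = substr($host, 4);
    }
    return $sources[$host] ?? $host;
}

echo source_name('http://www.washingtontimes.com/news/2010/dec/3/debt-panel-fails-test-vote/', $sources), "\n";
```

For unknown hosts, the fallback could instead call a title-scraping function such as the one above.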

rybo111
Jose Vega
  • Thanks, the code works but how would you get the same main title if say the link was http://www.washingtontimes.com/news/2010/dec/3/obama-makes-surprise-trip-afghanistan/ ? I think that's what AI W suggested – Noob Dec 03 '10 at 19:46
  • You would use parse_url to get the hostname and use `get_title($host);` instead. – TecBrat Feb 19 '12 at 21:17
  • Is there any other way than parsing HTML with regex? – Wissem May 08 '13 at 15:22
  • The pattern specified here needs to be improved, as this code won't work if any attributes are set on the title tag, e.g. https://www.facebook.com/ – Malkesh May 09 '13 at 06:52
  • The regex matching ought to be: `preg_match("/\<title\>(.*)\<\/title\>/i",$str,$title);` Some sites have the <title> in all caps, so the check should ignore case. – OldDrunkenSailor Aug 05 '13 at 04:09
  • @Jose, how would you account for HTTP 500 and other header errors? The function breaks if a page returns an error. Can you show how those conditions would be added to the if statement, maybe with an if/else? – Anagio Sep 03 '13 at 00:49
  • Remember: `file_get_contents()` can work locally, so it could be a security risk, e.g. `file_get_contents('./passwords.txt')`. This function may only return the contents of `<title>`, but it could be used maliciously. – rybo111 May 21 '15 at 21:56
  • Some websites don't allow `file_get_contents()` and produce an `Access Denied` error. I worked around it by setting this: `ini_set('user_agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11');` More info [here](http://www.cafewebmaster.com/php-get-page-title-function) – Sarthak Singhal Sep 20 '15 at 13:59
  • Make sure to make the regex non-greedy since some websites use more than one `<title>` tag: preg_match("/\<title>(.*?)\<\/title> – R.G. Nov 13 '15 at 09:57
  • When I try to use that function with this page[1] it loads the whole content and not only the title.. [zeit.de article](http://www.zeit.de/digital/datenschutz/2017-05/datenschutz-facebook-whatsapp-uebernahme-eu-kommission-strafe) – Suisse Jul 19 '17 at 00:15
  • This saved a startup. Cheers! – progyammer Nov 13 '17 at 18:23
36
$doc = new DOMDocument();
@$doc->loadHTMLFile('http://www.washingtontimes.com/news/2010/dec/3/debt-panel-fails-test-vote/');
$xpath = new DOMXPath($doc);
echo $xpath->query('//title')->item(0)->nodeValue."\n";

Output:

Debt commission falls short on test vote - Washington Times

Obviously you should also implement basic error handling.
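
One way that basic error handling might look is sketched below. The function name `title_from_html` is made up for this example, and it works on an HTML string so the fetching step can be checked separately. Instead of silencing warnings with `@`, it collects them with `libxml_use_internal_errors()`, and it returns null when there is nothing to report.

```php
<?php

// Hypothetical helper: returns the trimmed <title> text, or null if there is none.
function title_from_html(string $html): ?string
{
    if (trim($html) === '') {
        return null; // loadHTML() rejects empty input
    }
    // Collect libxml parse warnings instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }
    $nodes = (new DOMXPath($doc))->query('//title');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    $title = trim($nodes->item(0)->nodeValue);
    return $title === '' ? null : $title;
}

$html = '<html><head><title> Debt commission falls short on test vote - Washington Times </title></head><body></body></html>';
echo title_from_html($html) ?? 'No title found', "\n";
```

In practice `$html` would come from something like `file_get_contents($url)`, whose `false` return value on failure needs its own check before calling the function.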

Matthew
  • @Matthew When I changed the URL to http://facebook.com it is showing "Update Your Browser | Facebook". Is there any solution for this? – Idrizi.A Aug 15 '13 at 07:43
  • @Enve, without looking at it, I would assume it's because they are using a lot of Javascript to generate the page. The "Update Your Browser" is probably the default title. So you're probably out of luck in terms of any simple solution. – Matthew Aug 15 '13 at 15:19
  • Thanks! The accepted answer didn't work for me. It just returned localhost. This answer worked for me :) – Peter Cullen Jan 05 '19 at 08:54
6

Using get_meta_tags() on the domain's home page brings back something for The Washington Times which might need truncating but could be useful.

$b = "http://www.washingtontimes.com/news/2010/dec/3/debt-panel-fails-test-vote/" ;

$url = parse_url( $b ) ;

$tags = get_meta_tags( $url['scheme'].'://'.$url['host'] );
var_dump( $tags );

The output includes the description 'The Washington Times delivers breaking news and commentary on the issues that affect the future of our nation.'

Cups
5

You could fetch the contents of the URL and do a regular expression search for the content of the title element.

<?php
$urlContents = file_get_contents("http://example.com/");
preg_match("/<title>(.*)<\/title>/i", $urlContents, $matches);

print($matches[1] . "\n"); // "Example Web Page"
?>

Or, if you don't want to use a regular expression (to match something very near the top of the document), you could use a DOMDocument object:

<?php
$urlContents = file_get_contents("http://example.com/");

$dom = new DOMDocument();
@$dom->loadHTML($urlContents);

$title = $dom->getElementsByTagName('title');

print($title->item(0)->nodeValue . "\n"); // "Example Web Page"
?>

I leave it up to you to decide which method you like best.

James Sumners
  • Aaargh! Regexp... for... getting... data... from... HTML – thejh Dec 03 '10 at 19:06
  • @thejh: You don't know in general what kind of HTML pages are out there. I guess DOMDocument may have larger memory footprint than the regexp. (You may exceed PHP memory limit.) This is the case where it is maybe justifiable to use a regex or a simple strpos function. – MartyIX Aug 05 '15 at 10:23
4

I try to avoid regular expressions when they aren't necessary, so I have made a function that gets the website title with cURL and DOMDocument:

function website_title($url) {
   $ch = curl_init();
   curl_setopt($ch, CURLOPT_URL, $url);
   curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
   // some websites like Facebook need a user agent to be set.
   curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36');
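   $html = curl_exec($ch);
   curl_close($ch);
   if ($html === false || $html === '') {
      return null; // request failed or returned an empty body
   }

   $dom  = new DOMDocument;
   @$dom->loadHTML($html);

   // item() expects an integer index and returns null if there is no <title>
   $node = $dom->getElementsByTagName('title')->item(0);
   return $node ? trim($node->nodeValue) : null;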
}

echo website_title('https://www.facebook.com/');

The above returns the following: Welcome to Facebook - Log In, Sign Up or Learn More

TURTLE
4

PHP manual on cURL

<?php

$ch = curl_init("http://www.example.com/");
$fp = fopen("example_homepage.txt", "w");

curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, 0);

curl_exec($ch);
curl_close($ch);
fclose($fp);
?>

PHP manual on Perl regex matching

<?php
$subject = "abcdef";
$pattern = '/^def/';
preg_match($pattern, $subject, $matches, PREG_OFFSET_CAPTURE, 3);
print_r($matches);
?>

And putting those two together:

<?php 
// create curl resource 
$ch = curl_init(); 

// set url 
curl_setopt($ch, CURLOPT_URL, "example.com"); 

//return the transfer as a string 
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 

// $output contains the output string 
$output = curl_exec($ch); 

$pattern = '/[<]title[>]([^<]*)[<][\/]title[>]/i';

preg_match($pattern, $output, $matches);

print_r($matches);

// close curl resource to free up system resources 
curl_close($ch);      
?>

I can't promise this example will work since I don't have PHP here, but it should help you get started.

Novikov
  • A) Curl is overkill. B) Using regular expressions to parse HTML/XML is generally less reliable than using XPath queries or the DOM. – Matthew Dec 03 '10 at 19:24
  • For traversing a document definitely. However a title tag is simple to extract. Another concern is that XPath is for XML. Assuming that a webpage is well formed XML is a leap of faith, imho. I've only used DOMXPath once and I'm not sure how well it deals with a typical trainwreck of a webpage. – Novikov Dec 03 '10 at 19:31
  • `DOMDocument::loadHTML` will do an adequate job of converting HTML into XML, especially for finding a single tag. Using regexp to find something as simple as a title tag isn't even as trivial as you may think. For instance, yours will fail with `<title >` due to the space. (If the XPath fails, you could always fall back to a regexp.) – Matthew Dec 03 '10 at 19:46
  • Yes, this is true. `'/[<][ ]*title[ ]*[>]([^<]*)/i'` Anything that will break that will most likely break any DOM parser that wasn't designed for use in a web browser. – Novikov Dec 03 '10 at 19:49
  • Hmm.. while CURL works perfectly I agree that I can use something more simplified for retrieving a title. However I also want to avoid webpage errors. I'm in a dilemma.. – Noob Dec 03 '10 at 20:16
1

I wrote a function to handle it:

 function getURLTitle($url){

    $ch = curl_init();

    curl_setopt($ch, CURLOPT_URL, $url);

    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    $content = curl_exec($ch);

    $contentType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    $charset = '';

    if($contentType && preg_match('/\bcharset=(.+)\b/i', $contentType, $matches)){
        $charset = $matches[1];
    }

    curl_close($ch);

    // allow attributes and line breaks inside <title>, and stop at the first closing tag
    if(strlen($content) > 0 && preg_match('/\<title\b[^>]*\>(.*?)\<\/title\>/is', $content, $matches)){
        $title = $matches[1];

        if(!$charset && preg_match_all('/\<meta\b.*\>/i', $content, $matches)){
            //order:
            //http header content-type
            //meta http-equiv content-type
            //meta charset
            foreach($matches[0] as $match){ // $matches[0] holds the full <meta ...> tag strings
                $match = strtolower($match);
                if(strpos($match, 'content-type') !== false && preg_match('/\bcharset=(.+)\b/', $match, $ms)){
                    $charset = $ms[1];
                    break;
                }
            }

            if(!$charset){
                //meta charset=utf-8
                //meta charset='utf-8'
                foreach($matches[0] as $match){
                    $match = strtolower($match);
                    if(preg_match('/\bcharset=([\'"]?)([\w-]+)\1/', $match, $ms)){
                        $charset = $ms[2]; // group 1 is the optional quote, group 2 the charset name
                        break;
                    }
                }
            }
        }

        return $charset ? iconv($charset, 'utf-8', $title) : $title;
    }

    return $url;
}

It fetches the webpage content and tries to determine the document's character encoding from the following sources (from highest priority to lowest):

  1. An HTTP "charset" parameter in a "Content-Type" field.
  2. A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
  3. The charset attribute set on an element that designates an external resource.

(see http://www.w3.org/TR/html4/charset.html)

and then uses iconv to convert title to utf-8 encoding.
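
The same idea can be sketched in a more compact form. The function names `detect_charset` and `to_utf8` are made up for this example. As a simplification, steps 2 and 3 are collapsed into "the first `<meta>` tag that declares a charset", which is an assumption rather than the exact order listed above.

```php
<?php

// Hypothetical helper: returns the lowercased charset name, or null if none is declared.
function detect_charset(?string $contentType, string $html): ?string
{
    // 1. charset parameter of the HTTP Content-Type header
    if ($contentType !== null && preg_match('/\bcharset=["\']?([\w-]+)/i', $contentType, $m)) {
        return strtolower($m[1]);
    }
    // 2./3. first <meta> tag carrying a charset (http-equiv form or charset attribute)
    if (preg_match('/<meta\b[^>]*\bcharset=["\']?([\w-]+)/i', $html, $m)) {
        return strtolower($m[1]);
    }
    return null;
}

// Hypothetical helper: converts $text to UTF-8, leaving it untouched if conversion fails.
function to_utf8(string $text, ?string $charset): string
{
    if ($charset === null || $charset === 'utf-8') {
        return $text;
    }
    $converted = @iconv($charset, 'UTF-8//IGNORE', $text);
    return $converted === false ? $text : $converted;
}

$charset = detect_charset('text/html; charset=ISO-8859-1', '');
echo to_utf8("caf\xE9", $charset), "\n";
```

Returning the original text when `iconv` fails keeps the caller from ending up with `false` as a title.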

Vikdor
xianyu
1

Get title of website via link and convert title to utf-8 character encoding:

https://gist.github.com/kisexu/b64bc6ab787f302ae838

function getTitle($url)
{
    // get html via url
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    curl_close($ch);

    // get title
    preg_match('/(?<=<title>).+(?=<\/title>)/iU', $html, $match);
    $title = empty($match[0]) ? 'Untitled' : $match[0];
    $title = trim($title);

    // convert title to utf-8 character encoding
    if ($title != 'Untitled') {
        preg_match('/(?<=charset\=).+(?=\")/iU', $html, $match);
        if (!empty($match[0])) {
            $charset = str_replace('"', '', $match[0]);
            $charset = str_replace("'", '', $charset);
            $charset = strtolower( trim($charset) );
            if ($charset != 'utf-8') {
                $title = iconv($charset, 'utf-8', $title);
            }
        }
    }

    return $title;
}
Kise Xu
1

Alternatively you can use Simple Html Dom Parser:

<?php
require_once('simple_html_dom.php');

$html = file_get_html('http://www.washingtontimes.com/news/2010/dec/3/debt-panel-fails-test-vote/');

echo $html->find('title', 0)->innertext . "<br>\n";

echo $html->find('div[class=entry-content]', 0)->innertext;
István Ujj-Mészáros
  • Hmm I never tried HTML dom Parser. It sure looks simpler. Tho I'm not sure if it takes longer to process compared to other methods – Noob Dec 03 '10 at 19:57
  • @Noob It's much slower than DOMDocument (see [here](http://stackoverflow.com/questions/2735291/domdocument-class-unable-access-domnode/4230472#4230472)), but it runs without any PHP warning on this page (but I recommend [konforce's solution](http://stackoverflow.com/questions/4348912/get-title-of-website-via-link/4349042#4349042) with some error handling). – István Ujj-Mészáros Dec 04 '10 at 11:35
  • @IstvánUjj-Mészáros you can disable PHP warnings using `LIBXML_NOWARNING | LIBXML_NOERROR` options. – Arnas Kazlauskas Aug 12 '18 at 18:05
  • Example: `@$doc->loadHTMLFile($link, LIBXML_NOWARNING | LIBXML_NOERROR);` – Arnas Kazlauskas Aug 12 '18 at 18:06
0

Simple but it takes some time:

$tags = get_meta_tags('https://google.com');
if (array_key_exists('title', $tags)) {
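    # Only set when the page declares <meta name="title" ...>; the <title> element itself is not a meta tag
    # Do something with it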
    echo nl2br("Page Title: $tags[title]\n");
}

I haven't tried the answers proposed by others here to compare performance, but you should.