1

I have this simple code to get the title of any page:

<?php
    $doc = new DOMDocument();
    @$doc->loadHTMLFile('http://www.facebook.com');
    $xpath = new DOMXPath($doc);
    echo $xpath->query('//title')->item(0)->nodeValue."\n";
?>

It works fine on every page I have tried except Facebook.

On Facebook it does not show "Welcome to Facebook - Log In, Sign Up or Learn More"; instead it shows "Update Your Browser | Facebook".

I think the problem is the user agent. Is there a way to change the user agent, or is there another solution for this?

Idrizi.A

3 Answers

3

You can set the user agent through PHP's `user_agent` ini setting, without needing cURL. Call `ini_set()` at runtime, as in the lines below, before you load the DOMDocument:

$agent = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";
ini_set('user_agent', $agent);

And then your code:

$doc = new DOMDocument();
@$doc->loadHTMLFile('http://www.facebook.com');
$xpath = new DOMXPath($doc);
echo $xpath->query('//title')->item(0)->nodeValue."\n";
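
As an alternative to changing the ini setting for the whole script, the user agent can also be attached to the stream context that libxml uses, via `libxml_set_streams_context()`. The following is only a minimal sketch: the helper name `loadWithUserAgent` and the agent string in the usage comment are illustrative assumptions, not part of the answer above.

```php
<?php
// Hypothetical helper: load an HTML document with a custom User-Agent
// without changing the user_agent ini setting.
function loadWithUserAgent(string $url, string $agent): DOMDocument
{
    // The 'http' options only affect http(s) URLs; other wrappers ignore them.
    $context = stream_context_create(['http' => ['user_agent' => $agent]]);
    // Note: this context stays in effect for later libxml loads as well.
    libxml_set_streams_context($context);

    $doc = new DOMDocument();
    // Collect markup warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTMLFile($url);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// Usage (performs a network request, so it is left commented out here):
// $doc = loadWithUserAgent('http://www.facebook.com', 'Mozilla/5.0 (compatible; ExampleBot/1.0)');
```

Whether a given site accepts the chosen agent string is up to that site, so the returned title may still differ from what a browser shows.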
Kamil Khan
2

DOMDocument has no direct method for changing the user agent. You can use cURL to retrieve the HTML and then pass it to DOMDocument. Here is how to retrieve the data with cURL:

$url = 'https://www.facebook.com'; // page to fetch
$timeout = 5;                      // connection timeout in seconds
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);

You can then pass the result to DOMDocument as shown below.

$dom = new DOMDocument();
@$dom->loadHTML($data); // suppress warnings from malformed markup
$xpath = new DOMXPath($dom);
echo $xpath->query('//title')->item(0)->nodeValue."\n";
DevZer0
  • How can I get only the title. `$data` is showing the whole page? – Idrizi.A Aug 15 '13 at 08:12
  • I tried this. It is working good on other pages but again not in facebook. It is showing this error `Notice: Trying to get property of non-object in C:\localhost\htdocs\title\index.php on line 17` – Idrizi.A Aug 15 '13 at 08:28
  • The last one `echo $xpath->query('//title')->item(0)->nodeValue."\n";` – Idrizi.A Aug 15 '13 at 08:29
  • check the source facebook has no `title` tag @Enve – DevZer0 Aug 15 '13 at 08:36
  • I checked from `view-source:https://www.facebook.com` in Google Chrome. There is a title tag `Welcome to Facebook - Log In, Sign Up or Learn More` – Idrizi.A Aug 15 '13 at 08:39
  • that `title` is not a root element it's inside `noscript`. change your query to something like `echo $xpath->query('//*/title')->item(0)->nodeValue."\n";` – DevZer0 Aug 15 '13 at 08:41
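
The notice reported in the comments appears when the query matches nothing, so `item(0)` returns `null` and reading `nodeValue` on it fails. Note that `//title` already matches a `title` element at any depth. A minimal sketch of a guarded lookup follows; the helper name `extractTitle` and the sample markup are assumptions made for illustration.

```php
<?php
// Hypothetical helper: return the text of the first <title>, or null if none is found.
function extractTitle(string $html): ?string
{
    if ($html === '') {
        return null; // nothing to parse (e.g. the request returned no body)
    }

    $dom = new DOMDocument();
    // Collect markup warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('//title');
    if ($nodes === false || $nodes->length === 0) {
        return null; // no <title> anywhere in the parsed document
    }

    return trim($nodes->item(0)->nodeValue);
}

$title = extractTitle('<html><head><title> Example page </title></head></html>');
echo ($title ?? 'No title found') . "\n";
```

Returning `null` lets the caller decide how to report a missing title instead of triggering a notice.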
0

Facebook probably doesn't want people to scrape their site. What you can do instead is fetch the page with cURL, providing a legitimate user agent (perhaps your own, `$_SERVER['HTTP_USER_AGENT']`), and then pass the result to DOMDocument.

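$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.facebook.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
$html = curl_exec($ch);
curl_close($ch);

$dom = new DOMDocument();
@$dom->loadHTML($html); // suppress warnings from malformed markup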
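
One caveat with forwarding `$_SERVER['HTTP_USER_AGENT']` is that it is not set when the script runs from the command line or when the client sends no such header. A rough sketch of a fallback is shown here; the helper names `pickUserAgent` and `fetchHtml` and the fallback agent string are hypothetical.

```php
<?php
// Hypothetical helper: forward the visitor's User-Agent when there is one,
// otherwise use a caller-supplied fallback (e.g. when run from the CLI).
function pickUserAgent(array $server, string $fallback): string
{
    $agent = $server['HTTP_USER_AGENT'] ?? '';
    return $agent !== '' ? $agent : $fallback;
}

// Hypothetical helper: fetch a page with cURL, returning null on failure.
function fetchHtml(string $url, string $agent): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    $html = curl_exec($ch);

    return is_string($html) && $html !== '' ? $html : null;
}

// Usage (performs a network request, so it is left commented out here):
// $agent = pickUserAgent($_SERVER, 'Mozilla/5.0 (compatible; ExampleBot/1.0)');
// $html  = fetchHtml('https://www.facebook.com', $agent);
```

Checking the result for `null` before handing it to `loadHTML()` avoids parsing an empty response.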
silkfire