Answer
Your web address has special characters in it that need to be URL encoded.
Explanation
First of all, the assumption that...
$og_entry_title
is correct and contains the page title, so no problem here
...is wrong.
This title:
<meta property="og:title" content="تقرير استخباري اميركي: القاعدة تسيطر على غرب العراق | أخبار | DW.COM | 28.11.2006" />
is not the same as this title:
<meta property="og:title" content="TOP STORIES | DW.COM" />
Secondly, most modern browsers are awesome enough to do URL encoding on the fly and still display the special characters in the address bar.
You can see the response headers from the web server for more information.
<?php
$url = 'http://www.dw.com/ar/تقرير-استخباري-اميركي-القاعدة-تسيطر-على-غرب-العراق/a-2251369';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_VERBOSE, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
$response = curl_exec($ch);
// Then, after your curl_exec call:
$header_size = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
echo '
header
------
'.substr($response, 0, $header_size);
The results show that the web server doesn't associate that URL with the page you expect:
header
------
HTTP/1.1 301 Moved Permanently
Server: Apache-Coyote/1.1
Location: /
Content-Length: 0
Accept-Ranges: bytes
X-Varnish: 99639238
Date: Thu, 16 Jun 2016 15:42:51 GMT
Connection: keep-alive
HTTP Response Code 301
is a notice to (permanently) redirect to another page. Location: /
indicates that you should just go to the home page. Sending visitors to the home page when the server doesn't know what to do with a request is a common, sloppy practice.
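To make that check repeatable, here is a minimal sketch that pulls the status code and Location value out of a raw header block like the one above. The function name summarize_redirect is hypothetical, and the sample header string is trimmed from the response shown earlier; it reports the first status line and the last Location header it sees.

```php
<?php
// Hypothetical helper (not part of the original code): extract the status
// code and the Location header from a raw HTTP header block.
function summarize_redirect($raw_headers)
{
    $status = null;
    $location = null;
    foreach (preg_split('/\r\n|\n|\r/', trim($raw_headers)) as $line) {
        if ($status === null && preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];               // first status line wins
        } elseif (stripos($line, 'Location:') === 0) {
            $location = trim(substr($line, strlen('Location:')));
        }
    }
    return array('status' => $status, 'location' => $location);
}

// Sample input: a shortened copy of the headers shown above.
$headers = "HTTP/1.1 301 Moved Permanently\r\nServer: Apache-Coyote/1.1\r\nLocation: /\r\nContent-Length: 0\r\n";
$info = summarize_redirect($headers);
if ($info['status'] >= 300 && $info['status'] < 400) {
    echo 'Redirected (' . $info['status'] . ') to ' . $info['location'] . PHP_EOL;
}
```

With cURL you would pass in substr($response, 0, $header_size), the same slice the snippet above echoes.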
Curl won't follow redirects by default, which is how we're able to examine the 301 response header. But file_get_contents
will follow redirects, which is why you're getting different content than you expect. (With possible exceptions: there is a bug report where some notice that it doesn't always follow redirects.)
Note that the home page does have content
in its og:description
:
<?php
echo file_get_contents('http://www.dw.com/ar/تقرير-استخباري-اميركي-القاعدة-تسيطر-على-غرب-العراق/a-2251369');
Results in this output:
...
<meta property="og:description" content="News and analysis of the top international and European topics Current affairs and background information on poltics, business, science, culture, globalization and the environment. " />
...
<meta property="og:title" content="TOP STORIES | DW.COM" />
...
Solution
First thing you need to do is rawurlencode
the web address:
$url = rawurlencode($url);
Then realize that rawurlencode
is poorly named because a valid URL will contain the scheme http://
or https://
and could also contain slashes to delimit parts. This is problematic because rawurlencode
will convert colons :
to %3A
and slashes /
to %2F
which makes for an invalid URL like http%3A%2F%2Fwww.dw.com%2Far%2F...
. It should have been named rawurlencode_parts_of_URL
, but they didn't ask me :) And to quote Phil Karlton in their defense:
There are only two hard things in Computer Science: cache invalidation and naming things.
So convert the slashes and colons back to their original form:
$url = str_replace('%3A',':',str_replace('%2F','/',$url));
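One caveat: encoding everything and then restoring only colons and slashes would still mangle other delimiters such as ?, & and = if the URL had a query string. As an alternative sketch (not what the code in this answer uses), you could percent-encode only the bytes outside printable ASCII and leave every delimiter alone. The function name encode_non_ascii and the example.com URL are hypothetical, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: percent-encode only bytes outside printable ASCII,
// leaving ':', '/', '?', '&', '=' and any existing %XX escapes untouched.
function encode_non_ascii($url)
{
    // No /u modifier, so the pattern matches one byte at a time and each
    // byte of a multi-byte UTF-8 character becomes its own %XX escape.
    return preg_replace_callback(
        '/[^\x21-\x7E]/',
        function ($m) {
            return rawurlencode($m[0]);
        },
        $url
    );
}

// Illustrative URL only; the Arabic word is taken from the address above.
echo encode_non_ascii('http://example.com/ar/تقرير/a-1?x=1&y=2'), PHP_EOL;
```

The trade-off is that this sketch never escapes reserved ASCII characters, so it only suits URLs whose delimiters are already where you want them.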
Finally, the last thing you need to do is send a header to your clients to let them know what character encoding to expect.
header("content-type: text/html; charset=utf-8");
Otherwise, your clients might be reading some gobbledygook that could look something like this:
تقرير استخباري اميركي: القاعدة تسيطر على غرب العراÙ
Final Product
<?php
// let's see error output on screen while in development
// remove these lines for production, and use log files only
error_reporting(-1);
ini_set('display_errors', 'On');
$url = 'http://www.dw.com/ar/تقرير-استخباري-اميركي-القاعدة-تسيطر-على-غرب-العراق/a-2251369';
// URL encode special chars
$url = rawurlencode($url);
// fix colons and slashes for valid URL
$url = str_replace('%3A',':',str_replace('%2F','/',$url));
// make request
$webpage = file_get_contents($url);
$og_entry_title = "";
$og_entry_content = "";
$doc = new DOMDocument;
$doc->loadHTML($webpage);
$meta_tags = $doc->getElementsByTagName('meta');
foreach ($meta_tags as $meta_tag) {
if ($meta_tag->getAttribute('property') == 'og:title') {
$og_entry_title = $meta_tag->getAttribute('content');
}
if ($meta_tag->getAttribute('property') == 'og:description') {
$og_entry_content = $meta_tag->getAttribute('content');
}
}
// set the character set for the client
header("content-type: text/html; charset=utf-8");
// print the results
echo
'$og_entry_title: ' . $og_entry_title
.PHP_EOL.
'$og_entry_content: ' . $og_entry_content;
Results in this output:
$og_entry_title: تقرير استخباري اميركي: القاعدة تسيطر على غرب العراق | أخبار | DW.COM | 28.11.2006
$og_entry_content:
Addendum
If you're looking at your error logs, and you really should always be looking at your error logs when developing, then you'll notice a litany of warnings:
Warning: DOMDocument::loadHTML(): htmlParseStartTag: misplaced <html> tag in Entity, line: 4 in ...
Warning: DOMDocument::loadHTML(): htmlParseStartTag: misplaced <html> tag in Entity, line: 5 in ...
Warning: DOMDocument::loadHTML(): htmlParseStartTag: misplaced <html> tag in Entity, line: 6 in ...
Warning: DOMDocument::loadHTML(): htmlParseStartTag: misplaced <html> tag in Entity, line: 7 in ...
Warning: DOMDocument::loadHTML(): ID topMetaInner already defined in Entity, line: 300 in ...
Warning: DOMDocument::loadHTML(): ID langSelectTrigger already defined in Entity, line: 315 in ...
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity, line: 546 in ...
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity, line: 546 in ...
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity, line: 548 in ...
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity, line: 548 in ...
This is because the page's markup is invalid HTML, and DOMDocument::loadHTML() reports every problem the parser finds. But this is a topic for a different question.
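For what it's worth, one common way to keep those warnings out of your output is to have libxml collect parse errors internally while the document loads. The sketch below is only one option, assuming you don't need to inspect the errors; the function name extract_og_title and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: read og:title from possibly malformed HTML without
// emitting a PHP warning for every parser complaint.
function extract_og_title($html)
{
    $previous = libxml_use_internal_errors(true); // collect errors instead of warning
    $doc = new DOMDocument;
    $doc->loadHTML($html);
    libxml_clear_errors();                        // discard the collected errors
    libxml_use_internal_errors($previous);        // restore the earlier setting

    foreach ($doc->getElementsByTagName('meta') as $meta_tag) {
        if ($meta_tag->getAttribute('property') === 'og:title') {
            return $meta_tag->getAttribute('content');
        }
    }
    return '';
}

// Deliberately sloppy sample markup: a duplicate id and a bare ampersand.
$html = '<html><head><meta property="og:title" content="TOP STORIES | DW.COM" /></head>'
      . '<body><div id="a"></div><div id="a">Tom & Jerry</div></body></html>';
echo extract_og_title($html), PHP_EOL;
```

If you do want to see what the parser objected to, call libxml_get_errors() before libxml_clear_errors().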