How to get the Open Graph Protocol tags of a webpage with PHP?

Question

PHP has a simple function for reading the meta tags of a webpage (get_meta_tags), but it only works for meta tags with name attributes. However, the Open Graph Protocol is becoming more and more popular these days. What is the easiest way to get the Open Graph (og:) values from a webpage? For example:

<meta property="og:url" content=""> 
<meta property="og:title" content=""> 
<meta property="og:description" content=""> 
<meta property="og:type" content="">

The basic way I see is to fetch the page via cURL and parse it with regex. Any ideas?

score 50 · Answer 1 · answered Jan 30 '12 at 17:31

Really simple and well done:

Using https://github.com/scottmac/opengraph

$graph = OpenGraph::fetch('http://www.avessotv.com.br/bastidores-pantene-institute-experience-pg.html');
print_r($graph);

Will return

OpenGraph Object

(
    [_values:OpenGraph:private] => Array
        (
            [type] => article
            [video] => http://www.avessotv.com.br/player/flowplayer/flowplayer-3.2.7.swf?config=%7B%27clip%27%3A%7B%27url%27%3A%27http%3A%2F%2Fwww.avessotv.com.br%2Fmedia%2Fprogramas%2Fpantene.flv%27%7D%7D
            [image] => /wp-content/thumbnails/9025.jpg
            [site_name] => Programa Avesso - Bastidores
            [title] => Bastidores “Pantene Institute Experience” P&G
            [url] => http://www.avessotv.com.br/bastidores-pantene-institute-experience-pg.html
            [description] => Confira os bastidores do Pantene Institute Experience, da Procter &#038; Gamble. www.pantene.com.br Mais imagens:
        )

    [_position:OpenGraph:private] => 0
)

Github user scottmac seems to have abandoned his OpenGraph project, but there's a currently (early 2016) updated version, with fixes, here: https://github.com/AramZS/opengraph — JoLoCo, Apr 21 '16 at 02:35
I like this package, but it doesn't handle duplicated tags: it keeps only the last occurrence. For example, YouTube duplicates tags (I don't know why): ..., and the last one (the one this library returns) downloads a file. THAT SUCKS, YOUTUBE! – Miguel Peniche, Jun 22 '16 at 23:40
Does anybody know why this is not fetching og:site_name from some URLs like https://www.ajio.com/ajio-micro-print-spread-collar-shirt-/p/460292463_blue? — chithra, May 17 '19 at 05:52
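
The duplicated-tag behaviour described in the comments above (the last value wins) can be avoided by collecting every value per property. The following is only a minimal sketch using PHP's DOM extension, not the OpenGraph class from this answer; the function name collectOgTags and the example URLs are hypothetical.

```php
<?php
// Hypothetical helper, not part of the OpenGraph library discussed above.
// Collects every og:* value, so repeated properties are kept in document order.
function collectOgTags(string $html): array
{
    $previous = libxml_use_internal_errors(true); // keep parse warnings out of the output
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $tags = [];
    foreach ($xpath->query("//meta[starts-with(@property, 'og:')]") as $meta) {
        // Append instead of overwrite: each property maps to a list of values.
        $tags[$meta->getAttribute('property')][] = $meta->getAttribute('content');
    }
    return $tags;
}

$html = '<html><head>'
    . '<meta property="og:video:url" content="https://example.com/embed/1">'
    . '<meta property="og:video:url" content="https://example.com/file.swf">'
    . '<meta property="og:title" content="Example">'
    . '</head><body></body></html>';

$tags = collectOgTags($html);
echo $tags['og:video:url'][0], "\n"; // first occurrence rather than the last
```

With every value kept, the caller can decide which occurrence to use instead of silently getting the last one.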

score 31 · Accepted Answer · edited May 20 '15 at 14:26

When parsing data from HTML, you really shouldn't use regex. Take a look at the DOMXPath::query() method.

Now, the actual code could be:

[EDIT] A better XPath query was given by Stefan Gehrig, so the code can be shortened to:

libxml_use_internal_errors(true); // collect libxml parse warnings instead of silencing them with @
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors(); // discard the collected warnings
$xpath = new DOMXPath($doc);
$query = "//meta[starts-with(@property, 'og:')]";
$metas = $xpath->query($query);
$rmetas = array();
foreach ($metas as $meta) {
    $property = $meta->getAttribute('property');
    $content = $meta->getAttribute('content');
    $rmetas[$property] = $content;
}
var_dump($rmetas);

Instead of:

$doc = new DomDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$query = '//*/meta';
$metas = $xpath->query($query);
$rmetas = array();
foreach ($metas as $meta) {
    $property = $meta->getAttribute('property');
    $content = $meta->getAttribute('content');
    if(!empty($property) && preg_match('#^og:#', $property)) {
        $rmetas[$property] = $content;
    }
}
var_dump($rmetas);
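
Both snippets assume that $html already holds the page source. One way to obtain it is cURL, as the question suggests. This is a rough sketch under stated assumptions: the function names fetchHtml and parseOgTags are hypothetical, and the timeout values and user-agent string are arbitrary choices.

```php
<?php
// Hypothetical helper: downloads a page and returns its HTML, or null on failure.
function fetchHtml(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_CONNECTTIMEOUT => 5,     // seconds, arbitrary
        CURLOPT_TIMEOUT        => 10,    // seconds, arbitrary
        CURLOPT_USERAGENT      => 'og-fetch-example/1.0', // arbitrary value
    ]);
    $body = curl_exec($ch);

    return ($body === false || $body === '') ? null : $body;
}

// The same XPath approach as in the answer, wrapped in a function.
function parseOgTags(string $html): array
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $tags = [];
    foreach ($xpath->query("//meta[starts-with(@property, 'og:')]") as $meta) {
        $tags[$meta->getAttribute('property')] = $meta->getAttribute('content');
    }
    return $tags;
}

// Usage (needs network access, so it is left commented out):
// $html = fetchHtml('http://ogp.me/');
// if ($html !== null) {
//     var_dump(parseOgTags($html));
// }
```

Returning null for an empty or failed download lets the caller skip parsing, which matters because recent PHP versions reject an empty string passed to loadHTML().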

edited May 20 '15 at 14:26 by David Bengoa · answered Sep 17 '11 at 12:29 by Tom

Dude, we live in a not-imaginary world, where HTML is not proper everywhere. Check your code on http://www.imdb.com/title/tt0120737/ – zerkms Sep 17 '11 at 12:37
`@` is not a solution. Don't pretend there are no warnings; write code that doesn't emit them – zerkms Sep 17 '11 at 12:46
That was just for the sake of the example, but I guess it should be OK now? – Tom Sep 17 '11 at 12:49
Still working 5 years on, and so far the easiest, most straightforward working solution. The answer using the OpenGraph class is a little more complicated if you need to convert to JSON, since it returns an object. – Someone Special Feb 21 '17 at 15:41

score 4 · Answer 3 · zerkms · 2011-09-17T12:40:48.433

How about:

preg_match_all('~<\s*meta\s+property="(og:[^"]+)"\s+content="([^"]*)~i', $str, $matches);

So, yes: grab the page any way you can and parse it with regex.

answered Sep 17 '11 at 12:21 · edited Sep 17 '11 at 12:40 · zerkms

Thanks, but I hoped to find a method other than preg_match :) – Googlebot Sep 17 '11 at 12:28
@zerkms it's dirty, unreliable and way more unefficient than DomDocument when it comes to parsing HTML. – Tom Sep 17 '11 at 12:30
@Thomas Cantonnet: unefficient?? `preg_replace` is ~100 times faster on http://www.imdb.com/title/tt0120737/ than your solution and it **doesn't throw any warnings**, lol? – zerkms Sep 17 '11 at 12:36
@zerkms Well, try your code with this : < meta property="test" content="none" /> Does it work ? No. Does it work ? No. – Tom Sep 17 '11 at 12:38
@Thomas Cantonnet: fixed first, second case can be covered with second regex. 1. any comments on *performance*? you told it is less efficient 2. can you fix your code so it works **without warnings**? – zerkms Sep 17 '11 at 12:41
Dude, we live in a world where HTML is written in so many different ways that regex can't match everything. – Tom Sep 17 '11 at 12:41
@Thomas Cantonnet: get more cases please that 2 regex lines couldn't cover. – zerkms Sep 17 '11 at 12:43
@Thomas Cantonnet: `@` is not a solution. Fix the code, not pretend that there is no warnings – zerkms Sep 17 '11 at 12:44
And yeah, there are a million other cases I could give you. With other properties in between, \n's, etc, etc ... – Tom Sep 17 '11 at 12:48
Hey guys, you are both right in some sense. preg_match is fast but unreliable. DOM is reliable but slow and resource eater. I personally prefer preg_match but a tiny change in the structure can ruin all your world. – Googlebot Sep 17 '11 at 12:49
I beg to differ on the slow and resource eater. IMDB parsed in 0.00565 seconds, but hey, just trying to give you the most scalable solution. – Tom Sep 17 '11 at 12:51

score 3 · Answer 4 · answered Feb 16 '18 at 18:19

This function does the job without dependencies or DOM parsing:

function getOgTags($html)
{
    $pattern='/<\s*meta\s+property="og:([^"]+)"\s+content="([^"]*)/i';
    if(preg_match_all($pattern, $html, $out))
        return array_combine($out[1], $out[2]);
    return array();
}

test code:

$x=' <title>php - Using domDocument, and parsing info, I would like to get the &#39;href&#39; contents of an &#39;a&#39; tag - Stack Overflow</title>
        <link rel="shortcut icon" href="https://cdn.sstatic.net/Sites/stackoverflow/img/favicon.ico?v=4f32ecc8f43d">
        <link rel="apple-touch-icon image_src" href="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a">
        <link rel="search" type="application/opensearchdescription+xml" title="Stack Overflow" href="/opensearch.xml">
        <meta name="referrer" content="origin" />


        <meta property="og:type" content="website"/>
        <meta property="og:url" content="https://stackoverflow.com/questions/5278418/using-domdocument-and-parsing-info-i-would-like-to-get-the-href-contents-of"/>
        <meta property="og:image" itemprop="image primaryImageOfPage" content="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon@2.png?v=73d79a89bded" />
        <meta name="twitter:card" content="summary"/>
        <meta name="twitter:domain" content="stackoverflow.com"/>
        <meta name="twitter:title" property="og:title" itemprop="title name" content="Using domDocument, and parsing info, I would like to get the &#39;href&#39; contents of an &#39;a&#39; tag" />
        <meta name="twitter:description" property="og:description" itemprop="description" content="Possible Duplicate:
  Regular expression for grabbing the href attribute of an A element  
This displays the what is between the a tag, but I would like a way to get the href contents as well.

Is..." />';
echo '<pre>';
var_dump(getOgTags($x));

and you get:

array(3) {
  ["type"]=>
  string(7) "website"
  ["url"]=>
  string(119) "https://stackoverflow.com/questions/5278418/using-domdocument-and-parsing-info-i-would-like-to-get-the-href-contents-of"
  ["image"]=>
  string(85) "https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon@2.png?v=73d79a89bded"
}

It doesn't get the title or description so not really helpful. You assume the properties are always in the same place. — Panama Jack, Dec 01 '22 at 19:05
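
The limitation raised in the comment above comes from the pattern requiring property to follow meta directly and content to follow property directly. One possible workaround that stays regex-only is to match each meta tag first and then read the two attributes separately. This is a sketch, not the answer author's code: the name getOgTagsAnyOrder is hypothetical, and it assumes double-quoted attribute values that contain no `>` character.

```php
<?php
// Hypothetical variant of getOgTags(): find each <meta> tag first, then read
// property and content separately, so attribute order no longer matters.
function getOgTagsAnyOrder(string $html): array
{
    $tags = [];
    if (!preg_match_all('/<\s*meta\b[^>]*>/i', $html, $metaTags)) {
        return $tags;
    }
    foreach ($metaTags[0] as $tag) {
        // Both attributes must be present; other attributes may sit between them.
        if (preg_match('/\sproperty\s*=\s*"og:([^"]+)"/i', $tag, $p)
            && preg_match('/\scontent\s*=\s*"([^"]*)"/i', $tag, $c)) {
            $tags[$p[1]] = $c[1];
        }
    }
    return $tags;
}

$html = '<meta name="twitter:title" property="og:title" itemprop="title name" content="Example title" />'
      . '<meta content="website" property="og:type">';
var_dump(getOgTagsAnyOrder($html));
```

Single-quoted or unquoted attribute values are still missed, which is one reason the DOM-based answers tend to be more robust.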

score 2 · Answer 5 · answered Mar 01 '15 at 10:45

This method gives you a key/value array of Facebook Open Graph tags.

$url = "http://fbcpictures.in";
$site_html = file_get_contents($url);
$matches = null;
preg_match_all('~<\s*meta\s+property="(og:[^"]+)"\s+content="([^"]*)~i', $site_html, $matches);
$ogtags = array();
for ($i = 0; $i < count($matches[1]); $i++) {
    $ogtags[$matches[1][$i]] = $matches[2][$i];
}

Output of facebook open graph tags

score 1 · Answer 6 · answered Oct 31 '19 at 06:57

Here is what I am using to extract OG tags.

function get_og_tags($get_url = "", $ret = 0)
{

    if ($get_url != "") {
        $title = "";
        $description = "";
        $keywords = "";
        $og_title = "";
        $og_image = "";
        $og_url = "";
        $og_description = "";
        $full_link = "";
        $image_urls = array();
        $og_video_name = "";
        $youtube_video_url="";


        $ret_data = file_get_contents_curl($get_url);
        //$html = file_get_contents($get_url);

        $html = $ret_data['curlData'];
        $full_link = $ret_data['full_link'];

        $full_link = addhttp($full_link);


        //parsing begins here:
        $doc = new DOMDocument();
        @$doc->loadHTML($html);
        $nodes = $doc->getElementsByTagName('title');
        if ($nodes->length == 0) {
            $title = $get_url;
        } else {
            $title = $nodes->item(0)->nodeValue;
        }
        //get and display what you need:
        $metas = $doc->getElementsByTagName('meta');
        for ($i = 0; $i < $metas->length; $i++) {
            $meta = $metas->item($i);
            if ($meta->getAttribute('name') == 'description')
                $description = $meta->getAttribute('content');
            if ($meta->getAttribute('name') == 'keywords')
                $keywords = $meta->getAttribute('content');
        }
        // Open Graph tags are ordinary <meta> elements, so the $metas list is reused here.
        for ($i = 0; $i < $metas->length; $i++) {
            $meta = $metas->item($i);
            if ($meta->getAttribute('property') == 'og:title')
                $og_title = $meta->getAttribute('content');

            if ($meta->getAttribute('property') == 'og:url')
                $og_url = $meta->getAttribute('content');

            if ($meta->getAttribute('property') == 'og:image')
                $og_image = $meta->getAttribute('content');

            if ($meta->getAttribute('property') == 'og:description')
                $og_description = $meta->getAttribute('content');

            // for sociotube video share 
            if ($meta->getAttribute('property') == 'og:video_name')
                $og_video_name = $meta->getAttribute('content'); 

            // for sociotube youtube video share 
            if ($meta->getAttribute('property') == 'og:youtube_video_url')
                $youtube_video_url = $meta->getAttribute('content');    

        }

        //if no image found grab images from body
        if ($og_image != "") {
            $image_urls[] = $og_image;
        } else {
            $xpath = new DOMXPath($doc);
            $nodelist = $xpath->query("//img"); // find your image
            $imgCount = 0;

            for ($i = 0; $i < $nodelist->length; $i++) {
                $node = $nodelist->item($i); // current <img> node
                if (isset($node->attributes->getNamedItem('src')->nodeValue)) {
                    $src = $node->attributes->getNamedItem('src')->nodeValue;
                }
                if (isset($node->attributes->getNamedItem('src')->value)) {
                    $src = $node->attributes->getNamedItem('src')->value;
                }
                if (isset($src)) {
                    if (!preg_match('/blank.(.*)/i', $src) && filter_var($src, FILTER_VALIDATE_URL)) {
                        $image_urls[] = $src;
                        if ($imgCount == 10) break;
                        $imgCount++;
                    }
                }
            }
        }

        $page_title = ($og_title == "") ? $title : $og_title;
        if(!empty($og_video_name)){
            // for sociotube video share 
            $page_body = $og_video_name;
        }else{
            // for post share 
           $page_body = ($og_description == "") ? $description : $og_description; 
        }

        $output = array('title' => $page_title, 'images' => $image_urls, 'content' => $page_body, 'link' => $full_link,'video_name'=>$og_video_name,'youtube_video_url'=>$youtube_video_url);
        if ($ret == 1) {
            return $output; // return the array to the caller
        }
        echo json_encode($output); //output JSON data

        die;
    } else {
        $data = array('error' => "Url not found");
        if ($ret == 1) {
            return $data; // return the error array to the caller
        }
        echo json_encode($data);
        die;
    }
}

Usage of the function:

$url = "https://www.alectronics.com";
$tagsArray = get_og_tags($url, 1); // pass 1 so the array is returned; the default echoes JSON and exits
print_r($tagsArray);
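
The function above calls file_get_contents_curl() and addhttp(), neither of which is shown in the answer. The sketch below is only a guess at what they might look like. The return keys curlData and full_link are inferred from how get_og_tags() uses them, and everything else (options, timeout, scheme handling) is an assumption.

```php
<?php
// Assumed implementation: the answer does not show this helper. The return
// shape (curlData, full_link) is inferred from how get_og_tags() uses it.
function file_get_contents_curl(string $url): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 10,   // seconds, arbitrary
    ]);
    $data = curl_exec($ch);
    // URL after redirects; fall back to the requested URL if unavailable.
    $finalUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL) ?: $url;

    return [
        'curlData'  => ($data === false) ? '' : $data,
        'full_link' => $finalUrl,
    ];
}

// Assumed implementation: prefix a scheme when the URL has none.
function addhttp(string $url): string
{
    if (!preg_match('~^(?:f|ht)tps?://~i', $url)) {
        $url = 'http://' . $url;
    }
    return $url;
}

echo addhttp('www.example.com'), "\n";
```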

score 0 · Answer 7 · answered Sep 17 '11 at 12:31

The more XMLish way would be to use XPath:

$xml = simplexml_load_file('http://ogp.me/');
$xml->registerXPathNamespace('h', 'http://www.w3.org/1999/xhtml');
$result = array();
foreach ($xml->xpath('//h:meta[starts-with(@property, \'og:\')]') as $meta) {
    $result[(string)$meta['property']]  = (string)$meta['content'];
}
print_r($result);

Unfortunately the namespace registration is needed if the HTML document uses a namespace declaration in the <html>-tag.

answered Sep 17 '11 at 12:31 by Stefan Gehrig

Try your code with http://www.imdb.com/title/tt0120737/ ;-) Have gotten a really long list of warnings – zerkms Sep 17 '11 at 12:37
No need to start some sort of flamewar here... I actually would go for the `preg_match`-solution, but I just wanted to show a different and more elegant approach - which unfortunately does have some problems in the real world (most often due to the use of HTML entities or unescaped characters like `<`, `>`, `&` etc.) – Stefan Gehrig Sep 17 '11 at 12:48
Then, it is associated with unpredictable results; as a wide range of namespaces is used in html webpages. But I appreciate your way of thinking! – Googlebot Sep 17 '11 at 12:52
You're right. The approach is more suitable for a controlled environment in which you know your documents. The namespace issue could be solved by inspecting the declared namespaces, so in my opinion the bigger problem is that most HTML documents in the wild are far away from being standards compliant. – Stefan Gehrig Sep 17 '11 at 14:35
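
The two problems discussed in these comments (malformed markup and namespace declarations) can usually be sidestepped by letting DOMDocument's lenient HTML parser read the page first and then converting the tree with simplexml_import_dom(). This is a sketch rather than the answer author's method; the function name ogTagsViaSimpleXml is hypothetical.

```php
<?php
// Sketch: parse with DOMDocument's lenient HTML parser, then hand the tree to
// SimpleXML. The HTML parser does not assign namespaces to elements, so no
// namespace registration is needed for the XPath query.
function ogTagsViaSimpleXml(string $html): array
{
    libxml_use_internal_errors(true); // collect warnings from malformed markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xml = simplexml_import_dom($doc);
    $result = [];
    foreach ($xml->xpath("//meta[starts-with(@property, 'og:')]") as $meta) {
        $result[(string) $meta['property']] = (string) $meta['content'];
    }
    return $result;
}

// Deliberately sloppy input: xmlns declaration plus an unclosed <p>.
$html = '<html xmlns="http://www.w3.org/1999/xhtml"><head>'
      . '<meta property="og:title" content="Example">'
      . '</head><body><p>unclosed</body></html>';
print_r(ogTagsViaSimpleXml($html));
```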

score -1 · Answer 8 · answered Apr 26 '20 at 18:12

With the native PHP function get_meta_tags().

https://php.net/get_meta_tags

answered Apr 26 '20 at 18:12 by J. Doe

As stated in the question, "this only works for meta tags with name attributes", which Open Graph metatags don't have and so is completely useless for the required purpose. – WebSmithery Aug 12 '20 at 17:57
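
The point made in this comment can be checked with a small experiment. The sketch below writes a hypothetical two-tag page to a temporary file, since get_meta_tags() reads from a filename or URL rather than a string, and dumps what the function returns.

```php
<?php
// Hypothetical test page: one name-based tag and one Open Graph tag.
$html = '<html><head>'
      . '<meta name="description" content="Plain description">'
      . '<meta property="og:title" content="OG title">'
      . '</head><body></body></html>';

// get_meta_tags() needs a file or URL, so write the page to a temporary file.
$file = tempnam(sys_get_temp_dir(), 'og');
file_put_contents($file, $html);

$tags = get_meta_tags($file);
unlink($file);

var_dump($tags); // the og:title tag has no name attribute, so it should be absent
```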
