28

PHP has a simple function for reading the meta tags of a webpage (get_meta_tags), but it only works for meta tags with name attributes. Open Graph Protocol tags, which are becoming more and more popular, use a property attribute instead. What is the easiest way to get the Open Graph values from a webpage? For example:

<meta property="og:url" content=""> 
<meta property="og:title" content=""> 
<meta property="og:description" content=""> 
<meta property="og:type" content=""> 

The basic way I see is to get the page via cURL and parse it with regex. Any idea?

Googlebot

8 Answers

50

Really simple and well done:

Using https://github.com/scottmac/opengraph

$graph = OpenGraph::fetch('http://www.avessotv.com.br/bastidores-pantene-institute-experience-pg.html');
print_r($graph);

Will return

OpenGraph Object

(
    [_values:OpenGraph:private] => Array
        (
            [type] => article
            [video] => http://www.avessotv.com.br/player/flowplayer/flowplayer-3.2.7.swf?config=%7B%27clip%27%3A%7B%27url%27%3A%27http%3A%2F%2Fwww.avessotv.com.br%2Fmedia%2Fprogramas%2Fpantene.flv%27%7D%7D
            [image] => /wp-content/thumbnails/9025.jpg
            [site_name] => Programa Avesso - Bastidores
            [title] => Bastidores “Pantene Institute Experience” P&G
            [url] => http://www.avessotv.com.br/bastidores-pantene-institute-experience-pg.html
            [description] => Confira os bastidores do Pantene Institute Experience, da Procter &#038; Gamble. www.pantene.com.br Mais imagens:
        )

    [_position:OpenGraph:private] => 0
)
Guilherme Viebig
  • Github user scottmac seems to have abandoned his OpenGraph project, but there's a currently (early 2016) updated version, with fixes, here: https://github.com/AramZS/opengraph – JoLoCo Apr 21 '16 at 02:35
  • I like this package, but it doesn't handle duplicated tags: it keeps only the last one. For example, YouTube duplicates tags (I don't know why): ..., and the last one (the one this plugin returns) downloads a file. THAT SUCKS YOUTUBE! – Miguel Peniche Jun 22 '16 at 23:40
  • Does anybody know why this is not fetching og:site_name from some URLs like https://www.ajio.com/ajio-micro-print-spread-collar-shirt-/p/460292463_blue? – chithra May 17 '19 at 05:52
31

When parsing data from HTML, you really shouldn't use regex. Take a look at the DOMXPath::query() method.

[EDIT] Stefan Gehrig suggested a better XPath query, so the code can be shortened to:

libxml_use_internal_errors(true); // collect libxml parse warnings instead of hiding them with @
$doc = new DOMDocument(); // $html holds the page source, fetched beforehand
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$query = '//*/meta[starts-with(@property, \'og:\')]';
$metas = $xpath->query($query);
$rmetas = array();
foreach ($metas as $meta) {
    $property = $meta->getAttribute('property');
    $content = $meta->getAttribute('content');
    $rmetas[$property] = $content;
}
var_dump($rmetas);

Instead of:

$doc = new DomDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$query = '//*/meta';
$metas = $xpath->query($query);
$rmetas = array();
foreach ($metas as $meta) {
    $property = $meta->getAttribute('property');
    $content = $meta->getAttribute('content');
    if(!empty($property) && preg_match('#^og:#', $property)) {
        $rmetas[$property] = $content;
    }
}
var_dump($rmetas);
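
Both versions above overwrite `$rmetas[$property]` on every match, so a page that repeats a property (several og:image tags, for example) keeps only the last value. A minimal sketch that keeps every value follows; the function name og_tags_grouped() is made up for this example, and $html is assumed to already hold the page source.

```php
<?php
// Sketch: collect Open Graph tags, keeping every value of a repeated property.
// og_tags_grouped() is a hypothetical name, not part of PHP or of the answer above.
function og_tags_grouped(string $html): array
{
    // Collect libxml parse warnings internally, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $tags = array();
    foreach ($xpath->query("//meta[starts-with(@property, 'og:')]") as $meta) {
        // Append instead of overwrite, so duplicates survive in document order.
        $tags[$meta->getAttribute('property')][] = $meta->getAttribute('content');
    }
    return $tags;
}

$html = '<html><head>'
    . '<meta property="og:title" content="Example">'
    . '<meta property="og:image" content="https://example.com/a.jpg">'
    . '<meta property="og:image" content="https://example.com/b.jpg">'
    . '</head><body></body></html>';

$tags = og_tags_grouped($html);
print_r($tags['og:image']); // both image URLs, in document order
```

Every property maps to a list here, even when it occurs once, so callers read `$tags['og:title'][0]` rather than `$tags['og:title']`.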
David Bengoa
Tom
  • Dude, we live in the real world, where HTML is not well-formed everywhere. Check your code on http://www.imdb.com/title/tt0120737/ – zerkms Sep 17 '11 at 12:37
  • `@` is not a solution. Don't pretend there are no warnings; write code that doesn't emit them – zerkms Sep 17 '11 at 12:46
  • It was just for the sake of the example, but yeah, I guess it should be OK now? – Tom Sep 17 '11 at 12:49
  • Still working 5 years on, and so far the easiest and most straightforward working solution. The answer below using the opengraph class is a little more complicated if you need to convert to JSON, since it returns an object. – Someone Special Feb 21 '17 at 15:41
4

How about:

preg_match_all('~<\s*meta\s+property="(og:[^"]+)"\s+content="([^"]*)~i', $str, $matches);

So yes: grab the page any way you can and parse it with a regex.

zerkms
  • Thanks, but I hoped to find a method other than preg_match :) – Googlebot Sep 17 '11 at 12:28
  • @zerkms it's dirty, unreliable and way more inefficient than DomDocument when it comes to parsing HTML. – Tom Sep 17 '11 at 12:30
  • @Thomas Cantonnet: inefficient?? `preg_replace` is ~100 times faster on http://www.imdb.com/title/tt0120737/ than your solution and it **doesn't throw any warnings**, lol? – zerkms Sep 17 '11 at 12:36
  • @zerkms Well, try your code with this : < meta property="test" content="none" /> Does it work ? No. Does it work ? No. – Tom Sep 17 '11 at 12:38
  • @Thomas Cantonnet: fixed the first; the second case can be covered with a second regex. 1. Any comments on *performance*? You said it is less efficient. 2. Can you fix your code so it works **without warnings**? – zerkms Sep 17 '11 at 12:41
  • Dude, we live in a world where HTML is written in so many different ways that regex can't match everything. – Tom Sep 17 '11 at 12:41
  • @Thomas Cantonnet: please give more cases that two regex lines couldn't cover. – zerkms Sep 17 '11 at 12:43
  • @Thomas Cantonnet: `@` is not a solution. Fix the code instead of pretending that there are no warnings – zerkms Sep 17 '11 at 12:44
  • And yeah, there are a million other cases I could give you. With other properties in between, \n's, etc, etc ... – Tom Sep 17 '11 at 12:48
  • Hey guys, you are both right in some sense. preg_match is fast but unreliable. DOM is reliable but slow and resource-hungry. I personally prefer preg_match, but a tiny change in the structure can ruin everything. – Googlebot Sep 17 '11 at 12:49
  • I beg to differ on the slow and resource eater. IMDB parsed in 0.00565 seconds, but hey, just trying to give you the most scalable solution. – Tom Sep 17 '11 at 12:51
3

This function does the job without any dependencies or DOM parsing:

function getOgTags($html)
{
    $pattern='/<\s*meta\s+property="og:([^"]+)"\s+content="([^"]*)/i';
    if(preg_match_all($pattern, $html, $out))
        return array_combine($out[1], $out[2]);
    return array();
}

Test code:

$x=' <title>php - Using domDocument, and parsing info, I would like to get the &#39;href&#39; contents of an &#39;a&#39; tag - Stack Overflow</title>
        <link rel="shortcut icon" href="https://cdn.sstatic.net/Sites/stackoverflow/img/favicon.ico?v=4f32ecc8f43d">
        <link rel="apple-touch-icon image_src" href="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a">
        <link rel="search" type="application/opensearchdescription+xml" title="Stack Overflow" href="/opensearch.xml">
        <meta name="referrer" content="origin" />


        <meta property="og:type" content="website"/>
        <meta property="og:url" content="https://stackoverflow.com/questions/5278418/using-domdocument-and-parsing-info-i-would-like-to-get-the-href-contents-of"/>
        <meta property="og:image" itemprop="image primaryImageOfPage" content="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon@2.png?v=73d79a89bded" />
        <meta name="twitter:card" content="summary"/>
        <meta name="twitter:domain" content="stackoverflow.com"/>
        <meta name="twitter:title" property="og:title" itemprop="title name" content="Using domDocument, and parsing info, I would like to get the &#39;href&#39; contents of an &#39;a&#39; tag" />
        <meta name="twitter:description" property="og:description" itemprop="description" content="Possible Duplicate:
  Regular expression for grabbing the href attribute of an A element  
This displays the what is between the a tag, but I would like a way to get the href contents as well.

Is..." />';
echo '<pre>';
var_dump(getOgTags($x));

and you get:

array(3) {
  ["type"]=>
  string(7) "website"
  ["url"]=>
  string(119) "https://stackoverflow.com/questions/5278418/using-domdocument-and-parsing-info-i-would-like-to-get-the-href-contents-of"
  ["image"]=>
  string(85) "https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon@2.png?v=73d79a89bded"
}
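
The output above has only three entries because the pattern requires property to be the first attribute and content to follow it directly; og:title and og:description in the test input have a name attribute first, so they are skipped. A rough sketch of an order-independent variant is below. It is still regex-based, the name getOgTagsAnyOrder() is hypothetical, and it assumes double-quoted attribute values with no `>` inside them.

```php
<?php
// Sketch: match each <meta ...> tag first, then read its attributes separately,
// so the order of property/content inside the tag does not matter.
// Assumptions: attribute values are double-quoted and contain no ">" character.
function getOgTagsAnyOrder(string $html): array
{
    $tags = array();
    // Step 1: grab every complete <meta ...> tag.
    if (!preg_match_all('/<\s*meta\b[^>]*>/i', $html, $metas)) {
        return $tags;
    }
    foreach ($metas[0] as $meta) {
        // Step 2: look for both attributes anywhere inside the tag.
        if (preg_match('/\bproperty\s*=\s*"og:([^"]+)"/i', $meta, $property)
            && preg_match('/\bcontent\s*=\s*"([^"]*)"/i', $meta, $content)) {
            $tags[$property[1]] = $content[1];
        }
    }
    return $tags;
}

$sample = '<meta name="twitter:title" property="og:title" content="Example title" />'
        . '<meta content="website" property="og:type">';
$found = getOgTagsAnyOrder($sample);
var_dump($found);
```

As with the original function, the keys are returned without the og: prefix, and a repeated property keeps only its last value.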
MSS
  • It doesn't get the title or description so not really helpful. You assume the properties are always in the same place. – Panama Jack Dec 01 '22 at 19:05
2

With this method you get a key/value array of the Facebook Open Graph tags.

$url = "http://fbcpictures.in";
$site_html = file_get_contents($url);
$matches = null;
preg_match_all('~<\s*meta\s+property="(og:[^"]+)"\s+content="([^"]*)~i', $site_html, $matches);
$ogtags = array();
for ($i = 0; $i < count($matches[1]); $i++)
{
    $ogtags[$matches[1][$i]] = $matches[2][$i];
}

Output of facebook open graph tags

Bhaskar Bhatt
1

Here is what I am using to extract OG tags.

function get_og_tags($get_url = "", $ret = 0)
{

    if ($get_url != "") {
        $title = "";
        $description = "";
        $keywords = "";
        $og_title = "";
        $og_image = "";
        $og_url = "";
        $og_description = "";
        $full_link = "";
        $image_urls = array();
        $og_video_name = "";
        $youtube_video_url="";

        $ret_data = file_get_contents_curl($get_url);
        //$html = file_get_contents($get_url);

        $html = $ret_data['curlData'];
        $full_link = $ret_data['full_link'];

        $full_link = addhttp($full_link);


        //parsing begins here:
        $doc = new DOMDocument();
        @$doc->loadHTML($html);
        $nodes = $doc->getElementsByTagName('title');
        if ($nodes->length == 0) {
            $title = $get_url;
        } else {
            $title = $nodes->item(0)->nodeValue;
        }
        //get and display what you need:
        $metas = $doc->getElementsByTagName('meta');
        for ($i = 0; $i < $metas->length; $i++) {
            $meta = $metas->item($i);
            if ($meta->getAttribute('name') == 'description')
                $description = $meta->getAttribute('content');
            if ($meta->getAttribute('name') == 'keywords')
                $keywords = $meta->getAttribute('content');
        }
        for ($i = 0; $i < $metas->length; $i++) {
            $meta = $metas->item($i);
            if ($meta->getAttribute('property') == 'og:title')
                $og_title = $meta->getAttribute('content');

            if ($meta->getAttribute('property') == 'og:url')
                $og_url = $meta->getAttribute('content');

            if ($meta->getAttribute('property') == 'og:image')
                $og_image = $meta->getAttribute('content');

            if ($meta->getAttribute('property') == 'og:description')
                $og_description = $meta->getAttribute('content');

            // for sociotube video share 
            if ($meta->getAttribute('property') == 'og:video_name')
                $og_video_name = $meta->getAttribute('content'); 

            // for sociotube youtube video share 
            if ($meta->getAttribute('property') == 'og:youtube_video_url')
                $youtube_video_url = $meta->getAttribute('content');    

        }

        //if no image found grab images from body
        if ($og_image != "") {
            $image_urls[] = $og_image;
        } else {
            $xpath = new DOMXPath($doc);
            $nodelist = $xpath->query("//img"); // find your image
            $imgCount = 0;

            for ($i = 0; $i < $nodelist->length; $i++) {
                $node = $nodelist->item($i); // current <img> node
                if (isset($node->attributes->getNamedItem('src')->nodeValue)) {
                    $src = $node->attributes->getNamedItem('src')->nodeValue;
                }
                if (isset($node->attributes->getNamedItem('src')->value)) {
                    $src = $node->attributes->getNamedItem('src')->value;
                }
                if (isset($src)) {
                    if (!preg_match('/blank.(.*)/i', $src) && filter_var($src, FILTER_VALIDATE_URL)) {
                        $image_urls[] = $src;
                        if ($imgCount == 10) break;
                        $imgCount++;
                    }
                }
            }
        }

        $page_title = ($og_title == "") ? $title : $og_title;
        if(!empty($og_video_name)){
            // for sociotube video share 
            $page_body = $og_video_name;
        }else{
            // for post share 
           $page_body = ($og_description == "") ? $description : $og_description; 
        }

        $output = array('title' => $page_title, 'images' => $image_urls, 'content' => $page_body, 'link' => $full_link,'video_name'=>$og_video_name,'youtube_video_url'=>$youtube_video_url);
        if ($ret == 1) {
            return $output; // return the array to the caller instead of printing it
        }
        echo json_encode($output); //output JSON data

        die;
    } else {
        $data = array('error' => "Url not found");
        if ($ret == 1) {
            return $data; // return the error array to the caller
        }
        echo json_encode($data);
        die;
    }
}

usage of the function

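$url = "https://www.alectronics.com";
// Pass 1 as the second argument to get the array back;
// with the default (0) the function echoes JSON and calls die.
$tagsArray = get_og_tags($url, 1);
print_r($tagsArray);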
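
The function calls two helpers, file_get_contents_curl() and addhttp(), that are not shown in this answer. The versions below are only guesses at what they might look like, inferred from how their return values are used (`curlData` and `full_link` keys, and a URL that gets a scheme prefixed); treat them as hypothetical stand-ins rather than the author's code.

```php
<?php
// Hypothetical stand-ins for the two helpers that get_og_tags() calls.
// Their real implementations are not part of the answer.

// Prefix a scheme when the URL has none.
function addhttp($url)
{
    if (!preg_match('~^(?:f|ht)tps?://~i', $url)) {
        $url = 'http://' . $url;
    }
    return $url;
}

// Fetch a page with cURL and return the body plus the final URL after redirects.
function file_get_contents_curl($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // give up after 10 seconds
    $body = curl_exec($ch);
    $finalUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);

    return array(
        'curlData'  => ($body === false) ? '' : $body, // empty string when the request fails
        'full_link' => $finalUrl,
    );
}
```

The cURL extension has to be enabled for file_get_contents_curl() to work.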
Muhammad Tahir
0

The more XMLish way would be to use XPath:

$xml = simplexml_load_file('http://ogp.me/');
$xml->registerXPathNamespace('h', 'http://www.w3.org/1999/xhtml');
$result = array();
foreach ($xml->xpath('//h:meta[starts-with(@property, \'og:\')]') as $meta) {
    $result[(string)$meta['property']]  = (string)$meta['content'];
}
print_r($result);

Unfortunately the namespace registration is needed if the HTML document uses a namespace declaration in the <html>-tag.
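
If the page is not well-formed XML, one possible workaround is to parse it with DOMDocument::loadHTML(), which tolerates sloppy markup, and then pass the tree to SimpleXML with simplexml_import_dom(). This is a sketch under the assumption that the HTML parser leaves elements without a namespace, so no registerXPathNamespace() call is made; the function name og_tags_simplexml() is invented for the example.

```php
<?php
// Sketch: parse with the lenient HTML parser, then query through SimpleXML.
// og_tags_simplexml() is a made-up name for this example.
function og_tags_simplexml(string $html): array
{
    // Collect libxml parse warnings internally, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Convert the DOM tree into a SimpleXMLElement rooted at <html>.
    $xml = simplexml_import_dom($doc);
    $result = array();
    foreach ($xml->xpath("//meta[starts-with(@property, 'og:')]") as $meta) {
        $result[(string) $meta['property']] = (string) $meta['content'];
    }
    return $result;
}

// Deliberately sloppy markup: the <p> element is never closed.
$page = '<html><head>'
    . '<meta property="og:title" content="Example">'
    . '<meta property="og:type" content="website">'
    . '</head><body><p>unclosed paragraph</body></html>';

$og = og_tags_simplexml($page);
print_r($og);
```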

Stefan Gehrig
  • Try your code with http://www.imdb.com/title/tt0120737/ ;-) Have gotten a really long list of warnings – zerkms Sep 17 '11 at 12:37
  • No need to start some sort of flamewar here... I actually would go for the `preg_match`-solution, but I just wanted to show a different and more elegant approach - which unfortunately does have some problems in the real world (most often due to the use of HTML entities or unescaped characters like `<`, `>`, `&` etc.) – Stefan Gehrig Sep 17 '11 at 12:48
  • Then, it is associated with unpredictable results; as a wide range of namespaces is used in html webpages. But I appreciate your way of thinking! – Googlebot Sep 17 '11 at 12:52
  • You're right. The approach is more suitable for a controlled environment in which you know your documents. The namespace issue could be solved by inspecting the declared namespaces, so in my opinion the bigger problem is that most HTML documents in the wild are far away from being standards compliant. – Stefan Gehrig Sep 17 '11 at 14:35
-1

With native PHP function get_meta_tags().

https://php.net/get_meta_tags

J. Doe
  • As stated in the question, "this only works for meta tags with name attributes", which Open Graph metatags don't have and so is completely useless for the required purpose. – WebSmithery Aug 12 '20 at 17:57