I am trying to get the title of a page in two ways:

from the HTML <title> tag and from the Open Graph og:title meta tag.

So I'm using the following regular expressions:

$title_expression = "/<title>([^<]*)<\/title>/"; 
$title_og_expression = "/og:title[^>]+content=\"([^\"]*)\"[^>]*>/"; 

preg_match($this->title_expression, $this->content, $match_title);
preg_match($this->title_og_expression, $this->content, $match_title2);

$output = $match_title[1].'+'.$match_title2[1];

Is there a way I can do this with only one preg_match?

Note that I don't want one or the other; I want both values.

Thanks for your advice!

James
  • You should be using DOM operations, not regexes. – Marc B Dec 02 '13 at 18:40
  • [preg_match_all](http://us3.php.net/preg_match_all) should do the trick. It applies the regex as often as possible to the string to test so you should be able to simply join the two regexes with an OR. – Kiruse Dec 02 '13 at 18:45

3 Answers

Using DOMDocument is more appropriate for this task:

$doc = new DOMDocument();
@$doc->loadHTML($this->content);

$title = $doc->getElementsByTagName('title')->item(0)->textContent;

$metas = $doc->getElementsByTagName('meta');

$ogtitle = '';

foreach ($metas as $meta) {
    if ($meta->getAttribute('property') == 'og:title') {
        $ogtitle = $meta->getAttribute('content');
        break;
    }
}
$output = $title . '+' . $ogtitle;
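
The snippet above assumes the page always has a <title> element; if it does not, item(0) returns null and reading textContent fails. A minimal sketch of a guarded variant follows. The function name extractTitles, the sample markup, and the empty-string fallback are illustrative assumptions, not part of the original answer:

```php
<?php
// Hypothetical helper: returns [title, og:title], using '' when a value is missing.
function extractTitles(string $content): array
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($content);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // item(0) is null when the document has no <title> element.
    $titleNode = $doc->getElementsByTagName('title')->item(0);
    $title = $titleNode !== null ? $titleNode->textContent : '';

    $ogTitle = '';
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if ($meta->getAttribute('property') === 'og:title') {
            $ogTitle = $meta->getAttribute('content');
            break;
        }
    }

    return [$title, $ogTitle];
}

// Sample input without a <title>, to exercise the fallback.
$sample = '<html><head><meta property="og:title" content="OG Title"></head><body></body></html>';
[$title, $ogTitle] = extractTitles($sample);
echo $title . '+' . $ogTitle, PHP_EOL; // prints "+OG Title"
```
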
Casimir et Hippolyte

Here's an expression that will match both, but you'll have to check two capture groups to see which one matched. Will this work?

/<title>(.*?)<\/title>|og:title.*?content="(.*?)"/i

Fiddle: http://www.rexfiddle.net/N3Hth2o
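
Because an alternation matches only one branch per attempt, a single preg_match call returns just one of the two values. To collect both, a pattern of this shape can be run with preg_match_all, as the comment on the question suggests. The following is a rough sketch under assumptions: the sample markup is made up, and the og:title branch borrows the [^>]+ form from the question so that it matches a meta tag rather than a literal <og:title tag:

```php
<?php
// Hypothetical sample input; real pages may order or quote attributes differently.
$content = <<<'HTML'
<html>
  <head>
    <title>Page Title</title>
    <meta property="og:title" content="OG Title" />
  </head>
</html>
HTML;

// One pattern, two alternatives: group 1 captures <title>, group 2 captures og:title.
$pattern = '/<title>(.*?)<\/title>|og:title[^>]+content="([^"]*)"/is';

$title = '';
$ogTitle = '';

// PREG_SET_ORDER yields one array per match, so each match can be inspected in turn.
if (preg_match_all($pattern, $content, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        if ($match[1] !== '') {
            // The <title> alternative matched.
            $title = $match[1];
        } elseif (isset($match[2]) && $match[2] !== '') {
            // The og:title alternative matched.
            $ogTitle = $match[2];
        }
    }
}

$output = $title . '+' . $ogTitle;
echo $output, PHP_EOL; // prints "Page Title+OG Title"
```

This is still one regex call, but it inherits the usual fragility of regex-based HTML handling, for example when content appears before property in the meta tag.
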


Edit: This expression will match both with one capture group, but is dangerous because it can potentially match something inside the title if the title has HTML-like characters in it. Again, a better way to do this is with a DOM parser.

/(?:<title>|og:title.*?content=")(.*?)(?:<\/title>|".*?>)/i

Fiddle: http://www.rexfiddle.net/NBXP5rq

qJake

Do not use RegEx to parse HTML. DOM+Xpath are the tools you need.

DOMXPath::evaluate() lets you do this with a single XPath expression:

$html = <<<'HTML'
<html prefix="og: http://ogp.me/ns#">
  <head>
    <title>Title</title>
    <meta property="og:title" content="OG Title" />
  </head>
</html>
HTML;

$dom = new DOMDocument();
$dom->loadHtml($html);
$xpath = new DOMXpath($dom);

$title = $xpath->evaluate('concat(/html/head/title, "+", /html/head/meta[@property = "og:title"]/@content)');
var_dump($title);

Output:

string(14) "Title+OG Title" 

concat() is an XPath function that concatenates all of its arguments. If an argument is a node set, the text content of its first node is used.

/html/head/title selects the title element.

/html/head/meta[@property = "og:title"]/@content fetches the content attribute of the meta element with the property attribute "og:title".
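
The "first node" rule and the behaviour for a missing node can be checked with a small sketch that evaluates each path on its own through the XPath string() function. The sample markup, the duplicated og:title tag, and the og:image lookup are assumptions added for illustration:

```php
<?php
// Hypothetical sample document with two og:title candidates.
$html = <<<'HTML'
<html>
  <head>
    <title>Title</title>
    <meta property="og:title" content="First OG Title" />
    <meta property="og:title" content="Second OG Title" />
  </head>
</html>
HTML;

// Keep libxml parser warnings out of the output.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// string() converts a node set to the string value of its first node in document order.
$title = $xpath->evaluate('string(/html/head/title)');
$ogTitle = $xpath->evaluate('string(/html/head/meta[@property = "og:title"]/@content)');

// An empty node set converts to an empty string rather than raising an error.
$missing = $xpath->evaluate('string(/html/head/meta[@property = "og:image"]/@content)');

var_dump($title, $ogTitle, $missing);
```

Evaluating the paths separately keeps the two values apart, which makes it possible to tell which one was absent; the concat() form returns them already joined.
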

ThW
  • Why is it better to use loadHTML vs DOM? I've had trouble scraping websites with DOM and ended up using https://github.com/morshedalam/url-scraper-php – James Dec 04 '13 at 11:28
  • You mean DOM vs PCRE. http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php/3577662#3577662 – ThW Dec 04 '13 at 11:32