2 preg matches VS 1?

Question

I am trying to get the title of a page in two ways:

from the HTML <title> tag and from the Open Graph og:title meta property.

So I'm using the following regex expressions:

$title_expression = "/<title>([^<]*)<\/title>/"; 
$title_og_expression = "/og:title[^>]+content=\"([^\"]*)\"[^>]*>/"; 

preg_match($this->title_expression, $this->content, $match_title);
preg_match($this->title_og_expression, $this->content, $match_title2);

$output = $match_title[1].'+'.$match_title2[1];

Is there a way I can do this with only one preg_match?

Note that I don't want One OR the Other, but rather BOTH values.

Thanks for your advice!

[preg_match_all](http://us3.php.net/preg_match_all) should do the trick. It applies the regex to the subject string as many times as it matches, so you should be able to simply join the two regexes with an OR (`|`). — Kiruse, Dec 02 '13 at 18:45
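
A minimal sketch of what this comment suggests, assuming a hypothetical `$content` string that stands in for `$this->content`. The two patterns from the question are joined with `|`, and `preg_match_all()` with `PREG_SET_ORDER` returns one entry per match. Which capture group is present tells you which alternative matched.

```php
<?php
// Hypothetical sample page; stands in for $this->content from the question.
$content = '<html><head><title>Page Title</title>'
         . '<meta property="og:title" content="OG Title" /></head></html>';

// Both patterns from the question joined with "|".
// Group 1 captures the <title> text, group 2 the og:title content.
$pattern = '/<title>([^<]*)<\/title>|og:title[^>]+content="([^"]*)"[^>]*>/';

$title = '';
$ogTitle = '';

preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);
foreach ($matches as $m) {
    if (isset($m[2])) {
        // Group 2 is only present when the og:title alternative matched.
        $ogTitle = $m[2];
    } else {
        $title = $m[1];
    }
}

$output = $title . '+' . $ogTitle;
echo $output, "\n";
```

This still relies on the markup being regular (for example, `property` appearing before `content`), which is why the answers below recommend a DOM parser.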

Casimir et Hippolyte · Answer 1 · 2013-12-02T19:27:10.977

Using DOMDocument is more appropriate for this task:

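$doc = new DOMDocument();
// Suppress the warnings that malformed real-world HTML would otherwise trigger.
@$doc->loadHTML($this->content);

// item(0) is null when the page has no <title>, so check before reading textContent.
$titleNode = $doc->getElementsByTagName('title')->item(0);
$title = $titleNode ? $titleNode->textContent : '';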

$metas = $doc->getElementsByTagName('meta');

$ogtitle = '';

foreach ($metas as $meta) {
    if ($meta->getAttribute('property') == 'og:title') {
        $ogtitle = $meta->getAttribute('content');
        break;
    }
}
$output = $title . '+' . $ogtitle;
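
The `@` operator hides every error raised by `loadHTML()`. As a hedged variant, the sketch below collects parser warnings with `libxml_use_internal_errors()` instead. The helper name `extractTitles()` and the sample markup are assumptions made for illustration; they are not part of the original answer.

```php
<?php
// Hypothetical helper (not from the original answer): returns "title+og:title".
function extractTitles(string $html): string
{
    // Collect libxml parse warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // item(0) is null when the document has no <title> element.
    $titleNode = $doc->getElementsByTagName('title')->item(0);
    $title = $titleNode ? $titleNode->textContent : '';

    $ogTitle = '';
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if ($meta->getAttribute('property') === 'og:title') {
            $ogTitle = $meta->getAttribute('content');
            break;
        }
    }

    return $title . '+' . $ogTitle;
}

// Hypothetical sample input.
$sample = '<html><head><title>Page Title</title>'
        . '<meta property="og:title" content="OG Title"></head></html>';
echo extractTitles($sample), "\n";
```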

score 0 · Accepted Answer · answered Dec 02 '13 at 19:24

Here's an expression that will match both, but you'll have to check two capture groups to see which one matched. Will this work?

/<title>(.*?)<\/title>|og:title[^>]*?content="(.*?)"/i

Fiddle: http://www.rexfiddle.net/N3Hth2o

Edit: This expression will match both with one capture group, but is dangerous because it can potentially match something inside the title if the title has HTML-like characters in it. Again, a better way to do this is with a DOM parser.

/(?:<title>|og:title[^>]*?content=")(.*?)(?:<\/title>|"[^>]*>)/i

Fiddle: http://www.rexfiddle.net/NBXP5rq
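
Another way to get a single capture group, without mixing the opening and closing delimiters of the two alternatives, is PCRE's branch reset group `(?|...)`, which numbers the capture groups in every alternative from the same starting point. This is a sketch and not part of the original answer: the `$content` sample is hypothetical, and the pattern assumes that `property` comes before `content` in the meta tag.

```php
<?php
// Hypothetical sample page used only for illustration.
$content = '<html><head><title>Page Title</title>'
         . '<meta property="og:title" content="OG Title" /></head></html>';

// (?| ... ) is a branch reset group: the capture in each alternative is group 1.
$pattern = '/(?|<title>(.*?)<\/title>|og:title[^>]*?content="(.*?)")/is';

preg_match_all($pattern, $content, $matches);

// $matches[1] holds every captured value in document order.
$output = implode('+', $matches[1]);
echo $output, "\n";
```

Because each alternative keeps its own closing delimiter, a quote inside the title cannot end the title match early.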

score 0 · Answer 3 · answered Dec 02 '13 at 19:44

Do not use RegEx to parse HTML. DOM+Xpath are the tools you need.

DOMXPath::evaluate() lets you do this with a single XPath expression:

$html = <<<'HTML'
<html prefix="og: http://ogp.me/ns#">
  <head>
    <title>Title</title>
    <meta property="og:title" content="OG Title" />
  </head>
</html>
HTML;

$dom = new DOMDocument();
$dom->loadHtml($html);
$xpath = new DOMXpath($dom);

$title = $xpath->evaluate('concat(/html/head/title, "+", /html/head/meta[@property = "og:title"]/@content)');
var_dump($title);

Output:

string(14) "Title+OG Title"

concat() is an XPath function that concatenates all of its arguments. If an argument is a node set, the string value of its first node is used.

/html/head/title selects the title element.

/html/head/meta[@property = "og:title"]/@content fetches the content attribute of the meta element with the property attribute "og:title".
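
To illustrate the node-set rule for concat(), the sketch below uses a hypothetical document with two og:title meta elements; only the first one contributes to the result. The same DOMXPath object can also evaluate each value on its own with string(), which yields an empty string when nothing matches. The markup and the og:image lookup are assumptions added for this example.

```php
<?php
// Hypothetical markup with a duplicate og:title to show the "first node" rule.
$html = <<<'HTML'
<html>
  <head>
    <title>Title</title>
    <meta property="og:title" content="First OG Title" />
    <meta property="og:title" content="Second OG Title" />
  </head>
</html>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// concat() converts each node set to the string value of its first node.
$both = $xpath->evaluate(
    'concat(/html/head/title, "+", /html/head/meta[@property = "og:title"]/@content)'
);

// string() applies the same conversion to a single expression.
$title = $xpath->evaluate('string(/html/head/title)');
// No og:image meta element exists here, so the node set is empty.
$missing = $xpath->evaluate('string(/html/head/meta[@property = "og:image"]/@content)');

var_dump($both, $title, $missing);
```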

Why is it better to use loadHTML VS DOM? I've had trouble scraping websites with DOM and ended up using https://github.com/morshedalam/url-scraper-php — James, Dec 04 '13 at 11:28
You mean DOM vs PCRE. http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php/3577662#3577662 — ThW, Dec 04 '13 at 11:32
