preg_match returns an empty string even when there is a match

Question

I am trying to extract all meta tags from a web page. I am currently using preg_match_all for that, but it returns empty strings for the array elements.

 <?php
  $meta_tag_pattern = '/<meta(?:"[^"]*"[\'"]*|\'[^\']*\'[\'"]*|[^\'">])+>/';
  $meta_url = file_get_contents('test.html');
  if(preg_match_all($meta_tag_pattern, $meta_url, $matches) == 1)
    echo "there is a match <br>";

  print_r($matches);
?>

Returned array:

Array ( [0] => Array ( [0] => [1] => [2] => [3] => ) ) Array ( [0] => Array ( [0] => [1] => [2] => [3] => ) )

Since `preg_match_all` returns the number of matches, I suggest you write only `if(preg_match_all($meta_tag_pattern, $meta_url, $matches))`, or use `preg_match` if you are only looking for the first result. – Casimir et Hippolyte, May 22 '14 at 16:20
It is easier to use DOMDocument to obtain the result you want. – Casimir et Hippolyte, May 22 '14 at 16:21
Here's how I'd capture meta tags: `/<meta[^>]+>/`. What do you want to capture? The whole tag? The attributes? The attribute values? – bloodyKnuckles, May 22 '14 at 16:26
@CasimiretHippolyte I thought it might be a logic error in my code, so I tried other ways of writing the `if` statement. I am looking for a fast way to parse the page, which is why I didn't use `DOMDocument`. – H Aßdøµ, May 22 '14 at 16:26
@bloodyKnuckles It returns empty strings too. I want to capture the whole tag. – H Aßdøµ, May 22 '14 at 16:33
I understand, building the DOM tree has a cost, but once it is done, the queries are fast. And don't forget that a regex has a cost too. – Casimir et Hippolyte, May 22 '14 at 16:34
Add the `i` flag to make the pattern case-insensitive. Also check the HTML source of the page to make sure the output of `print_r` has not been parsed as HTML by the browser. – Deadooshka, May 22 '14 at 16:56
@Deadooshka You were right: the `print_r` output was being rendered as HTML by the browser. To avoid that, I searched and found this snippet: `function print_html_r($var) { ob_start(); print_r($var); $contents = ob_get_contents(); ob_end_clean(); print htmlentities($contents); }` Would you re-post your comment as an answer so I can accept it? – H Aßdøµ, May 23 '14 at 16:33

Casimir et Hippolyte · Answer 1 · 2014-05-22T17:11:39.203

3

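An example with DOMDocument:

$url = 'test.html';

$dom = new DOMDocument();
// The @ suppresses the warnings libxml emits for malformed HTML.
@$dom->loadHTMLFile($url);

// Collect every <meta> element in the document.
$metas = $dom->getElementsByTagName('meta');

foreach ($metas as $meta) {
    // Serialize the tag back to HTML and escape it so the browser shows it as text.
    echo htmlspecialchars($dom->saveHTML($meta));
}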
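
If the attributes are needed rather than the serialized tag, the same approach can be extended. The following is only a sketch: the helper name `extract_meta_attributes` and the sample HTML are made up for illustration, and it uses `libxml_use_internal_errors()` instead of `@` to keep parser warnings quiet.

```php
<?php
// Hypothetical helper (not part of the answer above): returns one
// associative array of attribute name => value per <meta> tag.
function extract_meta_attributes(string $html): array
{
    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $result = [];
    foreach ($dom->getElementsByTagName('meta') as $meta) {
        $attributes = [];
        foreach ($meta->attributes as $attribute) {
            $attributes[$attribute->name] = $attribute->value;
        }
        $result[] = $attributes;
    }

    return $result;
}

// Sample input, assumed for the example.
$html = '<html><head><meta name="description" content="Demo page">'
      . '<meta charset="utf-8"></head><body></body></html>';

foreach (extract_meta_attributes($html) as $attributes) {
    foreach ($attributes as $name => $value) {
        // Escape before echoing so the browser displays the text as-is.
        echo htmlspecialchars($name . '=' . $value), ' ';
    }
    echo "<br>\n";
}
```

To read a file such as `test.html`, the contents returned by `file_get_contents()` could be passed to the helper.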

edited May 22 '14 at 17:11

answered May 22 '14 at 16:36


Great answer. DOMDocument is always better than other “ripping apart” methods. – Giacomo1968 May 22 '14 at 17:19
@JakeGould: Yes, however using `preg_match_all` for the same task, with a good pattern and HTML that is not too badly formatted, is 100x faster. – Casimir et Hippolyte May 22 '14 at 17:32

bloodyKnuckles · Accepted Answer · 2014-05-22T17:32:37.377

1

UPDATED: Example grabbing meta tags from a URL:

$meta_tag_pattern = '/<meta\s[^>]+>/';
$meta_url = file_get_contents('http://stackoverflow.com/questions/10551116/html-php-escape-and-symbols-while-echoing');
if(preg_match_all($meta_tag_pattern, $meta_url, $matches))
  echo "there is a match <br>";

foreach ( $matches[0] as $value ) {
    print htmlentities($value) . '<br>';
}

Outputs:

there is a match 
<meta name="twitter:card" content="summary">
<meta name="twitter:domain" content="stackoverflow.com"/>
<meta name="og:type" content="website" />
...

It looks like part of the problem is that the browser renders the meta tags as HTML instead of displaying them as text when you print_r the output, so they need to be escaped.
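
As a rough illustration of that escaping step, the sketch below escapes the whole `print_r` dump at once. The helper name `print_r_escaped` and the sample HTML are assumptions made for this example. The pattern also carries the `i` flag suggested in the comments, so upper-case `<META>` tags are matched too.

```php
<?php
// Hypothetical helper: returns print_r() output escaped for an HTML page.
function print_r_escaped($value): string
{
    // With true as the second argument, print_r() returns the text
    // instead of printing it, so no output buffering is needed.
    $dump = print_r($value, true);

    return '<pre>' . htmlspecialchars($dump, ENT_QUOTES, 'UTF-8') . '</pre>';
}

// Sample input, assumed for the example.
$html = '<head><META charset="utf-8">'
      . '<meta name="og:type" content="website" /></head>';

// preg_match_all() returns the number of full matches it found.
$count = preg_match_all('/<meta\s[^>]+>/i', $html, $matches);

echo "matches: $count<br>\n";
echo print_r_escaped($matches);
```

Viewing the page source, rather than the rendered page, is another quick way to confirm that the matches are not actually empty.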

edited May 22 '14 at 17:32

answered May 22 '14 at 16:37


Would you try extracting the `meta` tags of the [stackoverflow](http://stackoverflow.com/questions/23812216/preg-match-returns-an-empty-string-even-there-is-a-match/) page as an example? – H Aßdøµ May 22 '14 at 16:44
The `print_r` output was being rendered as HTML by the browser, which I confirmed by viewing the page source. That is why I didn't see the output and thought the strings were empty. – H Aßdøµ May 23 '14 at 16:36
