I am trying to extract all meta tags from a web page. I am currently using preg_match_all for that, but it returns empty strings for the array indexes.

 <?php
  $meta_tag_pattern = '/<meta(?:"[^"]*"[\'"]*|\'[^\']*\'[\'"]*|[^\'">])+>/';
  $meta_url = file_get_contents('test.html');
  if(preg_match_all($meta_tag_pattern, $meta_url, $matches) == 1)
    echo "there is a match <br>";

  print_r($matches);
?>

Returned array:

Array ( [0] => Array ( [0] => [1] => [2] => [3] => ) ) Array ( [0] => Array ( [0] => [1] => [2] => [3] => ) ) 
H Aßdøµ
  • Since `preg_match_all` returns the number of matches, I suggest you write only `if(preg_match_all($meta_tag_pattern, $meta_url, $matches) )`, or use `preg_match` if you are only looking for the first result. – Casimir et Hippolyte May 22 '14 at 16:20
  • It is easier to use DOMDocument to obtain the result you want. – Casimir et Hippolyte May 22 '14 at 16:21
  • Here's how I'd capture meta tags: `/<meta[^>]+>/`. What do you want to capture? The whole tag? The attributes? The attribute values? – bloodyKnuckles May 22 '14 at 16:26
  • @CasimiretHippolyte I thought it might be a logic error in my code, so I tried other ways to write the `if` statement. I am looking for a performant way to parse the page, which is why I didn't use `DOMDocument`. – H Aßdøµ May 22 '14 at 16:26
  • @bloodyKnuckles It returns empty strings too. I want to capture the whole tag. – H Aßdøµ May 22 '14 at 16:33
  • I understand, building the DOM Tree has a cost, but once it is done, the queries are fast. And don't forget that a regex has a cost too. – Casimir et Hippolyte May 22 '14 at 16:34
  • Add the `i` flag, i.e. make the pattern case-insensitive. Also check the HTML source of the output page to make sure the `print_r` output has not been parsed as HTML by the browser. – Deadooshka May 22 '14 at 16:56
  • @Deadooshka You were right, it turns out that the `print_r` output was rendered as HTML by the browser. To avoid that, I Googled and found this snippet: `function print_html_r($var) { ob_start(); print_r($var); $contents = ob_get_contents(); ob_end_clean(); print htmlentities($contents); }` Would you re-post your comment as an answer so I can accept it? – H Aßdøµ May 23 '14 at 16:33

2 Answers

An example with DOMDocument:

$url = 'test.html';

$dom = new DOMDocument();
@$dom->loadHTMLFile($url);

$metas = $dom->getElementsByTagName('meta');

foreach ($metas as $meta) {
    echo htmlspecialchars($dom->saveHTML($meta));
}
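
If the attributes are wanted rather than the whole tag, the same DOMDocument approach can be extended. This is only a sketch: the inline `$html` string is illustrative (the example above loads `test.html` instead), the `$tags` array is a name chosen here, and `libxml_use_internal_errors()` is used as an alternative to the `@` operator for silencing parser warnings.

```php
<?php
// Illustrative markup; the example above loads 'test.html' instead.
$html = '<html><head><meta name="description" content="Demo page">'
      . '<meta charset="utf-8"></head><body></body></html>';

// Collect libxml parser warnings internally instead of suppressing them with @.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Build one associative array of attribute name => value per <meta> tag.
$tags = [];
foreach ($dom->getElementsByTagName('meta') as $meta) {
    $attributes = [];
    foreach ($meta->attributes as $attr) {
        $attributes[$attr->name] = $attr->value;
    }
    $tags[] = $attributes;
}

// Escape the dump so a browser shows it as text.
echo htmlspecialchars(print_r($tags, true));
```

Each entry of `$tags` then holds the attributes of one meta tag, so a value such as the description can be read with `$tags[0]['content']`.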
Casimir et Hippolyte

UPDATED: Example grabbing meta tags from a URL:

$meta_tag_pattern = '/<meta\s[^>]+>/';
$meta_url = file_get_contents('http://stackoverflow.com/questions/10551116/html-php-escape-and-symbols-while-echoing');
if(preg_match_all($meta_tag_pattern, $meta_url, $matches))
  echo "there is a match <br>";

foreach ( $matches[0] as $value ) {
    print htmlentities($value) . '<br>';
}

Outputs:

there is a match 
<meta name="twitter:card" content="summary">
<meta name="twitter:domain" content="stackoverflow.com"/>
<meta name="og:type" content="website" />
...

Part of the problem seems to be that the browser renders the matched meta tags as HTML instead of displaying them as text when you print_r the output, so they need to be escaped (for example with htmlentities, as above).
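
A minimal sketch of that escaping step for a whole `print_r` dump, assuming the result is viewed in a browser. The helper name `dump_escaped` and the sample `$html` string are hypothetical. Passing `true` as the second argument makes `print_r` return the dump as a string, so no output buffering is needed, and the `i` flag also matches upper-case `<META ...>` tags.

```php
<?php
// Hypothetical helper: returns a print_r() dump that is safe to show in a browser.
function dump_escaped($var): string
{
    // print_r(..., true) returns the dump instead of echoing it.
    $dump = print_r($var, true);
    // Escape <, >, & and quotes so the browser displays tags as text.
    return '<pre>' . htmlspecialchars($dump, ENT_QUOTES, 'UTF-8') . '</pre>';
}

// Illustrative input, not taken from the original page.
$html = '<head><META charset="utf-8">'
      . '<meta name="og:type" content="website" /></head>';

// The i flag makes the pattern case-insensitive.
preg_match_all('/<meta\s[^>]+>/i', $html, $matches);

echo dump_escaped($matches);
```

Viewing the page source remains a quick way to check what was actually printed, independent of how the browser renders it.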

bloodyKnuckles
  • Would you try to extract the `meta` tags of this [stackoverflow](http://stackoverflow.com/questions/23812216/preg-match-returns-an-empty-string-even-there-is-a-match/) page as an example? – H Aßdøµ May 22 '14 at 16:44
  • The `print_r` output was rendered as HTML by the browser; I only saw it by viewing the page source. That is why I didn't see the output and thought it was empty strings. – H Aßdøµ May 23 '14 at 16:36