I am making a web crawler and I need to extract the meta tag that contains the description. This is what I did:

$html = file_get_contents('http://www.google.com');
preg_match('/<meta name="description" content="(.*)"/>\i', $html, $description);
$description_out = $description;
var_dump($description_out);

I get this error:

Warning: preg_match(): Unknown modifier '>' in C:\xampp\htdocs\webcrawler\php-web-crawler\index.php on line 21

What is the correct regular expression?

  • 1
    Possible duplicate of [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Yunnosch Jun 30 '18 at 05:56

2 Answers

0

Your pattern is incorrect. You start with a / delimiter and then have an unescaped / inside the pattern. That unescaped / ends the pattern, and everything after it is read as modifiers, which is why PHP complains about the unknown modifier '>'.
Your closing delimiter was also the wrong way round: you wrote \ where it should be /.

'/<meta name="description" content="(.*)"\/>/i',
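A different delimiter also avoids the escaping problem. The following is a minimal sketch, not part of the original answer: the `$html` string is a made-up sample standing in for the fetched page, and `~` is just one of several delimiters PCRE accepts.

```php
<?php
// Hypothetical sample input; a real crawler would use the fetched page instead.
$html = '<meta name="description" content="Example description"/>';

// With ~ as the delimiter, the / in "/>" no longer needs to be escaped.
$found = preg_match('~<meta name="description" content="(.*)"/>~i', $html, $description);

var_dump($found);          // 1 when the pattern matched, 0 otherwise
var_dump($description[1]); // the captured content attribute
```

Either form works; the point is only that the delimiter character must not appear unescaped inside the pattern.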
Andreas
  • That works, but there is a problem. When I crawl Twitter, for example, that page does not contain a meta description. How can I check whether there is one or not? Also, for the pages I can crawl, the content is returned, but with this at the end: "/> – Diesan Romero Jun 30 '18 at 06:07
  • I assume if there is no meta the return array is empty? I.e. count ==0? – Andreas Jun 30 '18 at 06:09
  • When I try with google I get this error: Notice: Undefined offset: 1 in C:\xampp\htdocs\webcrawler\php-web-crawler\index.php on line 24 – Diesan Romero Jun 30 '18 at 06:12
  • this is my line 24: $description_out = $description[1]; – Diesan Romero Jun 30 '18 at 06:12
  • Yes and that is because the array is empty. Nothing wrong. It's your code that is not checking if it is empty or not that is at fault. Never assign variables from arrays if you don't know if the data is there or not. Check first if it's empty, count is zero or isset. – Andreas Jun 30 '18 at 06:14
  • so I need to put: `if($description == NULL) { return false } else { $description_out = $description[1]; }` to validate? – Diesan Romero Jun 30 '18 at 06:17
  • An array from preg_match is never null. Use empty, count or isset. `if(isset($description[1])){ $description_out = $description[1]; }` – Andreas Jun 30 '18 at 06:43
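The checks discussed in these comments can be put together roughly as follows. This is a sketch under assumptions: `extractDescription` is a hypothetical helper name, the sample strings are invented, and the return type syntax needs PHP 7.1 or later. It also replaces the greedy `(.*)` with `[^"]*`, which stops at the first closing quote. Whether greediness is what caused the trailing `"/>` mentioned above is a guess, but a greedy group can run past the attribute when several tags share a line.

```php
<?php
/**
 * Hypothetical helper: returns the meta description, or null when none is found.
 */
function extractDescription(string $html): ?string
{
    // [^"]* stops at the first closing quote, unlike the greedy (.*).
    // \s*\/?> accepts both "/>" and ">" tag endings.
    $pattern = '/<meta name="description" content="([^"]*)"\s*\/?>/i';

    // Only read $matches[1] once we know the match succeeded.
    if (preg_match($pattern, $html, $matches) === 1 && isset($matches[1])) {
        return $matches[1];
    }
    return null;
}

// Invented sample inputs: one page with a description, one without.
$withMeta = '<head><meta name="description" content="A sample page"/><link rel="x" href="y"/></head>';
$withoutMeta = '<head><title>No description here</title></head>';

var_dump(extractDescription($withMeta));    // "A sample page"
var_dump(extractDescription($withoutMeta)); // NULL
```

Returning null for a missing tag lets the caller decide what to do with pages that have no description, without triggering the undefined offset notice.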
0

As an alternative, instead of using a regex you might use DOMDocument and DOMXPath with an xpath expression /html/head/meta[@name="description"]/@content to get the content attribute.

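// $html is the page source fetched earlier, e.g. with file_get_contents().
$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);
// Select the content attribute of <meta name="description"> inside <head>.
$items = $xpath->query('/html/head/meta[@name="description"]/@content');
foreach ($items as $item) {
    echo $item->value . "<br>";
}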

$items is a DOMNodeList, which you can loop over with a foreach, for example. Each $item is a DOMAttr, from which you can read the value.
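If only the first description is needed, or the page might have none, the node list can be checked before reading from it. The sketch below makes assumptions beyond the answer: the `$html` string is an invented sample, and the `libxml_use_internal_errors` call is an addition to keep parse warnings from real-world markup quiet.

```php
<?php
// Hypothetical sample input standing in for the fetched page.
$html = '<html><head><title>t</title><meta name="description" content="A sample page"></head><body></body></html>';

$document = new DOMDocument();
libxml_use_internal_errors(true); // addition: collect parse warnings instead of printing them
$document->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($document);
$items = $xpath->query('/html/head/meta[@name="description"]/@content');

// DOMNodeList::length is 0 when the page has no matching meta tag.
$descriptionOut = $items->length > 0 ? $items->item(0)->value : null;

var_dump($descriptionOut);
```

Here a missing tag yields null rather than a notice, which mirrors the isset check suggested for the regex approach.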

The fourth bird