scraping a web page for all headings and contents in php

Question

I have looked around the web on how to scrape all headings (h1 to h6) with content. Like this <h2>Some Heading</h2>, <h4>Some Heading</h4>. I have even looked at file_get_html() which PHP does not recognize. The code I have written so far lets you see the content but with out the h1 tags. I am new to this so if anyone can help me I would appreciate it. Here is my code I have now:

<html>
<head>
<title></title>
</head>
<body>
<?php
$theurl = "http://www.msn.com";
if(!($contents=file_get_contents($theurl)))
  {
    echo 'Could not open URL';
    exit;
}else{
echo "The $theurl is open <br />";
}
$pattern = "/<h[1-6]>(.*?)<\/h[1-6]>/si";
$found = preg_match_all($pattern,$contents,$matches);
if(is_array($matches) && count($matches) >= 1){
 echo "Scraping $theurl<br />";
for($i = 1; $i <= $found - 1; $i++){
echo $matches[0][$i];
}
 }else{
echo "No heading found";
 }
?>
</body>
</html>

If you're just trying to get the text between 2 tags and that's it, a regex works just fine. if you're trying to dissect an html document, you may want to go with a solution built with [`DomDocument`](http://php.net/manual/en/class.domdocument.php) — castis, Jan 28 '15 at 20:07
possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — , Jan 28 '15 at 20:07
trying to parse html with regular expressions - yup a duplicate — , Jan 28 '15 at 20:30
All of the examples I see either display the h1 tag or the content between the tags. I am looking to display the h1 open and closing tag along with the content. — Bigroad, Jan 28 '15 at 22:04

scraping a web page for all headings and contents in php

0 Answers0