The solution is to not use regular expressions on HTML. See this great article on the subject: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
The bottom line is that HTML is not a regular language, so regular expressions are a poor fit. You have to contend with variations in whitespace and potentially unclosed tags (who is to say the HTML you are scraping will always be valid?), among other challenges.
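To make the whitespace problem concrete, here is a small sketch. The pattern and the reflowed markup are made up for illustration; they are not from the question. The pattern is written against one exact layout, so it stops matching as soon as the same content is reformatted:

```php
<?php
// Hypothetical pattern written against one exact markup layout.
$pattern = '~<td class="things"><div class="stuff"><p>(.*?)</p></div></td>~';

// The markup exactly as the pattern expects it.
$exact = '<td class="things"><div class="stuff"><p>I need to capture this text.</p></div></td>';

// Same content, but with line breaks, indentation and a single-quoted attribute.
$reflowed = "<td class='things'>\n  <div class=\"stuff\">\n    <p>I need to capture this text.</p>\n  </div>\n</td>";

$exactMatches    = preg_match($pattern, $exact);    // 1: the layout is identical
$reflowedMatches = preg_match($pattern, $reflowed); // 0: same content, different layout

var_dump($exactMatches, $reflowedMatches);
```

You can keep patching the pattern for each new variation, but every patch only covers the cases you have already seen.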
Instead, use PHP's DOMDocument, impress your friends, AND do it the right way every time:
// create a new DOMDocument
$doc = new DOMDocument();
// load the string into the DOM
$doc->loadHTML('<td class="things"><div class="stuff"><p>I need to capture this text.</p></div></td>');
// since we are working with HTML fragments here, remove <!DOCTYPE
$doc->removeChild($doc->firstChild);
// likewise remove <html><body></body></html>
$doc->replaceChild($doc->firstChild->firstChild->firstChild, $doc->firstChild);
$contents = array();
// Loop through each <p> tag in the DOM and grab its text content.
// If you need selectors or more complex queries, consult the documentation.
foreach ($doc->getElementsByTagName('p') as $paragraph) {
    $contents[] = $paragraph->textContent;
}
print_r($contents);
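If you do need something closer to a selector, one option is DOMXPath, which ships with the same DOM extension. This is only a sketch under a few assumptions: the surrounding table markup and the second cell are invented for illustration, and the XPath expression matches the class attribute exactly, so it would not match an element carrying several classes.

```php
<?php
// Hypothetical input: the target cell plus a second cell we do NOT want.
$html = '<table><tr>'
      . '<td class="things"><div class="stuff"><p>I need to capture this text.</p></div></td>'
      . '<td class="other"><p>Not this text.</p></td>'
      . '</tr></table>';

// Collect parser warnings internally instead of printing them;
// scraped HTML is often malformed.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

$contents = array();
// Select only <p> elements nested anywhere inside <td class="things">.
foreach ($xpath->query('//td[@class="things"]//p') as $paragraph) {
    $contents[] = $paragraph->textContent;
}

print_r($contents);
```

Because the query runs against the whole document, there is no need to strip the doctype or the implied html and body wrappers first.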
Documentation: see the DOMDocument class reference in the PHP manual.
This PHP extension is regarded as "standard", and is usually already installed on most web servers -- no third-party scripts or libraries required. Enjoy!