How to regex scrape HTML and ignore whitespace and newlines in code?

Question

I'm putting together a quick script to scrape a page for some results and I'm having trouble figuring out how to ignore white space and new lines in my regex.

For example, here's how the page may present a result in HTML:

<td class="things">
    <div class="stuff">
        <p>I need to capture this text.</p>
    </div>
</td>

How would I change the following regex to ignore the spaces and new lines:

$regex = '/<td class="things"><div class="stuff"><p>(.*)<\/p><\/div><\/td>/i';

Any help would be appreciated. Help that also explains why you did something would be greatly appreciated!

[Tony the Pony](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) will be coming to get you... use a [DOM parser](http://www.php.net/manual/en/book.dom.php) instead ;-) – DaveRandom, Apr 03 '12 at 19:18
Simply don't use regexes for parsing HTML. Use a SAX parser or DOM parser instead. – fardjad, Apr 03 '12 at 19:18

anubhava · Accepted Answer · 2012-04-03T19:25:43.510

4

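I should caution you that you're playing with fire by using regex on HTML. That said, to answer your question, you can use this regex:

$regex = '#<td class="things">\s*<div class="stuff">\s*<p>(.*?)</p>\s*</div>\s*</td>#si';

Each \s* matches zero or more whitespace characters, including newlines, so the tags may be separated by any amount of indentation. The s modifier lets . match newlines as well, and using # as the delimiter means the / in the closing tags needs no escaping. The pattern is left unanchored and the capture is lazy, (.*?), so it can match a result anywhere in the page and stops at the first closing </p>.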
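As a rough sketch of how a whitespace-tolerant pattern of this kind could be applied, the snippet below runs it through preg_match_all. The `$html` sample, with its two differently indented results, is made up for illustration, and the pattern is assumed to be unanchored with a lazy capture so that it can pick up several results in one page.

```php
<?php
// Made-up sample input: the same markup once with indentation, once without.
$html = <<<HTML
<table><tr>
<td class="things">
    <div class="stuff">
        <p>I need to capture this text.</p>
    </div>
</td>
<td class="things"><div class="stuff"><p>And this one too.</p></div></td>
</tr></table>
HTML;

// Whitespace between tags is optional thanks to \s*; flags: s = dot matches newline, i = case-insensitive.
$regex = '#<td class="things">\s*<div class="stuff">\s*<p>(.*?)</p>\s*</div>\s*</td>#si';

// $matches[1] holds the text captured by the first (and only) group for every match.
preg_match_all($regex, $html, $matches);
$captured = $matches[1];

print_r($captured);
```

Both results are captured here because the markup still has exactly the attribute order and quoting the pattern expects; any other variation in the HTML would need another change to the pattern.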

Update: Here is the DOM Parser based code to get what you want:

$html = <<< EOF
<td class="things">
    <div class="stuff">
        <p>I need to capture this text.</p>
    </div>
</td>
EOF;
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query("//td[@class='things']/div[@class='stuff']/p");
for($i=0; $i < $nodelist->length; $i++) {
    $node = $nodelist->item($i);
    $val = $node->nodeValue;
    echo "$val\n"; // prints: I need to capture this text.
}
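
The XPath query above compares the whole class attribute, so it would miss an element written as class="things wide". A possible variation, sketched here with a hypothetical helper named `extractResults` and made-up sample markup, matches the class as a single token using standard XPath 1.0 string functions:

```php
<?php
// Hypothetical helper, not part of the original answer.
function extractResults(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Pad the class list with spaces so "things" and "stuff" match only as whole tokens.
    $query = "//td[contains(concat(' ', normalize-space(@class), ' '), ' things ')]"
           . "/div[contains(concat(' ', normalize-space(@class), ' '), ' stuff ')]/p";

    $results = [];
    foreach ($xpath->query($query) as $node) {
        $results[] = trim($node->textContent);
    }
    return $results;
}

// Made-up sample: the first cell has an extra class, the second is not a "things" cell.
$sample = '<table><tr><td class="things wide"><div class="stuff"><p> First result. </p></div></td>'
        . '<td class="other"><div class="stuff"><p>Skipped.</p></div></td></tr></table>';

print_r(extractResults($sample));
```

Restoring the previous libxml error setting keeps the helper from changing error handling for the rest of the script.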

And now please refrain from parsing HTML using regex in your code.

edited Apr 03 '12 at 19:25

answered Apr 03 '12 at 19:19

anubhava


Thanks. I originally did \s but forgot the *, which is where I went wrong. Aside from using something like the HTML Dom Parser, what would you suggest for scraping results from a page? – Apr 03 '12 at 19:25
I posted some code for you to encourage you to use DOM parser. Any reason why you don't want to use DOM? – anubhava Apr 03 '12 at 19:28
You should use the DOM parser, please see the blog post in my answer by our own Jeff Atwood on the subject. – Chris Baker Apr 03 '12 at 19:29

Thanks guys. I read that blog post and like DomDocument much better so far. I just wasn't aware of it before. – Apr 03 '12 at 19:38

Good to hear! As you can see in the sample code from this answer and mine, `DomDocument` is not very hard to use if you're already familiar with the DOM from coding JavaScript. A lot of people make the mistake of using regex on HTML because it is something they're familiar with, then people will shout at them "DON'T USE REGEX ON HTML" without providing the alternative. Now that you have the right tool, you'll find the job a lot easier :) +1 for a nice answer, @anubhava – Chris Baker Apr 03 '12 at 19:41

score 1 · Answer 2 · answered Apr 03 '12 at 19:19


SimpleHTMLDomParser will let you grab the content of a selected div or the contents of elements such as <p> <h1> <img> etc.

That might be a quicker way to achieve what you're trying to do.

answered Apr 03 '12 at 19:19

Samwise


Trying to stay away from any external plugins. – Apr 03 '12 at 19:23

score 1 · Answer 3 · answered Apr 03 '12 at 19:23

The solution is to not use regular expressions on HTML. See this great article on the subject: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

Bottom line is that HTML is not a regular language, so regular expressions are not a good fit. You have variations in white space, potentially unclosed tags (who is to say the HTML you are scraping is going to always be correct?), among other challenges.
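
To make the fragility concrete, here is a small sketch that runs a whitespace-tolerant pattern, like the one in the accepted answer, against a few made-up variants of the same markup. The variants and their labels are assumptions chosen for illustration; a browser would render all of them the same way.

```php
<?php
$regex = '#<td class="things">\s*<div class="stuff">\s*<p>(.*?)</p>\s*</div>\s*</td>#si';

// Made-up variants of one result.
$variants = [
    'original'        => '<td class="things"><div class="stuff"><p>Text</p></div></td>',
    'single quotes'   => "<td class='things'><div class='stuff'><p>Text</p></div></td>",
    'extra attribute' => '<td id="r1" class="things"><div class="stuff"><p>Text</p></div></td>',
    'unclosed <p>'    => '<td class="things"><div class="stuff"><p>Text</div></td>',
];

// Record for each variant whether the pattern finds a match.
$matched = [];
foreach ($variants as $label => $html) {
    $matched[$label] = preg_match($regex, $html) === 1;
    echo $label, ': ', $matched[$label] ? 'match' : 'no match', "\n";
}
```

Only the first variant matches. Each of the others would need its own adjustment to the pattern, which is the maintenance burden a parser avoids.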

Instead, use PHP's DomDocument, impress your friends, AND do it the right way every time:

    // create a new DOMDocument
    $doc = new DOMDocument();

    // load the string into the DOM
    $doc->loadHTML('<td class="things"><div class="stuff"><p>I need to capture this text.</p></div></td>');

    // since we are working with HTML fragments here, remove <!DOCTYPE
    $doc->removeChild($doc->firstChild);

    // likewise remove <html><body></body></html>
    $doc->replaceChild($doc->firstChild->firstChild->firstChild, $doc->firstChild);

    $contents = array();
    // loop through each <p> tag in the DOM and grab the contents
    // if you need to use selectors or get more complex here, consult the documentation
    foreach ($doc->getElementsByTagName('p') as $paragraph) {
        $contents[] = $paragraph->textContent;
    }

    print_r($contents);

Documentation

PHP's DomDocument - http://php.net/manual/en/class.domdocument.php
PHP's DomElement - http://www.php.net/manual/en/class.domelement.php

This PHP extension is regarded as "standard", and is usually already installed on most web servers -- no third-party scripts or libraries required. Enjoy!
