0
<div class="begin">...</div>

How can I match the HTML inside (and including) <div class="begin"> in PHP?

I need a regex solution that can handle the nested case.

Sean Vieira
user198729
  • 12
    Uh-oh... Did you just write some HTML in a question and then tag it with `regex`? (See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Dominic Rodger Jan 07 '10 at 10:00
  • 9
    Here comes Zalgo, aka Tony the Pony. We're all doooomed – gnud Jan 07 '10 at 11:17

5 Answers

11

Use DOM and DOMXPath instead of regex; you'll thank me for it:

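// something useful: serialize a single node (with its children) to an HTML string
function dumpDomNode ($node) {
    $temp = new DOMDocument();
    // a node from another document must be imported (deep copy) before it can be appended
    $temp->appendChild($temp->importNode($node, true));
    return $temp->saveHTML();
}

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect warnings about malformed HTML instead of printing them
$dom->loadHTML($html_string);

$xpath = new DOMXPath($dom);

// every div, at any depth, whose class attribute is exactly "begin"
$elements = $xpath->query("//div[@class='begin']");

foreach ($elements as $el) {
    echo dumpDomNode($el); // <-- or do something more useful with it
}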

Trying this with regex will lead you down the path to insanity...
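
For the nested case the question asks about, here is a minimal self-contained sketch. The sample markup in $html_string and the $results array are made up for illustration, and it assumes PHP 5.3.6 or later, where DOMDocument::saveHTML() accepts a node argument:

```php
<?php
// Hypothetical sample markup: a div nested inside the target div.
$html_string = '<p>before</p><div class="begin">outer <div>inner</div> tail</div><p>after</p>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$dom->loadHTML($html_string);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$results = array();
foreach ($xpath->query("//div[@class='begin']") as $el) {
    // saveHTML($node) serializes just that node and its descendants.
    $results[] = $dom->saveHTML($el);
}

echo implode("\n", $results), "\n";
```

Because the parser builds a tree, the inner div is simply a child of the matched element, so no tag counting is needed.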

slebetman
2

This sums it up pretty well.

In short, don't use regular expressions to parse HTML. Instead, look at the DOM classes, especially DOMDocument::loadHTML.

Community
Emil Vikström
  • 7
    No, you really, really, really, really don't. – Dominic Rodger Jan 07 '10 at 10:02
  • 1
    @unknown (google): why? Why would you choose a more error-prone solution when you can use something like what karim79 posted? – Bart Kiers Jan 07 '10 at 10:12
  • @Bart Kiers If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ. – Zenadix Oct 29 '15 at 15:32
2

Here is your Regex:

preg_match('/<div class=\"begin\">.*<\/div>/simU', $string, $matches);

But:

  • Regexes do not know what XML/HTML elements are. To them, HTML is just a string. This is why the others are right: regexes are for finding string patterns, not for parsing a DOM.
  • I have provided the regex because you do not intend to parse an entire HTML page, only to grab one defined piece of text from it, in which case a regex is fine to use.
  • If there is a nested DIV inside the DIV, the regex will not work as expected: the ungreedy .* stops at the first </div>, so the match is cut off right after the inner DIV closes. If this is the case, do not use regex. Use one of the other solutions, because then you need DOM parsing, not string matching.
  • For finding strings with a more or less clearly defined start and end, consider using regular string functions instead, as they are often quicker.
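
To make the nested-DIV caveat concrete, here is a small sketch. The sample input in $string is made up, and the pattern is the one from this answer with the redundant quote escapes and the unneeded m flag dropped:

```php
<?php
// Hypothetical sample input: one div nested inside the target div.
$string = '<div class="begin">outer <div>inner</div> tail</div>';

// U makes .* ungreedy, s lets . match newlines, i ignores case.
preg_match('/<div class="begin">.*<\/div>/siU', $string, $matches);

// The match ends at the first </div>, i.e. the inner one.
echo $matches[0], "\n"; // prints: <div class="begin">outer <div>inner</div>
```

The text " tail</div>" that actually closes the outer DIV is left out of the match.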
Gordon
1
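// Requires the PHP Simple HTML DOM Parser (simple_html_dom.php) to be included first.
// Create DOM from URL
$html = file_get_html('http://example.org/');

// Print the first div with class "begin", including its own tags
echo $html->find('div.begin', 0)->outertext;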

http://simplehtmldom.sourceforge.net/manual.htm

karim79
0

Here's one way using string functions (it does not handle nested divs, and the closing </div> is consumed by explode()):

$str= <<<A
blah
<div class="begin">
blah blah
blah
blah blah </div>
blah
A;

$s = explode("</div>", $str);
foreach ($s as $k => $v) {
    $m = strpos($v, '<div class="begin">');
    if ($m !== false) {
        echo substr($v, $m);
    }
}

output

$ php test.php
<div class="begin">
blah blah
blah
blah blah
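
If nested divs matter, the same string-function idea can be extended by counting opening and closing tags. This is only a rough sketch: extractBeginDiv() is a hypothetical helper, not part of PHP, and it assumes the markup is balanced, that no other tag name starts with "div", and that no div tags appear inside comments, scripts, or attribute values:

```php
<?php
// Hypothetical helper: returns the first <div class="begin">...</div> block,
// nested divs included, or null if it is missing or unbalanced.
function extractBeginDiv($str) {
    $open  = '<div class="begin">';
    $start = strpos($str, $open);
    if ($start === false) {
        return null;
    }
    $depth = 1;                       // we are inside the target div
    $pos   = $start + strlen($open);
    while ($depth > 0) {
        $nextOpen  = stripos($str, '<div', $pos);
        $nextClose = stripos($str, '</div>', $pos);
        if ($nextClose === false) {
            return null;              // ran out of closing tags: unbalanced
        }
        if ($nextOpen !== false && $nextOpen < $nextClose) {
            $depth++;                 // another div opens before the next close
            $pos = $nextOpen + strlen('<div');
        } else {
            $depth--;                 // a div closes
            $pos = $nextClose + strlen('</div>');
        }
    }
    return substr($str, $start, $pos - $start);
}

echo extractBeginDiv('a<div class="begin">x<div>y</div>z</div>b'), "\n";
```

This keeps the closing tag and survives nesting, but it is still string matching; a DOM parser remains the safer choice for real-world HTML.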
ghostdog74