
I'm trying to extract the content of a div with PHP, regardless of its class name and other attributes.

The div may be multiline or single-line and may have multiple attributes, such as


<div class="my-class additional-class"><div class="my-class2">
<div class="my-class"></div>
</div>

</div>

and I would like to get all the content of the first div, without the first div itself:

<div class="my-class2">
<div class="my-class"></div>
</div>

Normally, I'd guess

/<div.*>(.*)<\/div>/mU

should have worked, but I'm not sure why it doesn't.

I've come across this one

(?s)(?<=<div\sclass="test">\n).*(?=<\/div>)

which works with the class name test, but I couldn't make it work as

(?s)(?<=<div.*>\n)(.*)(?=<\/div>)

Any help is appreciated.

Thank you,

MeCe
  • Better you use `DOM` parser – anubhava May 11 '21 at 06:43
  • DOM parser is my second option. I don't think it would work as well as regex in this matter. – MeCe May 11 '21 at 06:51
  • And why don't you think it would work as well as a regex in this matter? Considering that regular expressions are generally not capable enough to process the html language? And considering that a DOM parser _is_ capable of that? – arkascha May 11 '21 at 07:15
  • DOM parser doesn't work and needs time to figure out the correct encoding in some cases. You would need to figure out `mb_detect_encoding` and `mb_convert_encoding`. Also some users don't install XML on their server. – MeCe May 11 '21 at 07:29
  • Wrong dupe as this question is asking to find content of outermost `div` only and there is no answer that has an answer like provided below. – anubhava May 12 '21 at 16:13

3 Answers


Here is a way to get it using DOM parser:

<?php
$html = '<div class="my-class additional-class"><div class="my-class2">
<div class="my-class"></div>
</div>
</div>';

$doc = new DOMDocument();
$doc->loadHTML($html); // loads your html
$elems = $doc->getElementsByTagName('div'); // find all div elements
$outerdiv = $elems->item(0); // outermost div
echo $outerdiv->childNodes[0]->C14N() . "\n"; // print the first child node (here, the whole inner HTML)

/*
<div class="my-class2">
<div class="my-class"></div>
</div>
*/
?>
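
The snippet above prints only the first child node, which is enough for the sample markup. A slightly more general sketch, assuming you want every child of the first div concatenated, might look like this. The function name `firstDivInnerHtml` is hypothetical, and the exact whitespace of the serialized output can vary with the libxml version, so treat the result as markup rather than a byte-exact string.

```php
<?php
// Hypothetical helper, not part of the original answer: returns the inner
// HTML of the first <div> in document order, or null when there is no <div>.
function firstDivInnerHtml(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $outer = $doc->getElementsByTagName('div')->item(0);
    if ($outer === null) {
        return null;
    }

    // Serialize every child node (elements and text), not just the first.
    $inner = '';
    foreach ($outer->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}

echo firstDivInnerHtml('<div class="outer">text <b>bold</b><div>nested</div></div>'), "\n";
```

This does not address the encoding concern raised in the comments; it only shows how to collect all child nodes.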

If you really want a regex solution, then use:

~<div[^>]*>(.*)</div>~is

and grab capture group #1.
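
A minimal sketch of how that pattern might be used with `preg_match`, taking the sample markup from the question as input. Note that the greedy `(.*)` runs to the last `</div>` in the input, so this assumes the string holds a single outermost div.

```php
<?php
$html = '<div class="my-class additional-class"><div class="my-class2">
<div class="my-class"></div>
</div>

</div>';

// i: case-insensitive tag names; s: let "." match newlines as well.
if (preg_match('~<div[^>]*>(.*)</div>~is', $html, $matches)) {
    // Everything between the first <div ...> and the last </div>.
    $inner = $matches[1];
    echo trim($inner), "\n";
}
```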

anubhava
  • Yep, that's exactly what I wanted. I don't know why I couldn't think of this :) Thank you – MeCe May 11 '21 at 07:18

Instead of .*, you should use [\s\S]* to match every character, including newlines.

Here's a working example:

<div.*?>([\s\S]*)<\/div>

See the test case
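
To illustrate the difference, here is a small sketch comparing `.` and `[\s\S]` on the markup from the question (the variable names are illustrative). Without the `s` modifier, `.` stops at line breaks, so the first pattern can only match a div that opens and closes on one line.

```php
<?php
$html = '<div class="my-class additional-class"><div class="my-class2">
<div class="my-class"></div>
</div>

</div>';

// "." does not match newlines by default, so only the single-line inner div matches.
preg_match('~<div.*?>(.*)<\/div>~', $html, $dot);

// [\s\S] matches any character, newlines included, so the outer div is captured.
preg_match('~<div.*?>([\s\S]*)<\/div>~', $html, $any);

var_dump($dot[1]);        // empty string: the content of <div class="my-class"></div>
echo trim($any[1]), "\n"; // the nested my-class2 block
```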


Also, if you want the tags to be balanced, you could try this with recursion (?R):

<div.*?>((?:(?!<\/?div)[\s\S]|(?R))*)<\/div>

See the test case; notice that it doesn't match the last </div>, since there's no corresponding opening tag for it.
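
A sketch of the recursive pattern in PHP, using a made-up input with one unmatched closing tag. `(?R)` re-applies the whole pattern at the current position, so each nested `<div>` has to find its own `</div>` before the outer one can close.

```php
<?php
// Illustrative input: a balanced outer div followed by a stray </div>.
$html = '<div class="a"><div class="b">x</div></div></div>';

$pattern = '~<div.*?>((?:(?!<\/?div)[\s\S]|(?R))*)<\/div>~';

if (preg_match($pattern, $html, $m)) {
    echo $m[0], "\n"; // the balanced match, without the stray </div>
    echo $m[1], "\n"; // the content of the outer div
}
```

On large inputs a pattern like this backtracks heavily, so it is worth checking whether `preg_match` returned `false` (an error) rather than `0` (no match).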

Hao Wu

Maybe you should use a non-greedy solution:

<div.*?>(.*)</div>
Jiri Fornous