Extract Content of a Div with PHP and Regex

Question

I'm trying to extract the content of a div with PHP, independent of its class name and other attributes.

I have a div that may be multiline or single-line and may carry multiple attributes, such as


<div class="my-class additional-class"><div class="my-class2">
<div class="my-class"></div>
</div>

</div>

and I would like to get all the content of the first div, without the first div.

<div class="my-class2">
<div class="my-class"></div>
</div>

Normally, I'd guess

/<div.*>(.*)<\/div>/mU

should have worked but I'm not sure why it doesn't.

I came across this one

(?s)(?<=<div\sclass="test">\n).*(?=<\/div>)

which works when the class name is test, but I couldn't make it work for an arbitrary div as

(?s)(?<=<div.*>\n)(.*)(?=<\/div>)

Any help is appreciated.

Thank you,

DOM parser is my second option. I don't think it would work as well as regex in this matter. — MeCe, May 11 '21 at 06:51
And why don't you think it would work as well as a regex in this matter? Considering that regular expressions are generally not capable enough to process the HTML language? And considering that a DOM parser _is_ capable of that? — arkascha, May 11 '21 at 07:15
DOM parser doesn't work and needs time to figure out the correct encoding in some cases. You would need to figure out `mb_detect_encoding` and `mb_convert_encoding`. Also some users don't install XML on their server. — MeCe, May 11 '21 at 07:29
Wrong dupe, as this question asks for the content of the outermost `div` only, and none of the answers there resembles the one provided below. — anubhava, May 12 '21 at 16:13

score 5 · Accepted Answer · answered May 11 '21 at 07:10

Here is a way to get it using DOM parser:

<?php
$html = '<div class="my-class additional-class"><div class="my-class2">
<div class="my-class"></div>
</div>
</div>';

$doc = new DOMDocument();
$doc->loadHTML($html); // loads your html
$elems = $doc->getElementsByTagName('div'); // find all div elements
$outerdiv = $elems->item(0); // outermost div
echo $outerdiv->childNodes[0]->C14N() . "\n"; // print the first child node (the inner div), which is the whole inner HTML here

/*
<div class="my-class2">
<div class="my-class"></div>
</div>
*/
?>
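The snippet above prints only the first child node, which happens to cover the whole inner HTML for this input. A minimal sketch of a more general variant follows, assuming the DOM extension is installed. `innerHtmlOfFirstDiv` is a hypothetical helper name, not part of the answer. It concatenates every child of the first `div` using `saveHTML()`.

```php
<?php
// Hypothetical helper (not from the answer): returns the inner HTML of the
// first <div> in document order, or null when the input has no <div>.
function innerHtmlOfFirstDiv(string $html): ?string
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of emitting PHP warnings
    // for imperfect markup, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $outer = $doc->getElementsByTagName('div')->item(0);
    if ($outer === null) {
        return null;
    }

    // Serialize each child node (elements and text nodes alike).
    $inner = '';
    foreach ($outer->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}

$html = '<div class="my-class additional-class"><div class="my-class2">
<div class="my-class"></div>
</div>
</div>';

echo trim((string) innerHtmlOfFirstDiv($html)), "\n";
```

Looping over `childNodes` keeps sibling elements and text that a single `childNodes[0]` lookup would drop.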

If you really want a regex solution then use:

~<div[^>]*>(.*)</div>~is

and grab capture group #1.
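As a rough sketch of how that pattern could be applied with `preg_match`: the helper name `outerDivContent` is made up for illustration, and the input is the markup from the question.

```php
<?php
// Hypothetical helper wrapping the pattern from the answer.
function outerDivContent(string $html): ?string
{
    // i: match tag names case-insensitively; s: let . match newlines too.
    if (preg_match('~<div[^>]*>(.*)</div>~is', $html, $matches) === 1) {
        return $matches[1]; // capture group #1
    }
    return null; // no match (or a PCRE error)
}

$html = <<<'HTML'
<div class="my-class additional-class"><div class="my-class2">
<div class="my-class"></div>
</div>

</div>
HTML;

echo trim((string) outerDivContent($html)), "\n";
```

Because `.*` is greedy, the capture runs up to the last `</div>` in the input. That suits a string holding one outer div, but it would also swallow any sibling divs that follow it.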

Yep, that's exactly what I wanted. I don't know why I couldn't think of this :) Thank you — MeCe, May 11 '21 at 07:18

score 3 · Answer 2 · answered May 11 '21 at 07:18

Instead of .*, you should use [\s\S]* to match every character including new lines.

Here's a working example:

<div.*?>([\s\S]*)<\/div>

See the test case
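A small sketch of the difference, using made-up sample markup: without the `s` flag `.` stops at newlines, while `[\s\S]` (or `.` combined with the `s` flag) spans them.

```php
<?php
// Sample input invented for this illustration.
$html = "<div class=\"a\">\nline one\nline two\n</div>";

// Plain . cannot cross a newline, so this pattern finds no match.
$plainDot = preg_match('~<div.*?>(.*)</div>~', $html, $m1);

// [\s\S] matches any character, newlines included.
$charClass = preg_match('~<div.*?>([\s\S]*)</div>~', $html, $m2);

// The s (dot-all) flag gives . the same reach.
$dotAll = preg_match('~<div.*?>(.*)</div>~s', $html, $m3);

var_dump($plainDot, $charClass, $dotAll);
var_dump($m2[1] === $m3[1]);
```

The `m` flag used in the question only changes how `^` and `$` behave; it does not make `.` match newlines.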

Also, if the tags must be balanced, you could try this pattern, which uses recursion via (?R):

<div.*?>((?:(?!<\/?div)[\s\S]|(?R))*)<\/div>

See the test case. Note that it does not match the last </div>, since there is no corresponding opening tag for it.
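A sketch of the recursive pattern in PHP, on an invented input with one stray closing tag. `(?R)` re-enters the whole pattern, so each nested `<div>` has to find its own `</div>` before the outer one can close.

```php
<?php
// Sample input invented for this illustration: one unmatched </div> at the end.
$html = '<div><div>a</div></div></div>';

// Same pattern as above, with ~ as the delimiter so / needs no escaping.
$pattern = '~<div.*?>((?:(?!</?div)[\s\S]|(?R))*)</div>~';

if (preg_match($pattern, $html, $m) === 1) {
    echo $m[0], "\n"; // the balanced outer div, without the stray </div>
    echo $m[1], "\n"; // its content
}
```

For comparison, the greedy `~<div[^>]*>(.*)</div>~is` pattern would capture `<div>a</div></div>` here, because it simply runs to the last `</div>`.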

edited May 11 '21 at 07:28

answered May 11 '21 at 07:18

Hao Wu

Wow, I didn't know `?R`, thank you. This works great :) – MeCe May 11 '21 at 07:25

score 0 · Answer 3 · answered May 11 '21 at 07:11

Maybe you should use a non-greedy solution:

<div.*?>(.*)</div>
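A minimal sketch of what the lazy quantifier changes, on invented single-line markup. A greedy `.*` inside the opening tag runs past the first `>` and swallows the nested tags, whereas `.*?` stops at the first `>`.

```php
<?php
// Single-line sample invented for this illustration.
$html = '<div class="a"><div class="b">x</div></div>';

// Greedy: <div.*> extends to a later >, leaving nothing for the group.
preg_match('~<div.*>(.*)</div>~', $html, $greedy);

// Lazy: <div.*?> ends at the first >, so the group holds the inner div.
preg_match('~<div.*?>(.*)</div>~', $html, $lazy);

var_dump($greedy[1]); // empty string
var_dump($lazy[1]);   // the inner div
```

For multiline input the capture group would still need the `s` flag or `[\s\S]`, since `.` on its own does not match newlines.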

answered May 11 '21 at 07:11

Jiri Fornous
